Lua UTF-8 string operations: substring and indexing

First, an explanation quoted from the web:

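UTF-8 is a variable-length byte encoding. If a character's UTF-8 encoding is a single byte, its most significant bit is 0. If it is multi-byte, the number of consecutive 1 bits starting from the most significant bit of the first byte gives the number of bytes in the encoding, and every remaining byte starts with 10. The original UTF-8 design allows up to 6 bytes per character; the current standard (RFC 3629) restricts it to 4.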

As a table:
1 byte  0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
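Thus at most 31 bits in UTF-8 can carry the character code, namely the bits marked x in the table above. Apart from the control bits (such as the 10 at the start of each trailing byte), the x bits correspond one-to-one with the Unicode code point, in the same bit order.
To convert a Unicode code point to UTF-8, first strip its leading zero bits, then pick the shortest UTF-8 form that can hold the remaining bits.
Consequently, characters in the basic ASCII set (Unicode is ASCII-compatible) need only a single UTF-8 byte (7 bits).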
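
The conversion rule above can be sketched in Lua as follows. This is only an illustration, not part of the quoted text: the name `encodeCodePoint` is made up for this example, it uses plain arithmetic so it does not depend on a bit library, it stops at the 4-byte form, and it does no validation of its input.

```lua
-- Hypothetical helper: encode a Unicode code point as UTF-8 bytes.
-- Division and modulo by powers of two stand in for bit shifts and masks.
local function encodeCodePoint(cp)
    if cp < 0x80 then
        return string.char(cp)                               -- 0xxxxxxx
    elseif cp < 0x800 then
        return string.char(192 + math.floor(cp / 64),        -- 110xxxxx
                           128 + cp % 64)                    -- 10xxxxxx
    elseif cp < 0x10000 then
        return string.char(224 + math.floor(cp / 4096),      -- 1110xxxx
                           128 + math.floor(cp / 64) % 64,   -- 10xxxxxx
                           128 + cp % 64)                    -- 10xxxxxx
    else
        return string.char(240 + math.floor(cp / 262144),    -- 11110xxx
                           128 + math.floor(cp / 4096) % 64, -- 10xxxxxx
                           128 + math.floor(cp / 64) % 64,   -- 10xxxxxx
                           128 + cp % 64)                    -- 10xxxxxx
    end
end

print(encodeCodePoint(0x41))                      --> A
print(encodeCodePoint(0x4E2D) == "\228\184\173")  --> true (the character 中)
```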

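For the question the quoted source was answering, the two bytes given in the code are:
Hexadecimal: C0 B1
Binary: 11000000 10110001
Compare with the two-byte form:
110xxxxx 10xxxxxx
Extracting the corresponding Unicode bits gives:
00000 110001
This is not a "standard" UTF-8 encoding: the payload bits of the first byte are all 0, and after stripping the leading zeros only 6 bits remain (110001, decimal 49, the ASCII digit "1"). As described above, this character needs only a single UTF-8 byte, so the two-byte form is an overlong encoding.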
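
As a rough check of the arithmetic above, the sketch below decodes a two-byte sequence and flags it as overlong when the value would have fit in one byte. The function name `decodeTwoBytes` is an assumption made for this example, and it presumes the inputs already match the 110xxxxx 10xxxxxx form.

```lua
-- Hypothetical helper: decode 110xxxxx 10xxxxxx and report whether the
-- result is overlong (below 128, so a single byte would have been enough).
local function decodeTwoBytes(b1, b2)
    local value = (b1 - 192) * 64 + (b2 - 128)
    local overlong = value < 128
    return value, overlong
end

local value, overlong = decodeTwoBytes(0xC0, 0xB1)
print(value, overlong)     --> 49   true
print(string.char(value))  --> 1
```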

The decimal values corresponding to the table above:

1 byte  0xxxxxxx: minimum 00000000 (decimal 0), maximum 01111111 (decimal 127).

2 bytes 110xxxxx 10xxxxxx: the first byte has minimum 11000000 (every x set to 0, decimal 192) and maximum 11011111 (decimal 223). The other bytes range over 10000000 to 10111111, decimal 128 to 191.

3 bytes 1110xxxx 10xxxxxx 10xxxxxx: the first byte has minimum 11100000 (decimal 224). The other bytes range over 10000000 to 10111111, decimal 128 to 191.

4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: the first byte has minimum 11110000 (decimal 240). The other bytes range over 10000000 to 10111111, decimal 128 to 191.

5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx: the first byte has minimum 11111000 (decimal 248). The other bytes range over 10000000 to 10111111, decimal 128 to 191.

6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx: the first byte has minimum 11111100 (decimal 252). The other bytes range over 10000000 to 10111111, decimal 128 to 191.

From this we get the value range of the first byte:

1 byte  00000000 (0) to 01111111 (127)

2 bytes 11000000 (192) to 11011111 (223)

3 bytes 11100000 (224) to 11101111 (239)

4 bytes 11110000 (240) to 11110111 (247)

5 bytes 11111000 (248) to 11111011 (251)

6 bytes 11111100 (252) to 11111101 (253)

Other bytes 10000000 (128) to 10111111 (191)
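
To make the ranges concrete, here is a small sketch that labels every byte of a string by the range it falls in. The helper name `describeBytes` and the labels are assumptions for illustration; the sample string is written with decimal escapes so it does not depend on the encoding of the source file.

```lua
-- Hypothetical helper: label each byte as ASCII, a head byte (192 and up)
-- or a tail byte (128..191), following the ranges listed above.
local function describeBytes(s)
    local parts = {}
    for i = 1, #s do
        local b = string.byte(s, i)
        local role
        if b < 128 then
            role = "ascii"
        elseif b < 192 then
            role = "tail"
        else
            role = "head"
        end
        parts[#parts + 1] = b .. ":" .. role
    end
    return table.concat(parts, " ")
end

-- "a" followed by the three-byte character 中 (228 184 173)
print(describeBytes("a\228\184\173"))
--> 97:ascii 228:head 184:tail 173:tail
```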

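Implementing UTF-8 operations in Lua

Approach: as shown above, a character's encoding can be split into a head (the first byte of the encoding) and a tail (all bytes after the first).

Determining how many bytes a character occupies, from its first byte:

1 byte: the first byte is in the range 0 to 127

2 bytes: the first byte is in the range 192 to 223

3 bytes: the first byte is in the range 224 to 239

4 bytes: the first byte is in the range 240 to 247

5 bytes: the first byte is in the range 248 to 251

6 bytes: the first byte is 252 or greater

One more key point: a byte in the decimal range 128 to 191 is a tail byte, that is, part of the current character rather than the start of a new one.

The code is as follows: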

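-- Return the number of bytes a character occupies, given its first byte.
-- Returns 0 for a tail byte (128..191), which never starts a character.
function byteNumber(coding)
    if coding <= 127 then
        return 1
    elseif coding < 192 then
        return 0
    elseif coding < 224 then
        return 2
    elseif coding < 240 then
        return 3
    elseif coding < 248 then
        return 4
    elseif coding < 252 then
        return 5
    else
        return 6
    end
end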
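
One way to use such a length lookup is to jump from head byte to head byte instead of inspecting every byte. The sketch below is self-contained, so it repeats the lookup under the made-up name `headLength`; `splitChars` is likewise a name chosen only for this example, and treating a stray tail byte as a unit of its own is an assumption, not something the article specifies.

```lua
-- Hypothetical copy of the length lookup, kept local to this example.
local function headLength(b)
    if b < 128 then return 1
    elseif b < 192 then return 0
    elseif b < 224 then return 2
    elseif b < 240 then return 3
    elseif b < 248 then return 4
    elseif b < 252 then return 5
    else return 6 end
end

-- Split a string into characters by stepping over whole encodings.
local function splitChars(s)
    local chars = {}
    local i = 1
    while i <= #s do
        local len = headLength(string.byte(s, i))
        if len == 0 then len = 1 end  -- stray tail byte: emit it on its own
        chars[#chars + 1] = string.sub(s, i, i + len - 1)
        i = i + len
    end
    return chars
end

local chars = splitChars("a\228\184\173b")
print(#chars)  --> 3
```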

-- Return the substring from the n-th to the le-th character (inclusive).
-- Returns nil if s is not a string.
function string.utf8Sub(s, n, le)
    if type(s) ~= "string" then
        return nil
    end
    -- missing or invalid indices fall back to 1
    if type(n) ~= "number" or n < 1 then
        n = 1
    end
    if type(le) ~= "number" or le < 1 then
        le = 1
    end
    local index = 0
    local startIndex = 0
    local endIndex = 0
    for i = 1, #s do
        local coding = string.byte(s, i)
        -- tail bytes (128..191) belong to the current character; skip them
        if coding < 128 or coding >= 192 then
            index = index + 1
            if index == n then
                startIndex = i
            end
            if index == le then
                endIndex = i + byteNumber(coding) - 1
                break
            end
        end
    end
    if startIndex == 0 then
        return ""  -- n is past the end of the string, or le < n
    end
    if endIndex == 0 then
        endIndex = #s  -- le is past the end: clamp to the end of the string
    end
    return string.sub(s, startIndex, endIndex)
end
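
For comparison, the same kind of substring can be sketched with a Lua pattern that matches one non-tail byte followed by any number of tail bytes. This is an alternative technique, not the article's method; `utf8SubByPattern` and its parameter names are assumptions made for this example.

```lua
-- Hypothetical alternative: collect characters first..last with gmatch.
-- "[^\128-\191]" matches a head or ASCII byte, "[\128-\191]*" its tail bytes.
local function utf8SubByPattern(s, first, last)
    local chars = {}
    local index = 0
    for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
        index = index + 1
        if index > last then
            break
        end
        if index >= first then
            chars[#chars + 1] = ch
        end
    end
    return table.concat(chars)
end

-- characters 2..3 of "a", 中, "b"
print(utf8SubByPattern("a\228\184\173b", 2, 3) == "\228\184\173b")  --> true
```

Building a table of characters costs more memory than tracking two byte offsets, so the offset approach used in the article is likely the better fit for long strings.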

-- Return the n-th character, or nil if s is not a string
-- or has fewer than n characters.
function string.utf8Index(s, n)
    if type(s) ~= "string" then
        return nil
    end
    if type(n) ~= "number" or n < 1 then
        n = 1
    end
    local index = 0
    for i = 1, #s do
        local coding = string.byte(s, i)
        -- only head bytes start a new character
        if coding < 128 or coding >= 192 then
            index = index + 1
            if index == n then
                return string.sub(s, i, i + byteNumber(coding) - 1)
            end
        end
    end
    return nil
end

-- Return the length of the string in characters, or nil if s is not a string.
function string.utf8Len(s)
    if type(s) ~= "string" then
        return nil
    end
    local index = 0
    for i = 1, #s do
        local coding = string.byte(s, i)
        -- count every byte that is not a tail byte (128..191)
        if coding < 128 or coding >= 192 then
            index = index + 1
        end
    end
    return index
end
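
The counting rule (every byte outside 128 to 191 starts a character) can also be written as a single `string.gsub` call, whose second return value is the number of matches. The name `utf8LenByPattern` is an assumption for this sketch. On Lua 5.3 or later the standard `utf8` library provides `utf8.len`, so the example calls it only when that library is present.

```lua
-- Hypothetical alternative: count the bytes that are not tail bytes.
local function utf8LenByPattern(s)
    local _, count = string.gsub(s, "[^\128-\191]", "")
    return count
end

print(utf8LenByPattern("a\228\184\173b"))  --> 3

-- The utf8 library exists only in Lua 5.3+; it is nil in Lua 5.1 and LuaJIT.
if utf8 then
    print(utf8.len("a\228\184\173b"))      --> 3
end
```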

-- The following are method versions that do not need the string passed in explicitly.
-- e.g. local str = "截取字符串"  str = str:utf8SelfSub(1, 2)  -- str is now "截取"
function string:utf8SelfSub(n, le)
    if type(self) ~= "string" then
        return nil
    end
    -- missing or invalid indices fall back to 1
    if type(n) ~= "number" or n < 1 then
        n = 1
    end
    if type(le) ~= "number" or le < 1 then
        le = 1
    end
    local index = 0
    local startIndex = 0
    local endIndex = 0
    for i = 1, #self do
        local coding = string.byte(self, i)
        -- tail bytes (128..191) belong to the current character; skip them
        if coding < 128 or coding >= 192 then
            index = index + 1
            if index == n then
                startIndex = i
            end
            if index == le then
                endIndex = i + byteNumber(coding) - 1
                break
            end
        end
    end
    if startIndex == 0 then
        return ""  -- n is past the end of the string, or le < n
    end
    if endIndex == 0 then
        endIndex = #self  -- le is past the end: clamp to the end of the string
    end
    return string.sub(self, startIndex, endIndex)
end

-- Return the n-th character of self, or nil if there is none.
function string:utf8SelfIndex(n)
    if type(self) ~= "string" then
        return nil
    end
    if type(n) ~= "number" or n < 1 then
        n = 1
    end
    local index = 0
    for i = 1, #self do
        local coding = string.byte(self, i)
        -- only head bytes start a new character
        if coding < 128 or coding >= 192 then
            index = index + 1
            if index == n then
                return string.sub(self, i, i + byteNumber(coding) - 1)
            end
        end
    end
    return nil
end

-- Return the length of self in characters.
function string:utf8SelfLen()
    if type(self) ~= "string" then
        return nil
    end
    local index = 0
    for i = 1, #self do
        local coding = string.byte(self, i)
        -- count every byte that is not a tail byte (128..191)
        if coding < 128 or coding >= 192 then
            index = index + 1
        end
    end
    return index
end
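
A note on calling conventions, shown as a small sketch: string values look up methods in the `string` table, so a function defined there with dot syntax and an explicit first parameter can also be called with colon syntax. The helper `utf8CountDemo` is a made-up name used only to demonstrate this.

```lua
-- Hypothetical helper defined with dot syntax; s is passed explicitly.
function string.utf8CountDemo(s)
    local count = 0
    for i = 1, #s do
        local b = string.byte(s, i)
        if b < 128 or b >= 192 then
            count = count + 1
        end
    end
    return count
end

local str = "a\228\184\173b"
print(string.utf8CountDemo(str))  --> 3
print(str:utf8CountDemo())        --> 3 (same function, colon call)
```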
