Why does UTF-8 encoded text have extra characters?

Time: 2023-01-05 21:07:30

Data is coming to my app via an XML with utf-8 encoded data. The text that the user inputs is saved in the XML and then my app reads it.

Recently it failed when the user wrote one special character at the end. The result is that in the XML every character has an extra 0x40 character before it. So instead of receiving:

67 6f 20 61 68 65 61 64 (go ahead)

it received:

40 67 40 6f 40 20 40 61 40 68 40 65 40 61 40 64 (@g@o@ @a@h@e@a@d)

what went wrong?

0x40 in binary is 01000000, which makes me think that the 1 is some sort of control bit and that the text came in a different encoding...
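One quick way to inspect the two dumps above is to parse them as hex and compare them. The sketch below is illustrative only (the variable names are made up here). It suggests that each 0x40 is a separate whole byte, the ASCII/UTF-8 character '@', rather than a bit set inside each character, and that dropping every other byte recovers the expected text.

```python
expected = bytes.fromhex("67 6f 20 61 68 65 61 64")
received = bytes.fromhex("40 67 40 6f 40 20 40 61 40 68 40 65 40 61 40 64")

# Both dumps contain only bytes below 0x80, so they decode cleanly as UTF-8.
print(expected.decode("utf-8"))
print(received.decode("utf-8"))

# Even-indexed bytes are the extra ones; odd-indexed bytes are the text.
extra = received[0::2]
payload = received[1::2]
print(set(extra) == {0x40}, payload == expected)
```

This only describes the symptom; it does not say which component inserted the extra bytes.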

2 answers

#1


0  

If I understand correctly, you are saying the payload is a string of supposedly UTF-8 bytes, i.e.

40 62 20 C6 40 62

But this is not valid UTF-8. The C6 corrupts it. In UTF-8 a byte of 0x80 or above is never valid on its own; it must be part of a multi-byte sequence. You can see this if you paste the above (space-separated bytes) into my little conversion utility (use the UTF-8 bytes field).
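The same check can be done locally without the web utility. This is a minimal sketch using Python's built-in strict UTF-8 decoder; the variable names are only for illustration.

```python
data = bytes.fromhex("40 62 20 C6 40 62")

try:
    text = data.decode("utf-8")  # strict error handling is the default
    error = None
except UnicodeDecodeError as exc:
    text = None
    error = exc

if error is not None:
    # error.start is the index of the first offending byte.
    bad = data[error.start]
    print(f"invalid UTF-8 at byte {error.start}: 0x{bad:02X} ({error.reason})")
```

The decoder stops at the C6 because the byte after it (40) is not a continuation byte.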

http://sodved.awardspace.info/unicode.pl

So I suspect whichever tool/library you are using is encountering the invalid UTF-8 data and is then trying some other way of processing it. In none of the standard single-byte encodings is C6 a curly quote. And C6 is not valid in GSM 7-bit (http://www.developershome.com/sms/gsmAlphabet.asp).
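As a spot check of the single-byte claim, the sketch below looks at just two common encodings, Latin-1 and Windows-1252, using Python's built-in codecs. It is not a survey of every single-byte encoding, and the choice of these two is an assumption made for illustration.

```python
byte = b"\xc6"

# What 0xC6 means in two common single-byte encodings.
for name in ("latin-1", "cp1252"):
    print(name, repr(byte.decode(name)))

# Where the four curly quotes live in Windows-1252.
curly = "\u2018\u2019\u201c\u201d"
curly_bytes = curly.encode("cp1252")
print(curly_bytes.hex(" "))
```

In both encodings C6 is the letter Æ (U+00C6), while the Windows-1252 curly quotes sit at 91 to 94.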

So your real problem is: what is it doing there? Are you sure about the encoding of the payload? Even in the GSM7 default alphabet, without the C6, the result seems weird:

¡b ¡b
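The line above is what the bytes 40 62 20 40 62 give when read as unpacked GSM 7-bit values. Here is a rough sketch of that lookup, assuming one septet per byte and using a deliberately partial table; `GSM7_PARTIAL` and `decode_gsm7_unpacked` are hypothetical names, not part of any library.

```python
# Partial GSM 03.38 default-alphabet table: only the entries needed here.
# Space and lowercase letters share their ASCII values; 0x40 is the
# inverted exclamation mark, and '@' sits at 0x00 instead.
GSM7_PARTIAL = {0x00: "@", 0x20: " ", 0x40: "\u00a1"}
GSM7_PARTIAL.update({code: chr(code) for code in range(0x61, 0x7B)})

def decode_gsm7_unpacked(data: bytes) -> str:
    """Decode one-septet-per-byte data with the partial table above."""
    return "".join(GSM7_PARTIAL.get(b, "?") for b in data)

septets = bytes.fromhex("40 62 20 40 62")
print(decode_gsm7_unpacked(septets))
```

A real decoder would need the full alphabet, the extension table and septet unpacking, none of which are shown here.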

#2


0  

The bytes 40 62 20 C6 40 62 are not valid UTF-8! The problem is the orphaned start byte C6. In binary C6 is 11000110, so it is the start byte of a 2-byte sequence (it begins with 110, and the remaining 5 bits, 00110, are payload bits of the code point). But the continuation byte that must follow the start byte is missing, so this is an illegal 2-byte sequence! Possibly the bytes are NOT UTF-8 encoded and the C6 is an ANSI character, e.g. a single-byte character. However, C6 is higher than 127 and so is not an ASCII character. Every character higher than 127 must be encoded as a proper multi-byte UTF-8 sequence when encoding to UTF-8.
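The bit patterns described above can be checked byte by byte. The following is a sketch with a hypothetical helper, `classify`, that looks only at the leading bits; it does not reject overlong or out-of-range lead bytes the way a full validator would.

```python
def classify(byte: int) -> str:
    """Classify a byte by its UTF-8 role, using the leading bits only."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx
    if byte < 0xE0:
        return "lead-2"         # 110xxxxx
    if byte < 0xF0:
        return "lead-3"         # 1110xxxx
    if byte < 0xF8:
        return "lead-4"         # 11110xxx
    return "invalid"

data = bytes.fromhex("40 62 20 C6 40 62")
for i, b in enumerate(data):
    print(i, f"0x{b:02X}", f"{b:08b}", classify(b))

# Payload bits carried by the lead byte C6 (the low five bits).
lead_payload = 0xC6 & 0b00011111
print(f"{lead_payload:05b}")
```

C6 is classified as a 2-byte lead, but the byte after it (40) is plain ASCII and not a 10xxxxxx continuation, which is why the sequence is rejected.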
