How to decode a UTF-8 string representation in Python?

Time: 2023-01-04 19:52:32

I have a unicode string like this:

\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7

And I know it is the string representation of bytes encoded with UTF-8.

Note that the string \xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7 itself is <type 'unicode'>

How do I decode it to the real string 山东 日照?

1 solution

#1

If you printed the repr() output of your unicode string, then you appear to have a Mojibake: bytes data decoded using the wrong encoding.
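
As an illustration (this is an assumption about how the data most likely arose, not something stated in the question), decoding the UTF-8 bytes for 山东 日照 with Latin-1 produces exactly the unicode string shown above:

>>> '\xe5\xb1\xb1\xe4\xb8\x9c \xe6\x97\xa5\xe7\x85\xa7'.decode('latin1')
u'\xe5\xb1\xb1\xe4\xb8\x9c \xe6\x97\xa5\xe7\x85\xa7'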

First encode back to bytes, then decode using the right codec. This may be as simple as encoding as Latin-1:

unicode_string.encode('latin1').decode('utf8')
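
To make the intermediate step visible, here is a quick sketch (Python 2, matching the rest of the answer): Latin-1 maps the code points U+0000..U+00FF one-to-one back to the bytes 0x00..0xFF, which restores the original UTF-8 byte sequence.

>>> unicode_string = u'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> unicode_string.encode('latin1')             # each code point becomes the byte of the same value
'\xe5\xb1\xb1\xe4\xb8\x9c \xe6\x97\xa5\xe7\x85\xa7'
>>> unicode_string.encode('latin1').decode('utf8')
u'\u5c71\u4e1c \u65e5\u7167'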

This depends on how the incorrect decoding was applied, however. If a Windows codepage (like CP1252) was used, you can end up with Unicode data that cannot actually be encoded back to CP1252, because UTF-8 bytes outside the CP1252 range were force-decoded anyway.
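
To illustrate that pitfall with a hypothetical example (the character 青 is not part of the question's data; it is chosen only because its UTF-8 encoding contains the byte 0x9D, which has no CP1252 mapping): a forced decode turns that byte into U+FFFD, and the result can no longer be encoded back to CP1252.

>>> utf8_bytes = u'\u9752'.encode('utf8')       # 青 -> '\xe9\x9d\x92'; 0x9D is undefined in CP1252
>>> mojibake = utf8_bytes.decode('cp1252', 'replace')
>>> mojibake
u'\xe9\ufffd\u2019'
>>> mojibake.encode('cp1252')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufffd' in position 1: character maps to <undefined>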

The best way to repair such mistakes is to use the ftfy library, which knows how to deal with force-decoded Mojibake texts for a variety of codecs.

For your small sample, Latin-1 appears to work just fine:

>>> unicode_string = u'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> print unicode_string.encode('latin1').decode('utf8')
山东 日照
>>> import ftfy
>>> print ftfy.fix_text(unicode_string)
山东 日照

If you have the literal characters \, x followed by two hex digits, you have another layer of encoding, where each byte was replaced by 4 characters. You'd have to 'decode' those to actual bytes first, by asking Python to interpret the escapes with the string_escape codec:

>>> unicode_string = ur'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> unicode_string
u'\\xE5\\xB1\\xB1\\xE4\\xB8\\x9C \\xE6\\x97\\xA5\\xE7\\x85\\xA7'
>>> print unicode_string.decode('string_escape').decode('utf8')
山东 日照

'string_escape' is a Python 2-only codec that produces a bytestring, so it is safe to decode that as UTF-8 afterwards.
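
As a side note that goes beyond the original answer: in Python 3 the string_escape codec no longer exists. A sketch of the same repair, assuming the escapes arrive as a regular str, can use the unicode_escape codec together with a Latin-1 round trip:

>>> s = r'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> escaped = s.encode('latin1')                # the literal backslash escapes, as bytes
>>> raw = escaped.decode('unicode_escape')      # interpret \xNN as the code points U+00NN
>>> print(raw.encode('latin1').decode('utf8'))
山东 日照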
