How do I convert a string to bytes in Python 2?

Time: 2023-01-04 20:39:32

I know this may sound like a duplicate question, but that's because I don't know how to describe the problem properly.

For some reason I ended up with a bunch of unicode strings like this:

a = u'\xcb\xea'

As you can see, it's actually the byte representation of a Chinese character encoded in GBK:

>>> print(b'\xcb\xea'.decode('gbk'))
岁

u'岁' is what I need, but I don't know how to convert u'\xcb\xea' to b'\xcb\xea'.
Any suggestions?

2 solutions

#1


4  

It's not really a bytes representation; it's still Unicode code points. They are the wrong code points, because the text was decoded from bytes as if those bytes had been encoded as Latin-1.
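
To see how such a string can arise, here is a minimal sketch. It assumes the original GBK bytes were decoded with the Latin-1 codec somewhere upstream; the variable names `raw`, `wrong`, and `right` are illustrative only.

```python
# -*- coding: utf-8 -*-
# Assumption: the raw GBK bytes were decoded as Latin-1 somewhere upstream.
raw = b'\xcb\xea'              # GBK encoding of a single Chinese character

wrong = raw.decode('latin1')   # each byte becomes the code point with the same value
right = raw.decode('gbk')      # the text that was actually intended

print(repr(wrong))
print(repr(right))
print([hex(ord(c)) for c in wrong])
```

Latin-1 never fails to decode, since every byte value corresponds to a code point, which is why the mistake goes unnoticed until the text is displayed.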

Encode to Latin-1 (whose code points map one-to-one to bytes), then decode as GBK:

a.encode('latin1').decode('gbk')

Demo:

>>> a = u'\xcb\xea'
>>> a.encode('latin1').decode('gbk')
u'\u5c81'
>>> print a.encode('latin1').decode('gbk')
岁
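
If many strings need the same repair, the two steps above could be wrapped in a small helper. This is only a sketch: `fix_mojibake` and its parameter names are hypothetical, and falling back to the unchanged input on failure is an assumption about what the caller wants.

```python
# -*- coding: utf-8 -*-
def fix_mojibake(text, wrong_codec='latin1', right_codec='gbk'):
    """Re-decode text that was decoded with the wrong codec.

    Hypothetical helper: returns the input unchanged when the
    round trip is not possible.
    """
    try:
        # Recover the original bytes, then decode them as intended.
        return text.encode(wrong_codec).decode(right_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


a = u'\xcb\xea'
fixed = fix_mojibake(a)
print(repr(fixed))
```

Text that is already correct, such as u'\u5c81', cannot be encoded as Latin-1, so the helper leaves it alone rather than raising.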

#2


0  

The simplest way in Python 2 is to use repr():

>>> key_unicode = u'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
>>> key_ascii = 'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
>>> print(key_ascii)
uuuu��_��9ԣ�
>>> print(key_unicode)
uuuuö_¡ë9Ô£Ñ
>>>
>>> # here is the safe method for both string types:
>>> print(repr(key_ascii).lstrip('u')[1:-1])
uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
>>> print(repr(key_unicode).lstrip('u')[1:-1])
uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
>>> # ____________WARNING!______________
>>> # if you use just `str.strip('u\'\"')`, you will lose
>>> # the "uuuu" (and any quotes) at the ends of the string:
>>> print(repr(key_unicode).strip('u\'\"'))
\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1

For Python 3, use str.encode() to get the bytes type.

>>> key = 'l\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1q\xf5L\xa9\xdd0\x90\x8b\xf5ht\x86za\x0e\x1b\xed\xb6(\xaa+'
>>> key
'lö\x9f_¡\x05ë9Ô£ÑqõL©Ý0\x90\x8bõht\x86za\x0e\x1bí¶(ª+'
>>> print(key)
lö_¡ë9Ô£ÑqõL©Ý0õhtzaí¶(ª+
>>> print(repr(key.encode()).lstrip('b')[1:-1])
l\xc3\xb6\xc2\x9f_\xc2\xa1\x05\xc3\xab9\xc3\x94\xc2\xa3\xc3\x91
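
Note that str.encode() with no argument uses UTF-8 on Python 3, so the resulting bytes differ from the escape values in the string, as the output above shows. If the goal is bytes with the same numeric values as the code points, naming Latin-1 explicitly may be closer to what is wanted. A short sketch with a shortened, illustrative key:

```python
# -*- coding: utf-8 -*-
key = u'l\xf6\x9f_\xa1'   # shortened sample, for illustration only

utf8_bytes = key.encode('utf-8')       # the Python 3 default for key.encode()
latin1_bytes = key.encode('latin-1')   # one byte per code point, same values

print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

Latin-1 only covers code points up to U+00FF, so this raises UnicodeEncodeError for any string containing characters outside that range.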
