Best way to convert a string to bytes in Python 3?

Time: 2022-07-02 20:13:20

There appear to be two different ways to convert a string to bytes, as seen in the answers to "TypeError: 'str' does not support the buffer interface".

Which of these methods would be better or more Pythonic? Or is it just a matter of personal preference?

b = bytes(mystring, 'utf-8')

b = mystring.encode('utf-8')

5 Answers

#1


334  

If you look at the docs for bytes, it points you to bytearray:

bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.

The optional source parameter can be used to initialize the array in a few different ways:

If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().

If it is an integer, the array will have that size and will be initialized with null bytes.

If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.

If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.

Without an argument, an array of size 0 is created.

So bytes can do much more than just encode a string. It's Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.
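
As a rough illustration of the source types listed above, the following sketch calls the bytes constructor with each kind of argument. The variable names are made up for this example, and the behavior shown assumes Python 3.

```python
# Each call below passes a different kind of `source` argument.
from_text = bytes("héllo", "utf-8")      # str: an encoding is required
from_size = bytes(3)                      # int: that many null bytes
from_buffer = bytes(bytearray(b"abc"))    # object supporting the buffer protocol
from_ints = bytes([104, 105])             # iterable of ints in range(256)
empty = bytes()                           # no argument: zero-length result

print(from_text)    # b'h\xc3\xa9llo'
print(from_size)    # b'\x00\x00\x00'
print(from_buffer)  # b'abc'
print(from_ints)    # b'hi'
print(empty)        # b''

# A str without an encoding is rejected rather than guessed at.
try:
    bytes("héllo")
except TypeError as exc:
    print("TypeError:", exc)
```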

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding), where there is no explicit verb.

Edit: I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you're just skipping a level of indirection if you call encode yourself.

Also, see Serdalis' comment -- unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding) and symmetry is nice.
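
A small sketch of the two points above, assuming Python 3: both spellings yield the same bytes, and decode reverses encode when the same encoding is used. The sample text and variable names are only illustrative.

```python
text = "naïve café"

via_method = text.encode("utf-8")
via_constructor = bytes(text, "utf-8")

# Both spellings produce identical bytes.
print(via_method == via_constructor)  # True

# decode() is the inverse of encode() when the same encoding is used.
round_trip = via_method.decode("utf-8")
print(round_trip == text)  # True
```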

#2


141  

It's easier than you might think:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
type(my_str_as_bytes) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
type(my_decoded_str) # ensure it is string representation

#3


31  

The best way is neither of the two, but a third. The first parameter to encode defaults to 'utf-8', so you can simply write

b = mystring.encode()

This will also be faster: with the default argument, the C code receives NULL instead of the string "utf-8", and NULL is much faster to check!

Here are some timings:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning, the times were very stable across repeated runs; the deviation was only about 2 percent.
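
If you want to repeat the comparison outside IPython, something like the following sketch with the standard timeit module should work. The loop count is an arbitrary choice, and the numbers (including whether any gap shows up at all) will depend on your machine and Python version.

```python
import timeit

# Time both spellings; absolute numbers vary by machine and interpreter.
number = 200_000
explicit = min(timeit.repeat("'abc'.encode('utf-8')", repeat=5, number=number))
default = min(timeit.repeat("'abc'.encode()", repeat=5, number=number))

print(f"encode('utf-8'): {explicit / number * 1e9:.1f} ns per call")
print(f"encode():        {default / number * 1e9:.1f} ns per call")
```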

#4


23  

You can simply convert a string to bytes using:

a_string.encode()

and you can simply convert bytes to a string using:

some_bytes.decode()

Both bytes.decode and str.encode use encoding='utf-8' as the default value.
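
A quick check of those defaults, assuming Python 3. The sample string is arbitrary, and the last line also shows the optional errors argument that encode accepts.

```python
text = "€5"

# With no arguments, UTF-8 is used for both directions.
print(text.encode() == text.encode("utf-8"))  # True
print(text.encode())                           # b'\xe2\x82\xac5'
print(b"\xe2\x82\xac5".decode())               # €5

# The optional errors argument controls what happens to unencodable characters.
print(text.encode("ascii", errors="replace"))  # b'?5'
```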

The following functions (taken from Effective Python) might be useful for converting str to bytes and bytes to str:

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode() # uses 'utf-8' for encoding
    else:
        value = bytes_or_str
    return value # Instance of bytes


def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode() # uses 'utf-8' for decoding
    else:
        value = bytes_or_str
    return value # Instance of str
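
For a sense of how these helpers behave, here is a short usage sketch. The two functions are restated in condensed form so the snippet runs on its own; the sample inputs are just illustrations.

```python
def to_bytes(bytes_or_str):
    # Same logic as the helper above, condensed: encode str, pass bytes through.
    return bytes_or_str.encode() if isinstance(bytes_or_str, str) else bytes_or_str


def to_str(bytes_or_str):
    # Decode bytes, pass str through unchanged.
    return bytes_or_str.decode() if isinstance(bytes_or_str, bytes) else bytes_or_str


print(to_bytes("héllo"))        # b'h\xc3\xa9llo'
print(to_bytes(b"raw"))         # b'raw'
print(to_str(b"h\xc3\xa9llo"))  # héllo
print(to_str("text"))           # text
```

Either helper can be called at the boundary of a function that should work on one type only, so the rest of the code does not need to care which type the caller passed in.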

#5


9  

so_string = '*'
so_bytes = so_string.encode()
