Writing UTF-8 strings to MySQL with Python

Time: 2023-01-05 23:16:43

I am trying to push user account data from an Active Directory to our MySQL-Server. This works flawlessly but somehow the strings end up showing an encoded version of umlauts and other special characters.

The Active Directory returns a string using this sample format: M\xc3\xbcller

This actually is the UTF-8 encoding for Müller, but I want to write Müller to my database not M\xc3\xbcller.

I tried converting the string with this line, but it results in the same string in the database: tempEntry[1] = tempEntry[1].decode("utf-8")

If I run print "M\xc3\xbcller".decode("utf-8") in the python console the output is correct.

Is there any way to insert this string correctly? A web developer needs the data in exactly this format; I don't know why he cannot convert the string directly in PHP.

Additional info: I am using MySQLdb; The table and column encoding is utf8_general_ci

8 Answers

#1


41  

As @marr75 suggests, make sure you set charset='utf8' on your connections. Setting use_unicode=True is not strictly necessary as it is implied by setting the charset.

Then make sure you are passing unicode objects to your db connection, as it will encode them using the charset you passed to the connection. If you pass a utf8-encoded string, it will be doubly encoded when it reaches the database.

So, something like:

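import MySQLdb

# charset='utf8' makes the driver encode unicode parameters as UTF-8 on the wire
conn = MySQLdb.connect(host="localhost", user='root', passwd='', db='', charset='utf8')
data_from_ldap = 'M\xc3\xbcller'      # UTF-8 byte string, as returned by LDAP
name = data_from_ldap.decode('utf8')  # unicode object u'M\xfcller'
cursor = conn.cursor()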
cursor.execute(u"INSERT INTO mytable SET name = %s", (name,))
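
The double encoding mentioned above can be reproduced without a database. The following is only a sketch: it assumes the faulty path treats the UTF-8 bytes as latin-1 text before encoding them again, which is one common way this happens, not necessarily what a given driver version does. The variable names are made up for illustration.

```python
raw = b'M\xc3\xbcller'        # UTF-8 bytes as delivered by the directory
text = raw.decode('utf-8')    # proper text, u'M\xfcller'

# Correct path: the text is encoded exactly once.
stored_ok = text.encode('utf-8')

# Faulty path (assumed): the bytes are mistaken for latin-1 text
# and then encoded to UTF-8 a second time.
stored_twice = raw.decode('latin-1').encode('utf-8')

print(repr(stored_ok))
print(repr(stored_twice))
```

The doubly encoded value is two bytes longer, which is what later shows up as two garbage characters in place of the single umlaut.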

You may also try forcing the connection to use utf8 by passing the init_command param, though I'm unsure if this is required. 5 mins testing should help you decide.

conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')

Also, and this is barely worth mentioning as 4.1 is so old, make sure you are using MySQL >= 4.1

#2


15  

Assuming you are using MySQLdb you need to pass use_unicode=True and charset="utf8" when creating your connection.

UPDATE: If I run the following against a test table I get -

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
>>> c = db.cursor()
>>> c.execute("INSERT INTO last_names VALUES(%s)", (u'M\xfcller', ))
1L
>>> c.execute("SELECT * FROM last_names")
1L
>>> print c.fetchall()
(('M\xc3\xbcller',),)

This is "the right way": the characters are being stored and retrieved correctly. Your friend writing the PHP script just isn't handling the encoding correctly when outputting.

As Rob points out, use_unicode and charset combined is being verbose about the connection, but I have a natural paranoia about even the most useful python libraries outside of the standard library so I try to be explicit to make bugs easy to find if the library changes.

#3


8  

I found the solution to my problems. Decoding the String with .decode('unicode_escape').encode('iso8859-1').decode('utf8') did work at last. Now everything is inserted as it should. The full other solution can be found here: Working with unicode encoded Strings from Active Directory via python-ldap
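
To illustrate why that chain helps, here is a small sketch. It assumes the value really contains literal backslash escapes (a backslash, an x and two hex digits) rather than the raw bytes 0xC3 0xBC, which would explain why a plain .decode("utf-8") changed nothing. The sample value and variable names are hypothetical.

```python
# Hypothetical input: the escapes arrive as literal text, 13 bytes in total.
escaped = b'M\\xc3\\xbcller'

step1 = escaped.decode('unicode_escape')  # escapes become code points U+00C3, U+00BC
step2 = step1.encode('iso8859-1')         # code points mapped back to bytes 0xC3 0xBC
name = step2.decode('utf8')               # UTF-8 bytes decoded to text

print(repr(name))
```

The iso8859-1 step works because that codec maps code points 0 to 255 one-to-one onto byte values, so it recovers the original UTF-8 bytes unchanged.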

#4


8  

import MySQLdb

# connect to the database
db = MySQLdb.connect("****", "****", "****", "****") #don't use charset here

# setup a cursor object using cursor() method
cursor = db.cursor()

cursor.execute("SET NAMES utf8mb4;") #or utf8 or any other charset you want to handle

cursor.execute("SET CHARACTER SET utf8mb4;") #same as above

cursor.execute("SET character_set_connection=utf8mb4;") #same as above

# run a SQL query
cursor.execute("****")

#and make sure the MySQL settings are correct, data too

#5


5  

Recently I had the same issue with field value being a byte string instead of unicode. Here's a little analysis.

Overview

In general, all one needs to do to get unicode values from a cursor is to pass the charset argument to the connection constructor and use non-binary table fields (e.g. with the utf8_general_ci collation). Passing use_unicode is redundant because it is set to true whenever charset has a value.

MySQLdb respects cursor description field types, so if you have a DATETIME column in a cursor, the values will be converted to Python datetime.datetime instances, DECIMAL to decimal.Decimal, and so on, but binary values will be represented as is, as byte strings. Most decoders are defined in MySQLdb.converters, and one can override them on a per-instance basis by providing the conv argument to the connection constructor.

But unicode decoders are an exception here, which is likely a design shortcoming. They are appended directly to the connection instance's converters in its constructor, so it is only possible to override them on a per-instance basis.

Workaround

Let's see the issue code.

import MySQLdb

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
cursor     = connection.cursor()

cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")

print cursor.fetchone() 
# (u'abcd\u0451', 'abcd\xd1\x91')
print cursor.description 
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags 
# (1, 0)

This shows that the b field is returned as a byte string instead of unicode, even though it is not binary: MySQLdb.constants.FLAG.BINARY & cursor.description_flags[1] is 0 (see the MySQLdb field flags). It looks like a bug in the library (opened #90). The cause, as I see it, is that MySQLdb.constants.FIELD_TYPE.LONG_BLOB (cursor.description[1][1] == 251, see the MySQLdb field types) simply has no converter at all.

import MySQLdb
import MySQLdb.converters as conv
import MySQLdb.constants as const

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
connection.converter[const.FIELD_TYPE.LONG_BLOB] = connection.converter[const.FIELD_TYPE.BLOB]
cursor = connection.cursor()

cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")

print cursor.fetchone()
# (u'abcd\u0451', u'abcd\u0451')
print cursor.description
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags
# (1, 0)

Thus by manipulating connection instance converter dict, it is possible to achieve desired unicode decoding behaviour.

If you want to override the behaviour, here is what a dict entry for a possible text field looks like after the constructor has run.

import MySQLdb
import MySQLdb.constants as const

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
print connection.converter[const.FIELD_TYPE.BLOB]
# [(128, <type 'str'>), (None, <function string_decoder at 0x7fa472dda488>)]

MySQLdb.constants.FLAG.BINARY == 128. This means that if a field has the binary flag it will be returned as str; otherwise the unicode decoder will be applied. So if you want to try to convert binary values as well, you can pop the first tuple.
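
As a rough illustration of that rule, the following is a simplified model of how such a list of (mask, converter) tuples could be resolved. It is not the library's actual code: pick_converter and string_decoder are hypothetical stand-ins written for this sketch, and only the value 128 is taken from the text above.

```python
BINARY_FLAG = 128  # value of MySQLdb.constants.FLAG.BINARY quoted above


def string_decoder(raw):
    # Stand-in for the driver's unicode decoder on a charset='utf8' connection.
    return raw.decode('utf-8')


def pick_converter(entry, field_flags):
    """Return the first converter whose mask matches; a None mask always matches."""
    for mask, converter in entry:
        if mask is None or (field_flags & mask):
            return converter
    return None


entry = [(BINARY_FLAG, bytes), (None, string_decoder)]

text_value = pick_converter(entry, 0)(b'abcd\xd1\x91')               # decoded
binary_value = pick_converter(entry, BINARY_FLAG)(b'abcd\xd1\x91')   # left as bytes

# After popping the first tuple, binary fields go through the decoder too.
entry.pop(0)
forced_value = pick_converter(entry, BINARY_FLAG)(b'abcd\xd1\x91')
```

Popping the first tuple is only safe when the binary columns are known to hold UTF-8 text; genuinely binary data would make the decoder raise.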

#6


2  

(Would like to reply to the above answer but do not have enough reputation...)

The reason why you don't get unicode results in this case:

>>> print c.fetchall()
(('M\xc3\xbcller',),)

is a bug from MySQLdb 1.2.x with *_bin collation, see:

http://sourceforge.net/tracker/index.php?func=detail&aid=1693363&group_id=22307&atid=374932
http://sourceforge.net/tracker/index.php?func=detail&aid=2663436&group_id=22307&atid=374932

In this particular case (collation utf8_bin - or [anything]_bin...) you have to expect the "raw" value, here utf-8 (yes, this sucks as there is no generic fix).
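
Since there is no generic fix, one possible workaround is to decode such values on the application side. This is only a sketch: to_text is a hypothetical helper, not part of MySQLdb, and it assumes every byte string in the row holds UTF-8 text.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings to text; pass every other value through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical row as it might come back from a column with a *_bin collation.
row = (b'M\xc3\xbcller', 42, None)
decoded = tuple(to_text(v) for v in row)
```

Applying the helper to a whole result set would also decode real BLOB data, so it is best limited to columns known to contain text.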

#7


0  

And does db.set_character_set('utf8') imply use_unicode=True?

#8


0  

There is another situation, maybe a little rare.

If you create the schema in MySQL Workbench first, you will get the encoding error and cannot solve it by adding the charset configuration.

This is because MySQL Workbench creates schemas with latin1 by default, so you should set the charset first!
