PostgreSQL + PHP + UTF8 = invalid byte sequence for encoding

I'm migrating a database from MySQL to PostgreSQL. The MySQL db's default character set is UTF8, Postgres is also using UTF8, and I'm escaping the data with pg_escape_string(). For whatever reason, however, I'm running into some funky errors about bad encoding:

pg_query() [function.pg-query]: Query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xeb7374 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client"

I've been poking around trying to figure this out, and noticed that PHP is doing something weird: if a string has only ASCII chars in it (e.g. "hello"), the detected encoding is ASCII. If the string contains any non-ASCII chars (e.g. "Hëllo"), it says the encoding is UTF8.

When I use utf8_encode() on strings that are already UTF8, it mangles the special chars (they get encoded a second time), so what can I do to get this to work?

(The exact char hanging it up right now is "�", but instead of just doing a search/replace, I'd like to find a better solution so this kind of problem doesn't happen again.)

2 Answers

#1

Most likely, the data in your MySQL database isn't UTF8. It's a pretty common scenario. MySQL at least used to not do any proper validation at all on the data, so it accepted anything you threw at it as UTF8 as long as your client claimed it was UTF8. They may have fixed that by now (or not, I don't know if they even consider it a problem), but you may already have incorrectly encoded data in the db. PostgreSQL, of course, performs full validation when you load it, and thus it may fail.

You may want to feed the data through something like iconv that can be set to ignore unknown characters, or transform them to "best guess".
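
For what it's worth, here's a minimal sketch of that idea in PHP. It assumes the bad rows are really Latin-1 (ISO-8859-1), which is only a guess based on the 0xeb byte in the error message; the helper name to_valid_utf8() is made up for illustration. It leaves strings that are already valid UTF-8 alone (so nothing gets double-encoded the way utf8_encode() would) and runs everything else through iconv.

```php
<?php
// Hypothetical helper: return a valid UTF-8 string.
// ASSUMPTION: anything that is not valid UTF-8 is really ISO-8859-1.
function to_valid_utf8(string $value, string $assumedSource = 'ISO-8859-1'): string
{
    // Already valid UTF-8 (plain ASCII included): return it untouched.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Otherwise reinterpret the raw bytes as the assumed legacy encoding.
    $converted = iconv($assumedSource, 'UTF-8', $value);
    if ($converted === false) {
        throw new RuntimeException('Could not convert value to UTF-8');
    }
    return $converted;
}

// The three bytes from the error message, 0xeb 0x73 0x74.
$bad = "\xEBst";
var_dump(mb_check_encoding($bad, 'UTF-8'));  // bool(false): 0xEB must be followed by continuation bytes
echo bin2hex(to_valid_utf8($bad)), "\n";     // c3ab7374, i.e. "ëst" in UTF-8
```

You'd call it on each value just before pg_escape_string(). If the source turns out to be something other than Latin-1 (Windows-1252 is another common culprit), pass that as the second argument instead.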

#2

BTW, an ASCII string is exactly the same in UTF-8, because UTF-8 encodes the first 128 code points (0 to 127) the same way ASCII does; so "Hello" in ASCII is byte-for-byte the same as "Hello" in UTF-8, and no conversion is needed.
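
A quick way to convince yourself, as a small sketch using the mbstring extension (the variable names are just for illustration). This is also presumably why a detector such as mb_detect_encoding() reports "ASCII" for "hello": it's simply the narrowest label that fits, not a sign that anything is wrong.

```php
<?php
$plain  = "Hello";          // ASCII only
$accent = "H\xC3\xABllo";   // "Hëllo" as UTF-8: ë is the two bytes C3 AB

var_dump(mb_check_encoding($plain, 'ASCII'));   // bool(true)
var_dump(mb_check_encoding($plain, 'UTF-8'));   // bool(true): same bytes, also valid UTF-8
var_dump(mb_check_encoding($accent, 'ASCII'));  // bool(false): bytes above 0x7F are not ASCII
var_dump(mb_check_encoding($accent, 'UTF-8'));  // bool(true)

// Re-labelling ASCII as UTF-8 changes nothing at the byte level.
var_dump(mb_convert_encoding($plain, 'UTF-8', 'ASCII') === $plain);  // bool(true)
```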

The table's character set may be UTF-8, but that doesn't mean you're fetching the data from it in that encoding. If you have trouble with the data you pass to pg_escape_string(), it's probably because you're assuming the content fetched from MySQL is encoded in UTF-8 when it isn't. I suggest you look at the MySQL documentation on connection character sets and check the encoding of your connection; you're probably reading from a UTF-8 table over a connection that uses something like Latin-1, in which case special characters such as çéèêöà come back as single Latin-1 bytes rather than UTF-8.
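
A rough sketch of pinning both connections to UTF-8 before copying anything. It assumes you read with mysqli and write with the pgsql extension, and that $mysql and $pg are connections you've already opened; the function name is made up. Note that 'utf8mb4' needs MySQL 5.5.3 or later, so on an older server you'd pass 'utf8' instead.

```php
<?php
// Hypothetical helper. ASSUMPTION: $mysql is an open mysqli connection and
// $pg is an open handle returned by pg_connect().
function force_utf8_connections(mysqli $mysql, $pg): void
{
    // Ask MySQL to send and receive UTF-8 on this connection.
    if (!$mysql->set_charset('utf8mb4')) {
        throw new RuntimeException('MySQL: ' . $mysql->error);
    }
    // Tell PostgreSQL that the client sends UTF-8 (returns 0 on success).
    if (pg_set_client_encoding($pg, 'UTF8') !== 0) {
        throw new RuntimeException('PostgreSQL: could not set client_encoding');
    }
}
```

Call it once, right after connecting and before the first SELECT. Afterwards $mysql->character_set_name() and pg_client_encoding($pg) report what each connection is actually using, which is a handy sanity check.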
