UTF-8, PHP and XML, MySQL

Time: 2022-10-24 22:44:11

I am having great problems solving this one:

I have a MySQL database with encoding latin1_swedish_ci and a table that stores names and addresses.

I am trying to output a UTF-8 XML file, but I am having problems with the following string:

The string Otivägen is being output as OtivÃ¤gen when I open the file in vim. Also, when I open it in IE I get

"An invalid character was found in text content. Error processing resource"

I have the following code:

function fixEncoding($in_str)
{
    $cur_encoding = mb_detect_encoding($in_str) ;
    if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
        return $in_str;
    else
        return utf8_encode($in_str);
}

header("Content-type: text/plain;charset=utf-8");
$mystring = "Otivägen"; // this is actually obtained from database

$myxml = "<myxml>
....
     <node>".$mystring."</node>
....
</myxml>
";
$myxml = fixEncoding($myxml);

The actual XML output is below:

<?xml version="1.0" encoding="UTF-8" ?>
<myxml>
    ....
    <node>OtivÃ¤gen</node>
    ....
</myxml>

Any ideas how I can output the file so that in vim it reads Otivägen and not OtivÃ¤gen?

EDIT:

I did mysql_client_encoding() and got latin1
I then did mysql_set_charset()
and again ran mysql_client_encoding() and got utf8, but still the same outputting issues.

Edit 2

I have logged into the command line and run the query SELECT address1 FROM address WHERE id = 1000;

SELECT address1 FROM address WHERE id = 1000;
Current database: ftpuser_db

+-------------+
|   address1  |
+-------------+
| Otivägen 32 |
+-------------+
1 row in set (0.06 sec)

Thanks in advance!

6 answers

#1


1  

I think you did everything correctly, except that your terminal is in Latin-1.

The UTF-8 sequence for ä is C3 A4, which is Ã¤ if displayed as Latin-1.
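
The byte-level claim above can be checked with a minimal PHP sketch. It assumes the mbstring extension is available, and the variable names are illustrative only. The strings are written with byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// "ä" in UTF-8 is the two bytes C3 A4.
$utf8 = "Otiv\xC3\xA4gen";
echo bin2hex("\xC3\xA4"), "\n"; // c3a4

// Simulate a Latin-1 display: treat every byte as one ISO-8859-1
// character and re-express the result in UTF-8 so it can be printed.
$asLatin1 = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $asLatin1, "\n"; // OtivÃ¤gen (on a UTF-8 terminal)
```

If the file on disk contains C3 A4 for the ä, the data is already correct UTF-8 and only the viewer is interpreting it as Latin-1.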

#2


2  

Is your MySQL connection encoding properly set to UTF-8 ?

Check mysql_set_charset() and mysql_client_encoding() for more details.

#3


2  

Oh boy. UTF8 issues can be a real pain and they get almost impossible to solve when something is doing re-encodings for you.

You really need to start at one end and make sure every process is UTF8. That stops anything along the way from interpreting the data wrongly and 'converting' it for you. But significantly, it will also let you spot much more easily when something has already mis-encoded text for you (yes, I've had that problem).

And if you have UTF8 data in tables that aren't set to UTF8 and might be mis-encoded, you need to convert the tables last, after the data has been re-encoded. Otherwise you will damage your data irretrievably. I've had that problem, too.

First steps:

  • Check your terminal is UTF8 compliant. Gnome-terminal is. Kterm is. ETerm is not.
  • Check your LANG setting in your shell. It should probably have .UTF-8 on the end of its value.
  • Check that vim is picking up the UTF8 setting correctly. You can check with :set encoding

This will mean that your files will be edited in UTF8.

Now we check MySQL.

In the MySQL CLI, run show variables like 'character_set%'; The results will probably be something like:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

What you're aiming for is to change all those latin1 values (or whatever you're seeing) to utf8.

set names utf8; will change most of them, and you might need to do that with every new connection to your database. This was the solution I had to adopt in a previous application. The other settings to change are in the my.cnf file, for which I need to direct you to the documentation. It is unlikely you will need to set them all.

I see you're already setting the output headers, so that's good.

Now you can look at the data from the database and see why it's "wrong".

#4


0  

latin1_swedish_ci is a collation, not a charset. Since collations are supposed to match their charset, it suggests that the table is using latin1, but it's not a guarantee.

Strictly speaking, the charset of tables is irrelevant here, since MySQL can convert input/output. That's what the connection charset (mysql_set_charset) is for. However, for that to work properly, the data needs to be encoded properly in the database. I would begin by checking that strings are correct in the database. The simplest thing is to log in on the command line and select a row which has non-ASCII characters in it. Does it look OK?

$mystring = "Otivägen"; // this is actually obtained from database

Watch out. The encoding of the data in $mystring will now depend on the encoding of the php file. That may or may not be the same as the data in the database.
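
One way to remove that ambiguity while debugging is to look at the raw bytes rather than the rendered characters. The sketch below uses a hypothetical helper, hexDump(), which is not part of the original code. The two sample strings are written with byte escapes, so they do not depend on how the PHP file is saved.

```php
<?php
// Hypothetical helper: show a string as space-separated hex bytes.
function hexDump(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

$utf8Literal   = "Otiv\xC3\xA4gen"; // ä as the UTF-8 bytes C3 A4
$latin1Literal = "Otiv\xE4gen";     // ä as the single Latin-1 byte E4

echo hexDump($utf8Literal), "\n";   // 4f 74 69 76 c3 a4 67 65 6e
echo hexDump($latin1Literal), "\n"; // 4f 74 69 76 e4 67 65 6e
```

Running the same helper on the value that actually comes back from the database shows which of the two forms you have, regardless of the terminal or editor settings.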

#5


0  

Before output, run the query SET NAMES utf8

After output, you can go back and run SET NAMES latin1

Look here, I've got the same problem

#6


0  

It seems you are "double encoding" Otivägen. You get this behaviour if Otivägen is already UTF-8 and you run utf8_encode() on it again. Example:

$str = "Otivägen"; // already a UTF-8 string
echo utf8_encode($str); // outputs OtivÃ¤gen

I'm not sure where the actual "double encoding" occurs, but it may be due to settings in your editor. My theory: let's say you are running Aptana Studio and your actual character set is ISO-8859-1. (In Aptana, you can check this by right-clicking a file and choosing "Properties". To set the default character encoding for all projects, choose Preferences from the Aptana main menu -> General -> Workspace.) If that's the case, the PHP source file where you have $myxml and its string <myxml><node>... is detected as ISO-8859-1, but $mystring received from the database is UTF-8. Your fixEncoding function would then run the else clause, since $myxml as a whole is seen as ISO-8859-1 and not UTF-8. This double-encodes the results from the database, and may be the cause of your problem.

Check the encoding of your actual source file in your editor, and verify that it is set to UTF-8. Alternatively, experiment with applying or removing fixEncoding/utf8_encode/utf8_decode to $myxml. Observe the results and see what needs to be done to get the value Otivägen right.
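
As a rough sketch of that experiment, the code below converts a value only when it is not already valid UTF-8, and shows how one layer of double encoding can be undone. The function name toUtf8Once() is hypothetical, the fallback to ISO-8859-1 is an assumption about the data, and the mbstring extension is assumed to be loaded. The validity check is a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 would be left untouched.

```php
<?php
// Hypothetical alternative to fixEncoding(): convert only when the input
// is not already valid UTF-8, assuming ISO-8859-1 otherwise.
function toUtf8Once(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already UTF-8, leave it alone
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$utf8   = "Otiv\xC3\xA4gen"; // UTF-8 bytes for Otivägen
$latin1 = "Otiv\xE4gen";     // Latin-1 bytes for Otivägen

var_dump(toUtf8Once($utf8) === $utf8);   // bool(true): not encoded twice
var_dump(toUtf8Once($latin1) === $utf8); // bool(true): converted once

// Reproduce the double encoding, then reverse exactly one layer of it.
$double   = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8);           // bool(true)
```

Applying the conversion to each value where it enters the script, rather than to the assembled $myxml string, avoids one mis-encoded fragment deciding how the whole document is treated.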
