mysql字符编码问题latin1到utf8

时间:2023-01-10 11:23:38

测试目的:

java从mysql读取数据并打印

测试环境1

a. hadoop集群某节点的环境是utf8,java代码也是utf8编码

b. 需要读取的mysql服,数据库、数据表均是latin1



运行mysql -u* -p* -A -h 进入mysql服

(一)查看mysql编码的基本方法

 先在mysql下确认原始数据是什么编码,经过下面三步可以确认原始数据是latin1

进入mysql > use db;

mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | gbk                        |
| character_set_connection | gbk                        |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | gbk                        |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)


// set names的作用是改变三个红色部分的三个参数

mysql> set names 'latin1';
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

8 rows in set (0.00 sec)


Set names 'utf8'等价于下面三条语句

SET character_set_client = utf8 用来设置客户端送给MySQL服务器的数据的 字符集
SET character_set_results = utf8 服务器返回查询结果时使用的字符集
SET character_set_connection = utf8    MySQL 服务器 把客户端传来的数据,从character_set_client字符集转换成character_set_connection字符集

mysql的信息流向及编码转换

接收:client --> connection --> database

反馈:database--->connection-->results



mysql> SELECT LOWER(consumption_name),consumption FROM dimen_table;
+-------------------------+-------------+
| LOWER(consumption_name) | consumption |
+-------------------------+-------------+
| gm命令                |       -1019 |
| 神器解锁            |       -1018 |
| 购买经验药水      |       -1017 |
| 购买时装            |       -1016 |
| 运营活动(领物品) |       -1015 |
| 神器精炼(棍)       |       -1014 |
| 神器精炼(拳)       |       -1013 |
| 商店购买            |       -1012 |

(二) java下的代码测试

 测试1: java连接代码

jdbc:mysql://106.2.67.10/sdc_hdfs?useUnicode=true&characterEncoding=UTF-8

Statement statement = con.createStatement();

statement.execute("set names 'utf8'");//与参数characterEncoding=UTF-8的作用类似

sql = "SELECT LOWER(consumption_name),consumption FROM dimen_table"

ResultSet rs = statement.executeQuery(sql);

while (rs.next()) {

        System.out.println(new String(rs.getBytes(i+1),"cp1252")); // latin1 对应的解码用ISO-8859-1或者Cp1252,我测试的时候发现应该用cp1252

}

// 查看mysql支持的字符集及其描述

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |



【参考】

http://*.com/questions/21689665/mysql-latin1-to-utf-8-using-java-hibernate-jpa :

提到latin1应该用cp1252解码:

MySQL's version of latin1 is an extended version of CP1252: it uses the 5 bytes that CP1252 leaves undefined. Unfortunately the current Connector/J has a "bug" in that it uses the original CP1252 rather than MySQL's own version. Therefore it's impossible to recover strings whose encoding uses one of these 5 bytes. Patching the Connector/J source to fix the bug could solve the problem, but ideally you should migrate the tables to UTF-8.

提到修改sql语句的解决方法:SELECT CONVERT(CONVERT(CONVERT( column_name USING latin1) USING binary) using utf8) FROM...


测试2,改sql,将编码转换交给mysql来处理

url置为jdbc:mysql://123.*.*.108/db

sql置成SELECT CONVERT(CONVERT(CONVERT(LOWER(consumption_name) USING latin1) USING binary) USING utf8),consumption FROM dimen_table

执行sql前,先执行statement.execute("set names 'utf8'");

在ResultSet中,rs.getString()获得的就是utf-8编码


在windows下显示正常。

在linux下显示有问题。

gm命令  -1019
神器解锁        -1018
购买经验药水    -1017
购买时装        -1016
运营活动(领物-1015
神器精炼(-1014
神器精炼(-1013

商店购买        -1012

(说明:红色部分不能正常解析)


这个是linux环境的编码环境导致的:解决方法

1) 使用locale查看file.encoding这个系统变量,如果不是utf8可以运行这个命令 export LANG=zh_CN.utf8

或者这样LANG=zh_CN.utf8  java -Djava.ext.dir=/sdfls/asdlfjal/test Main执行之后

2) 或者java -Dfile.encoding=utf8 MainClass



测试3:修改测试环境

a. hadoop集群某节点的环境是utf8,java代码也是utf8编码

b. 将上面的latin1数据库的编码改为utf8(修改生效的判定:新建的表默认是utf8.修改之前默认的编码是latin1)

步骤及结果


url置为jdbc:mysql://123.*.*.108/db?CharSet=utf8&useUnicode=true&characterEncoding=utf8

sql还是普通写法:SELECT LOWER(consumption_name),consumption FROM info.dimen_table

在执行上面这个sql之前,查看字符集


System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 初始连接进去之后查看 字符集
/** 结果5
character_set_client    utf8
character_set_connection        utf8
character_set_database  utf8
character_set_filesystem        binary
character_set_results
<span style="color:#FF0000;">character_set_server    latin1</span>
character_set_system    utf8
character_sets_dir      /usr/share/mysql/charsets/
*/

statement.execute("set names 'utf8'"); //修改连接方式
//statement.execute("set character_set_server='utf8'");
System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 修改之后,查看连接使用的字符集
/** 结果6
character_set_client    utf8
character_set_connection        utf8
character_set_database  utf8
character_set_filesystem        binary
character_set_results
<span style="color:#FF0000;">character_set_server    utf8</span>
character_set_system    utf8
character_sets_dir      /usr/share/mysql/charsets/
*/

在ResultSet中,用utf8解释:new String(rs.getString(i).getBytes("UTF-8"));//     右边这个写法完全是乱码 new String(rs.getString(i).getBytes(),"UTF-8")

跟测试2的最终输出结果是一样的。


(对于纯粹的utf8环境:服务器编码是utf8;mysql的默认编码也是utf8,进入mysql之后查看字符集,会看到字符集展示就是结果6,即默认的character_set_server=utf8)

感觉所有问题的根源在于默认的character_set_server=latin1


http://*.com/questions/27866533/whacky-latin1-to-utf8-conversion-in-jdbc  :提到jdbc对不识别的latin1的编码字符插入了特殊的替换字符

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters



【参考】使用Java读写存储在latin1编码的MySQL中的UTF-8编码的中文

Character set bug at server with utf8 column and latin1 connection

Description: // This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"‘


 其他相关资料及描述

貌似最好把所有字符集设置成utf8的方法,解决起来最彻底;并且这个问题只出现在jdbc中,使用python操作的时候,读取都是正常的

    conn10 = MySQLdb.connect(host=db10, user=mdUser, passwd=mdPasswd, db="sdc_hdfs" )
cursor10 = conn10.cursor()
cursor10.execute("set @@autocommit=1")
cursor10.execute("SHOW VARIABLES LIKE 'character_set_database'")
data=cursor10.fetchone()
if data[1]=='utf8':
cursor10.close()
cursor10 = MySQLdb.connect(db10, mdUser, mdPasswd, 'sdc_hdfs', charset='utf8', use_unicode=False).cursor() #使用utf8来连接
cursor10.execute("set @@autocommit=1")


Java中String解码、编码

Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:


byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".
String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.

Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).
  

Unicode - How to get the characters right?

latin1转gbk的乱码问题,jdbc的bug

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

最后发现:

1. 不同jdbc对编码的支持情况不一样

2. 测试2,其实能解决问题,也就是set names 'utf8'  + 在SQL里边使用convert()来转码是万能的: 能绕开不同jdbc带来的问题。