mysql字符编码问题latin1到utf8

测试目的：

java从mysql读取数据并打印

测试环境1

a. hadoop集群某节点的环境是utf8，java代码也是utf8编码

b. 需要读取的mysql服，数据库、数据表均是latin1

运行mysql -u* -p* -A -h 进入mysql服

（一）查看mysql编码的基本方法

先在mysql下确认原始数据是什么编码，经过下面三步可以确认原始数据是latin1

进入mysql > use db;

// set names的作用是改变三个红色部分的三个参数

8 rows in set (0.00 sec)

Set names 'utf8'等价于下面三条语句

SET character_set_client = utf8 用来设置客户端送给MySQL服务器的数据的字符集
SET character_set_results = utf8 服务器返回查询结果时使用的字符集
SET character_set_connection = utf8 MySQL 服务器把客户端传来的数据，从character_set_client字符集转换成character_set_connection字符集

mysql的信息流向及编码转换

接收：client --> connection --> database

反馈：database--->connection-->results

mysql> SELECT LOWER(consumption_name),consumption FROM dimen_table;
+-------------------------+-------------+
| LOWER(consumption_name) | consumption |
+-------------------------+-------------+
| gm命令                |       -1019 |
| 神器解锁            |       -1018 |
| 购买经验药水      |       -1017 |
| 购买时装            |       -1016 |
| 运营活动(领物品) |       -1015 |
| 神器精炼(棍)       |       -1014 |
| 神器精炼(拳)       |       -1013 |
| 商店购买            |       -1012 |

（二） java下的代码测试

测试1： java连接代码

jdbc:mysql://106.2.67.10/sdc_hdfs?useUnicode=true&characterEncoding=UTF-8

Statement statement = con.createStatement();

statement.execute("set names 'utf8'");//与参数characterEncoding=UTF-8的作用类似

sql = "SELECT LOWER(consumption_name),consumption FROM dimen_table"

ResultSet rs = statement.executeQuery(sql);

while (rs.next()) {

System.out.println(new String(rs.getBytes(i+1),"cp1252")); // latin1 对应的解码用ISO-8859-1或者Cp1252，我测试的时候发现应该用cp1252

}

// 查看mysql支持的字符集及其描述

【参考】

http://*.com/questions/21689665/mysql-latin1-to-utf-8-using-java-hibernate-jpa ：

提到latin1应该用cp1252解码：

MySQL's version of latin1 is an extended version of CP1252: it uses the 5 bytes that CP1252 leaves undefined. Unfortunately the current Connector/J has a "bug" in that it uses the original CP1252 rather than MySQL's own version. Therefore it's impossible to recover strings whose encoding uses one of these 5 bytes. Patching the Connector/J source to fix the bug could solve the problem, but ideally you should migrate the tables to UTF-8.

提到修改sql语句的解决方法：SELECT CONVERT(CONVERT(CONVERT( column_name USING latin1) USING binary) using utf8) FROM...

测试2，改sql，将编码转换交给mysql来处理

url置为jdbc:mysql://123.*.*.108/db

sql置成SELECT CONVERT(CONVERT(CONVERT(LOWER(consumption_name) USING latin1) USING binary) USING utf8),consumption FROM dimen_table

执行sql前，先执行statement.execute("set names 'utf8'");

在ResultSet中，rs.getString()获得的就是utf-8编码

在windows下显示正常。

在linux下显示有问题。

gm命令 -1019
神器解锁        -1018
购买经验药水    -1017
购买时装        -1016
运营活动(领物-1015
神器精炼(-1014
神器精炼(-1013
商店购买        -1012

（说明：红色部分不能正常解析）

这个是linux环境的编码环境导致的：解决方法

1) 使用locale查看file.encoding这个系统变量，如果不是utf8可以运行这个命令 export LANG=zh_CN.utf8

或者这样LANG=zh_CN.utf8 java -Djava.ext.dir=/sdfls/asdlfjal/test Main执行之后

2) 或者java -Dfile.encoding=utf8 MainClass

测试3：修改测试环境

a. hadoop集群某节点的环境是utf8，java代码也是utf8编码

b. 将上面的latin1数据库的编码改为utf8（修改生效的判定：新建的表默认是utf8.修改之前默认的编码是latin1）

步骤及结果

url置为jdbc:mysql://123.*.*.108/db?CharSet=utf8&useUnicode=true&characterEncoding=utf8

sql还是普通写法:SELECT LOWER(consumption_name),consumption FROM info.dimen_table

在执行上面这个sql之前，查看字符集

System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 初始连接进去之后查看 字符集
/** 结果5
character_set_client    utf8
character_set_connection        utf8
character_set_database  utf8
character_set_filesystem        binary
character_set_results
<span style="color:#FF0000;">character_set_server    latin1</span>
character_set_system    utf8
character_sets_dir      /usr/share/mysql/charsets/
*/

statement.execute("set names 'utf8'"); //修改连接方式
//statement.execute("set character_set_server='utf8'");
System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 修改之后，查看连接使用的字符集
/** 结果6
character_set_client    utf8
character_set_connection        utf8
character_set_database  utf8
character_set_filesystem        binary
character_set_results
<span style="color:#FF0000;">character_set_server    utf8</span>
character_set_system    utf8
character_sets_dir      /usr/share/mysql/charsets/
*/

在ResultSet中，用utf8解释：new String(rs.getString(i).getBytes("UTF-8"));// 右边这个写法完全是乱码 new String(rs.getString(i).getBytes(),"UTF-8")

跟测试2的最终输出结果是一样的。

（对于纯粹的utf8环境：服务器编码是utf8;mysql的默认编码也是utf8，进入mysql之后查看字符集，会看到字符集展示就是结果6，即默认的character_set_server=utf8）

感觉所有问题的根源在于默认的character_set_server=latin1

http://*.com/questions/27866533/whacky-latin1-to-utf8-conversion-in-jdbc ：提到jdbc对不识别的latin1的编码字符插入了特殊的替换字符

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters

【参考】使用Java读写存储在latin1编码的MySQL中的UTF-8编码的中文

Character set bug at server with utf8 column and latin1 connection

Description: // This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"‘

其他相关资料及描述

貌似最好把所有字符集设置成utf8的方法，解决起来最彻底；并且这个问题只出现在jdbc中，使用python操作的时候，读取都是正常的

    conn10 = MySQLdb.connect(host=db10, user=mdUser, passwd=mdPasswd, db="sdc_hdfs" )
    cursor10 = conn10.cursor()
    cursor10.execute("set @@autocommit=1")
    cursor10.execute("SHOW VARIABLES LIKE 'character_set_database'")
    data=cursor10.fetchone()
    if data[1]=='utf8':
        cursor10.close()
        cursor10 = MySQLdb.connect(db10, mdUser, mdPasswd, 'sdc_hdfs', charset='utf8', use_unicode=False).cursor() #使用utf8来连接
        cursor10.execute("set @@autocommit=1")

Java中String解码、编码

Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:

byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".
String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.

Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).

Unicode - How to get the characters right?

latin1转gbk的乱码问题，jdbc的bug

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

最后发现：

1. 不同jdbc对编码的支持情况不一样

2. 测试2，其实能解决问题，也就是set names 'utf8' + 在SQL里边使用convert()来转码是万能的：能绕开不同jdbc带来的问题。

秒客网

mysql字符编码问题latin1到utf8

相关文章