A Brief Look at Encoding and Decoding Chinese Data for Network Transmission

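Whenever data travels over a network, Chinese text in particular, it goes through an encoding step and a decoding step, and garbled Chinese text (mojibake) originates at some stage of that process. Below I simulate the whole encode/decode round trip in code; once you understand it, you should be able to track down the cause of garbled Chinese text and fix it.

First, let us look at how data is sent. Taking Chinese text as an example, the path may look like this: Unicode characters in memory -> bytes in an encoding such as GBK, GB18030, GB2312, or UTF-8 -> an ISO8859-1 string -> finally, possibly, Base64. Strictly speaking the ISO8859-1 string is already enough to carry the data, because ISO8859-1 maps every byte value to exactly one character, so the conversion loses nothing. As I understand it, Base64 is applied afterwards to make the transmitted string simpler to send and receive, since it uses only 64 basic characters (this is my personal view; corrections are welcome!). To keep the focus on how Chinese text is transcoded, this example skips the Base64 step. Base64 encoding and decoding are fairly simple; if you are unfamiliar with them, a quick search will fill you in.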
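For readers who want to see the Base64 stage anyway, here is a minimal sketch of it, kept separate from the main walkthrough. It relies on `java.util.Base64` from the standard library (Java 8 and later). The class name `Base64Sketch` and its two helper methods are invented for illustration and are not part of the original code. The sample string is written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.util.Base64;

// Hypothetical illustration class; not part of the original article's code.
class Base64Sketch {

    private static final Charset GB2312 = Charset.forName("GB2312");

    // Sender side: characters -> GB2312 bytes -> Base64 text (ASCII only).
    static String encode(String text) {
        byte[] raw = text.getBytes(GB2312);
        return Base64.getEncoder().encodeToString(raw);
    }

    // Receiver side: Base64 text -> the same bytes -> characters.
    static String decode(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new String(raw, GB2312);
    }

    public static void main(String[] args) {
        // The article's sample string: "abc" followed by two Chinese characters.
        String chinese = "abc\u4e2d\u6587";
        String wire = encode(chinese);
        System.out.println(wire);                         // YWJj1tDOxA==
        System.out.println(decode(wire).equals(chinese)); // true
    }
}
```

The Base64 text contains only letters, digits, and `=`, which is why it passes safely through channels that might mishandle bytes above 0x7F.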

Now for the code.

1. Utility class: Byte2HexUtil.java

package zmx.util;

import java.math.BigInteger;
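/**
 * Converts byte arrays to hexadecimal strings in three different ways.
 *
 * @author zhangwenchao
 */
public class Byte2HexUtil {

    /**
     * Approach 1: let BigInteger do the conversion.
     *
     * @param bytes the bytes to convert
     * @return lowercase hex string, two digits per byte
     */
    public static String bytes2hex01(byte[] bytes) {
        if (bytes.length == 0) {
            return "";
        }
        // The first argument is the signum of the number
        // (-1 for negative, 0 for zero, 1 for positive). It must be 1 here
        // so that the bytes are read as an unsigned magnitude.
        String hex = new BigInteger(1, bytes).toString(16);
        // BigInteger drops leading zeros, so left-pad to two digits per byte.
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (int i = hex.length(); i < bytes.length * 2; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    /**
     * Approach 2: format each byte with Integer.toHexString.
     *
     * @param bytes the bytes to convert
     * @return lowercase hex string, two digits per byte
     */
    public static String bytes2hex02(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        String tmp = null;
        for (byte b : bytes) {
            // Mask with 0xFF to get the unsigned value (0-255) as an int,
            // then let Integer format that value as hexadecimal.
            tmp = Integer.toHexString(0xFF & b);
            // A byte is 8 bits, i.e. two hex digits, so pad single-digit results.
            if (tmp.length() == 1) {
                tmp = "0" + tmp;
            }
            sb.append(tmp);
        }
        return sb.toString();
    }

    /**
     * Approach 3: look up each 4-bit half of the byte in a digit table.
     *
     * @param bytes the bytes to convert
     * @return lowercase hex string, two digits per byte
     */
    public static String bytes2hex03(byte[] bytes) {
        final String HEX = "0123456789abcdef";
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Shift the high 4 bits down and mask with 0x0f to get a value
            // in 0-15; HEX.charAt of that value is the hex digit.
            sb.append(HEX.charAt((b >> 4) & 0x0f));
            // Mask the low 4 bits with 0x0f to get a value in 0-15 as well.
            sb.append(HEX.charAt(b & 0x0f));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {10, 23, 24, 54};
        // All three calls print 0a171836.
        System.out.println(Byte2HexUtil.bytes2hex01(bytes));
        System.out.println(Byte2HexUtil.bytes2hex02(bytes));
        System.out.println(Byte2HexUtil.bytes2hex03(bytes));
    }

}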

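This utility class converts a byte[] array to a hexadecimal string. Hexadecimal can be thought of as a more compact way of writing the underlying binary: each hex digit stands for 4 bits.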
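One subtlety of the BigInteger route is worth seeing in isolation: BigInteger reads the whole array as a single number, so its own `toString(16)` does not keep leading zero digits, whereas formatting byte by byte always yields two digits per byte. The sketch below compares the two on the same input; the class name `LeadingZeroSketch` and its method names are invented for illustration.

```java
import java.math.BigInteger;

// Hypothetical illustration class; not part of the original article's code.
class LeadingZeroSketch {

    // Hex via BigInteger: the bytes are read as one unsigned number.
    static String viaBigInteger(byte[] bytes) {
        return new BigInteger(1, bytes).toString(16);
    }

    // Hex byte by byte: always exactly two digits per byte.
    static String perByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {10, 23, 24, 54};
        System.out.println(viaBigInteger(bytes)); // a171836  (7 digits)
        System.out.println(perByte(bytes));       // 0a171836 (8 digits)
    }
}
```

When the hex string is later going to be split back into bytes, the two-digits-per-byte form is the safer one to rely on.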

2. The encoding and decoding process:

package zmx.test;

import zmx.util.Byte2HexUtil;

public class T10 {

    // Prints each byte as a signed value next to the ISO8859-1 character it maps to.
    public static void print(byte[] bytes) throws Exception {
        for (byte b : bytes) {
            System.out.print(b + " " + new String(new byte[]{b}, "ISO8859-1") + "     ");
        }
        System.out.println();
    }

    // Prints each character next to its value cast to a byte
    // (only meaningful for characters in the ISO8859-1 range).
    public static void print(String str) throws Exception {
        for (int i = 0; i < str.length(); i++) {
            System.out.print(str.charAt(i) + " " + ((byte) str.charAt(i)) + "     ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {

        String chinese = "abc中文"; // the string containing Chinese characters

        /*byte[] unicodes = chinese.getBytes("UNICODE");
        System.out.println(unicodes.length);
        print(unicodes);
        System.out.println(Byte2HexUtil.bytes2hex03(unicodes));*/

        // Get the byte array using a Chinese encoding
        // (ASCII and ISO8859-1 cannot represent Chinese characters).
        byte[] bg2312 = chinese.getBytes("GB2312");
        // Different encodings yield byte arrays of different lengths.
        System.out.println(bg2312.length);
        print(bg2312);
        // Turn the byte array into an ISO8859-1 string, a form that can be transmitted.
        String sender = new String(bg2312, "ISO8859-1");
        System.out.println("Data sent: " + sender);
        // What is really transmitted is the encoded byte array.
        System.out.println(Byte2HexUtil.bytes2hex03(bg2312));

        String receive = sender; // the received data
        System.out.println("Data received: " + receive);
        print(receive);
        // Recover the original bytes from the received string.
        byte[] receiveBytes = receive.getBytes("ISO8859-1");

        System.out.println(Byte2HexUtil.bytes2hex03(receiveBytes));

        // Decode with the same charset the sender used to restore the text.
        System.out.println(new String(receiveBytes, "GB2312"));
        System.out.println(new String(receiveBytes, "GB2312").length());

    }

}

Test results:

7
97 a     98 b     99 c     -42 Ö     -48 Ð     -50 Î     -60 Ä     
Data sent: abcÖÐÎÄ
616263d6d0cec4
Data received: abcÖÐÎÄ
a 97     b 98     c 99     Ö -42     Ð -48     Î -50     Ä -60     
616263d6d0cec4
abc中文
5

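The code makes it clear that the string "abc中文" can be converted, before sending, into byte arrays in different formats such as Unicode, GB2312, or UTF-8. For example, byte[] bg2312 = chinese.getBytes("GB2312"); encodes it with GB2312. Converting the resulting byte array to a hexadecimal string gives "616263d6d0cec4", in which each English letter takes 1 byte and each Chinese character takes 2 bytes. Under the GB2312 specification: 61 is a, 62 is b, 63 is c, d6d0 is 中, and cec4 is 文. Other encodings work the same way, only with different byte values. Next we convert the byte array into an ISO8859-1 string that can be transmitted over the network, using String sender = new String(bg2312,"ISO8859-1");, which gives abcÖÐÎÄ. This string could already be sent as it is. In real transmission, however, a Base64 transformation is applied on top of it to avoid special characters. On the receiving side, applying the inverse transformations to the received string restores the Chinese text.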
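To see how the byte values and lengths change from one encoding to another, the following minimal sketch encodes the same string three ways and prints each result as hex. The class name `EncodingSizeSketch` and the helper `encodeToHex` are invented for illustration. UTF-16BE is used here as one concrete byte form of "Unicode", which is my choice and not something the original code fixes.

```java
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of the original article's code.
class EncodingSizeSketch {

    // Encodes text with the named charset and returns the bytes as lowercase hex.
    static String encodeToHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The article's sample string: "abc" followed by two Chinese characters.
        String chinese = "abc\u4e2d\u6587";
        for (String name : new String[] {"GB2312", "UTF-8", "UTF-16BE"}) {
            String hex = encodeToHex(chinese, name);
            System.out.println(name + ": " + hex.length() / 2 + " bytes, " + hex);
        }
        // GB2312: 7 bytes, 616263d6d0cec4
        // UTF-8: 9 bytes, 616263e4b8ade69687
        // UTF-16BE: 10 bytes, 0061006200634e2d6587
    }
}
```

The three ASCII letters are identical in GB2312 and UTF-8; only the two Chinese characters differ, taking 2 bytes each in GB2312 and 3 bytes each in UTF-8.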

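Hopefully encoding and decoding make more sense now. At bottom it really is this simple: garbled text is produced somewhere along this chain of transformations, so I encourage readers to analyze carefully where in the chain it arises whenever they run into it.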
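As a starting point for that kind of analysis, here is a minimal sketch of one common way garbled text arises: the receiver decodes the bytes with a different charset than the sender used. It also shows why the ISO8859-1 detour is harmless while a wrong UTF-8 decode is not. The class name `MojibakeSketch` and its methods are invented for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical illustration class; not part of the original article's code.
class MojibakeSketch {

    // Decodes the received bytes with the named charset.
    static String decode(byte[] wire, String charsetName) {
        return new String(wire, Charset.forName(charsetName));
    }

    // True if decoding and then re-encoding with the same charset
    // gives back exactly the bytes that were received.
    static boolean survivesRoundTrip(byte[] wire, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return Arrays.equals(new String(wire, cs).getBytes(cs), wire);
    }

    public static void main(String[] args) {
        String chinese = "abc\u4e2d\u6587";
        byte[] wire = chinese.getBytes(Charset.forName("GB2312"));

        // Same charset on both sides: the text comes back intact.
        System.out.println(decode(wire, "GB2312").equals(chinese));       // true
        // Wrong charset: invalid sequences become U+FFFD replacement characters.
        System.out.println(decode(wire, "UTF-8").indexOf('\uFFFD') >= 0); // true
        // That damage is permanent: the original bytes cannot be rebuilt.
        System.out.println(survivesRoundTrip(wire, "UTF-8"));             // false
        // ISO8859-1 looks garbled on screen but keeps every byte.
        System.out.println(survivesRoundTrip(wire, "ISO-8859-1"));        // true
    }
}
```

When debugging, checking whether the bytes survive a decode and re-encode with the suspected charset is a quick way to tell a recoverable display problem from data that has already been corrupted.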
