UTF-8's format

Date: 2023-12-21 12:09:08

A few good blog posts

古腾龙's blog: Encoding rules (UTF-8, GBK)

GBK 千千秀字

shell set

'man ascii' shows the ASCII table, and 'man utf-8' shows the UTF-8 manual page.

Unicode is a character set designed to include every character in the world's writing systems. It only defines which characters are included and assigns each one a number (its code point). It does not define how those characters are represented as bytes in a computer.

UTF-8 is one implementation (encoding) of Unicode. It was designed in 1992 by Ken Thompson and Rob Pike. (Thompson and Dennis Ritchie created UNIX together, and Ritchie also created the C language.)

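In Java, Unicode text is stored as 'char' values. A 'char' is 2 bytes and ranges from 0 to 0xFFFF; a character above U+FFFF takes two 'char' values (a surrogate pair).

In UTF-8, by contrast, different characters take different numbers of bytes.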
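To make the difference concrete, here is a small sketch that prints, for a few sample characters, how many 'char' units Java uses and how many bytes the UTF-8 encoding takes. The class name Utf8Lengths and the helper utf8Length are made up for this illustration; String.getBytes and StandardCharsets are standard library calls.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        String[] samples = {
            "A",            // U+0041, ASCII
            "\u00e9",       // U+00E9, e with acute accent
            "\u4e2d",       // U+4E2D, a CJK character
            "\uD83D\uDE00"  // U+1F600, above U+FFFF (surrogate pair)
        };
        for (String s : samples) {
            System.out.printf("U+%04X: %d char(s), %d UTF-8 byte(s)%n",
                    s.codePointAt(0), s.length(), utf8Length(s));
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, while Java needs one 'char' for each of the first three and two for the last.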

Unicode range (hex)   UTF-8 bytes (binary)
0000-007F             0xxx xxxx
0080-07FF             110x xxxx   10xx xxxx
0800-FFFF             1110 xxxx   10xx xxxx   10xx xxxx
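 n | Unicode range (hex)   | UTF-8 encoding (binary)
---+-----------------------+------------------------------------------------------
 1 | 0000 0000 - 0000 007F |                                              0xxxxxxx
 2 | 0000 0080 - 0000 07FF |                                     110xxxxx 10xxxxxx
 3 | 0000 0800 - 0000 FFFF |                            1110xxxx 10xxxxxx 10xxxxxx
 4 | 0001 0000 - 001F FFFF |                   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 5 | 0020 0000 - 03FF FFFF |          111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 6 | 0400 0000 - 7FFF FFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Here n is the number of bytes. The 5- and 6-byte forms belong to the original design; since RFC 3629, UTF-8 stops at U+10FFFF, so only 1 to 4 bytes are used in practice.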
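The bit patterns in the table can be filled in mechanically: pick the row by range, then copy the bits of the code point into the x positions from right to left, six bits per continuation byte. The following is a minimal sketch of that idea for the 1- to 4-byte rows only. The class Utf8Encode and its methods encode and hex are hypothetical names for this example, and the sketch does not reject surrogate code points (D800-DFFF), which a strict encoder would.

```java
public class Utf8Encode {
    // Encode one code point (0..0x10FFFF) by filling the x bits of the table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Format bytes as space-separated upper-case hex, e.g. "E4 B8 AD".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x4E2D))); // prints "E4 B8 AD"
    }
}
```

For U+4E2D (binary 0100 1110 0010 1101) the third row applies: 0100 goes into 1110xxxx, 111000 and 101101 go into the two 10xxxxxx bytes, giving E4 B8 AD.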

In Java, an 'xxxxReader' always handles text (characters) and an 'xxxxStream' always handles binary data (bytes). First, we use the text output class 'PrintWriter' to write a file. Then we use 'FileInputStream' to read the file back as raw bytes. Our task is to convert those bytes into Unicode ourselves. If what we read matches what we wrote, we can be confident that we understand the UTF-8 format.
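The experiment described above might look like the sketch below. It writes a string with 'PrintWriter' (asking for UTF-8 explicitly), reads the raw bytes with 'FileInputStream', decodes them by hand, and compares the result with the original. The class Utf8RoundTrip, the methods decode and roundTrip, and the temp-file prefix are all invented for this example. The decoder assumes well-formed input with 1- to 4-byte sequences and does no error checking.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

public class Utf8RoundTrip {
    // Decode the first len bytes of well-formed UTF-8 into a String.
    static String decode(byte[] bytes, int len) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < len) {
            int b = bytes[i++] & 0xFF;
            int cp, extra;                       // payload bits, continuation count
            if (b < 0x80)      { cp = b;        extra = 0; }  // 0xxxxxxx
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  // 110xxxxx
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  // 1110xxxx
            else               { cp = b & 0x07; extra = 3; }  // 11110xxx
            for (int k = 0; k < extra; k++) {
                cp = (cp << 6) | (bytes[i++] & 0x3F);         // 10xxxxxx
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    // Write text as UTF-8 to a temp file, read it back as bytes, decode by hand.
    static String roundTrip(String text) {
        try {
            File f = File.createTempFile("utf8-demo", ".txt");
            f.deleteOnExit();
            try (PrintWriter out = new PrintWriter(f, "UTF-8")) {
                out.print(text);
            }
            byte[] buf = new byte[(int) f.length()];
            int len = 0;
            try (FileInputStream in = new FileInputStream(f)) {
                int n;
                while (len < buf.length
                        && (n = in.read(buf, len, buf.length - len)) != -1) {
                    len += n;
                }
            }
            return decode(buf, len);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "A\u00e9\u4e2d\uD83D\uDE00";
        System.out.println(text.equals(roundTrip(text))); // prints "true"
    }
}
```

The sample string contains one character from each of the 1-, 2-, 3- and 4-byte ranges, so a match exercises every branch of the decoder.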




We can use Java's standard library to convert a GBK file to Unicode:
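import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;

class uni {
    public static void main(String[] args) throws Exception {
        // Strip the extension (if any) to build the output file name.
        int dot = args[0].lastIndexOf('.');
        String name = dot == -1 ? args[0] : args[0].substring(0, dot);
        // Decode the input as GBK; encode the output as UTF-8 explicitly
        // rather than relying on the platform default charset.
        try (InputStreamReader cin = new InputStreamReader(
                     new FileInputStream(new File(args[0])), "GBK");
             PrintWriter cout = new PrintWriter(
                     new File(name + "-unicode.txt"), "UTF-8")) {
            char[] buf = new char[100];
            int n;
            while ((n = cin.read(buf)) != -1) {
                // Write only the n chars actually read, not the whole buffer.
                cout.write(buf, 0, n);
            }
        }
    }
}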