Java charset detector

https://code.google.com/p/juniversalchardet/downloads/list

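juniversalchardet is a Java port of Mozilla's automatic encoding-detection library (the original source is C++), and it has high detection accuracy.

Check out a read-only copy of the code with svn: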

# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout http://juniversalchardet.googlecode.com/svn/trunk/ juniversalchardet-read-only

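package myjava;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {

    public static void main(String[] args) throws IOException {
        String folder = "/home/hadoop/test/charset/";
        File[] files = new File(folder).listFiles();
        if (files == null) {
            // listFiles() returns null when the path is not a readable directory.
            System.err.println("Not a readable directory: " + folder);
            return;
        }
        for (File file : files) {
            // Skip subdirectories; FileInputStream cannot open them.
            if (file.isFile()) {
                detectCharset(file.getAbsolutePath());
            }
        }
    }

    static void detectCharset(String fileName) throws IOException {
        byte[] buf = new byte[4096];

        // (1) Construct the detector; null means no listener is registered.
        UniversalDetector detector = new UniversalDetector(null);

        // (2) Feed data until the file ends or the detector has decided.
        //     try-with-resources closes the stream even if reading fails.
        try (InputStream fis = new FileInputStream(fileName)) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }

        // (3) Tell the detector that no more data is coming.
        detector.dataEnd();

        // (4) Read the result; null means no encoding could be determined.
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector so the instance can be reused.
        detector.reset();
    }
}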

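It can be combined with another Java charset-detection library to get more reliable results, because for short texts the detection method above may fail to reach a conclusion.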

Also, because this algorithm comes from Mozilla, it should work better on markup files such as HTML.

http://cpdetector.sourceforge.net/usage.shtml
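
Since detection on short input may return no result, the calling code needs some fallback before it decodes the bytes. Below is a minimal sketch of one way to do this using only the JDK. The class name CharsetFallback and the methods resolve and decode are hypothetical helpers for illustration and are not part of juniversalchardet or cpdetector. The detectedName argument stands for whatever getDetectedCharset() returned, possibly null.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of any detection library): turns a
// detector result that may be null into a usable Charset.
final class CharsetFallback {

    // Returns the charset named by detectedName, or fallback when the
    // detector gave no answer or the JVM does not support that name.
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback; // the detector reached no conclusion
        }
        try {
            return Charset.forName(detectedName);
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // are both subclasses of IllegalArgumentException.
            return fallback;
        }
    }

    // Decodes the bytes with the detected charset, or with the fallback.
    static String decode(byte[] data, String detectedName, Charset fallback) {
        return new String(data, resolve(detectedName, fallback));
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // Simulate a short input for which detection returned null.
        System.out.println(decode(bytes, null, StandardCharsets.UTF_8));
        // Simulate a name this JVM does not know.
        System.out.println(resolve("no-such-charset", StandardCharsets.UTF_8));
    }
}
```

In the program above, the value of encoding from step (4) would be passed as detectedName. The choice of UTF-8 as the fallback is an assumption; a second detector's result could be tried first in the same way.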
