https://code.google.com/p/juniversalchardet/downloads/list
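juniversalchardet is a Java port of Mozilla's automatic character-encoding detection library (the original is written in C++). Its detection accuracy is high.
Check out a read-only copy of the source with svn: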
# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout http://juniversalchardet.googlecode.com/svn/trunk/ juniversalchardet-read-only
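package myjava;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {

    public static void main(String[] args) throws IOException {
        String folder = "/home/hadoop/test/charset/";
        File[] files = new File(folder).listFiles();
        if (files == null) {
            // listFiles() returns null when the path is not a readable directory.
            System.err.println("Not a readable directory: " + folder);
            return;
        }
        for (File file : files) {
            if (file.isFile()) { // skip subdirectories, which cannot be opened as streams
                detectCharset(file.getAbsolutePath());
            }
        }
    }

    static void detectCharset(String fileName) throws IOException {
        byte[] buf = new byte[4096];
        // (1) Construct the detector; null means no listener callback is used.
        UniversalDetector detector = new UniversalDetector(null);
        // (2) Feed data until the detector is done or the input ends.
        //     try-with-resources closes the stream even if reading fails.
        try (FileInputStream fis = new FileInputStream(fileName)) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        // (3) Tell the detector that no more data will follow.
        detector.dataEnd();
        // (4) Read the result; null means no encoding could be determined.
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }
        // (5) Reset the detector before it is reused.
        detector.reset();
    }
}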
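For better results, this can be combined with a second Java charset-detection library, because on short texts the detection above may not reach a conclusion (getDetectedCharset() returns null).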
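As a rough illustration of such a fallback, the sketch below uses only the JDK instead of a second detection library: when the primary detector gives no answer, it tries a caller-supplied list of candidate charsets and keeps the first one that decodes the bytes without errors. The class name CharsetFallback, the methods decodesCleanly and choose, and the candidate list are all hypothetical names made up for this example; the detector's result is passed in as a plain string (or null), so the sketch does not depend on juniversalchardet itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;

// Hypothetical helper, not part of juniversalchardet.
final class CharsetFallback {

    /** Returns true if the bytes decode under the charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * primaryGuess is the name returned by the primary detector, or null if it
     * reached no conclusion. Returns the chosen charset, or null if nothing fits.
     */
    static Charset choose(String primaryGuess, byte[] data, List<Charset> candidates) {
        // Trust the primary detector when it names a charset this JVM supports.
        if (primaryGuess != null && Charset.isSupported(primaryGuess)) {
            return Charset.forName(primaryGuess);
        }
        // Otherwise take the first candidate that decodes without errors.
        for (Charset candidate : candidates) {
            if (decodesCleanly(data, candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

A call might look like CharsetFallback.choose(detector.getDetectedCharset(), bytes, Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)). The order of candidates matters: single-byte charsets such as ISO-8859-1 accept any byte sequence, so they belong at the end of the list. Strict decoding only shows that a charset is possible, not that it is the right one.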
Because the algorithm comes from Mozilla, it should also work well on markup files such as HTML.
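For HTML input specifically, a cheap cross-check is to look for a charset declared in a meta tag and compare it with the detector's answer. The sketch below is an assumption-laden illustration, not something juniversalchardet provides: the class HtmlCharsetHint, the method declaredCharset, and the regular expression are hypothetical, and the pattern is much simpler than the prescan a real browser performs.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of juniversalchardet.
final class HtmlCharsetHint {

    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in the given leading bytes, or null if none is found. */
    static String declaredCharset(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // stays readable whatever the real (ASCII-compatible) encoding is.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

Passing the first few kilobytes of the file is usually enough, since the declaration normally sits in the head. The approach only works for ASCII-compatible encodings (a UTF-16 page would not match), and the declared value can be wrong, so it is best treated as a hint next to the detector's result rather than a replacement for it.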