Java Scanner vs String.split() vs StringTokenizer; which should I use?

Date: 2022-09-10 22:31:27

I am currently using split() to scan through a file where each line has number of strings delimited by '~'. I read somewhere that Scanner could do a better job with a long file, performance-wise, so I thought about checking it out.

My question is: Would I have to create two instances of Scanner? That is, one to read a line and another one based on the line to get tokens for a delimiter? If I have to do so, I doubt if I would get any advantage from using it. Maybe I am missing something here?

5 Answers

#1


8  

I gathered some metrics on these in a single-threaded model, and here are the results I got.

~~~~~~~~~~~~~~~~~~Time Metrics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tokenizer  |   String.Split()   |    while+SubString  |    Scanner    |    ScannerWithCompiledPattern    ~
~   4.0 ms   |      5.1 ms        |        1.2 ms       |     0.5 ms    |                0.1 ms            ~
~   4.4 ms   |      4.8 ms        |        1.1 ms       |     0.1 ms    |                0.1 ms            ~
~   3.5 ms   |      4.7 ms        |        1.2 ms       |     0.1 ms    |                0.1 ms            ~
~   3.5 ms   |      4.7 ms        |        1.1 ms       |     0.1 ms    |                0.1 ms            ~
~   3.5 ms   |      4.7 ms        |        1.1 ms       |     0.1 ms    |                0.1 ms            ~
____________________________________________________________________________________________________________

The outcome is that Scanner gives the best performance. The same now needs to be evaluated in multithreaded mode! One of my seniors says that Tokenizer causes a CPU spike while String.split does not.

#2


6  

For processing each line you can use Scanner, and for getting the tokens from a line you can use split().

// Requires: import java.io.File; import java.io.FileNotFoundException; import java.util.Scanner;
try (Scanner scanner = new Scanner(new File(loc))) {
    while (scanner.hasNextLine()) {
        String[] tokens = scanner.nextLine().split("~");
        // do the processing for tokens here
    }
} catch (FileNotFoundException e) {
    // handle a missing input file
}

#3


5  

You can use the useDelimiter("~") method to let you iterate through the tokens on each line with hasNext()/next(), while still using hasNextLine()/nextLine() to iterate through the lines themselves.

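
To make that concrete, here is a minimal sketch of the approach (the class name, method name, and input string are illustrative, not from the question): an outer Scanner walks the lines, and a short-lived inner Scanner with useDelimiter("~") walks the tokens of each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Collects every '~'-separated token from multi-line input
    static List<String> tokenize(String data) {
        List<String> out = new ArrayList<>();
        try (Scanner lines = new Scanner(data)) {
            while (lines.hasNextLine()) {
                // A second Scanner per line keeps the '~' delimiter scoped to that line
                try (Scanner tokens = new Scanner(lines.nextLine())) {
                    tokens.useDelimiter("~");
                    while (tokens.hasNext()) {
                        out.add(tokens.next());
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a~b~c\nd~e~f"));
    }
}
```

This avoids running split()'s regex machinery once per line, at the cost of allocating a Scanner per line; whether that trade-off wins depends on how long and how numerous the lines are.
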
EDIT: If you're going to do a performance comparison, you should pre-compile the regex when you do the split() test:

// Requires: import java.io.BufferedReader; import java.util.regex.Pattern;
// bufferedReader: a BufferedReader over the input file
Pattern splitRegex = Pattern.compile("~");
String line;
while ((line = bufferedReader.readLine()) != null)
{
  String[] tokens = splitRegex.split(line);
  // etc.
}

If you use String#split(String regex), the regex will be recompiled every time. (Scanner automatically caches all regexes the first time it compiles them.) If you do that, I wouldn't expect to see much difference in performance.

#4


3  

I would say split() is fastest, and probably good enough for what you're doing. It is less flexible than Scanner, though. StringTokenizer is not formally deprecated, but it is a legacy class retained only for backwards compatibility (its own Javadoc recommends split() or java.util.regex instead), so don't use it in new code.

EDIT: You could always test both implementations to see which one is faster. I'm curious myself whether Scanner could be faster than split(). split() might be faster than Scanner for a given input size, but I can't be certain of that.

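
A rough harness for such a test might look like the following (the class name and iteration count are illustrative; wall-clock timing like this is only indicative, and a real comparison should use a benchmarking framework such as JMH to account for JIT warm-up):

```java
import java.util.regex.Pattern;

public class SplitBench {
    // Very rough wall-clock timing; good enough for a ballpark, not a benchmark
    static long timeMillis(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        String line = "a~b~c~d~e";
        int n = 200_000;
        Pattern precompiled = Pattern.compile("~");
        long plain = timeMillis(() -> { for (int i = 0; i < n; i++) line.split("~"); });
        long cached = timeMillis(() -> { for (int i = 0; i < n; i++) precompiled.split(line); });
        System.out.println("String.split: " + plain + " ms, precompiled: " + cached + " ms");
    }
}
```

Run each variant several times and discard the first few results, since the JIT compiler only optimizes hot code after warm-up.
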
#5


2  

You don't actually need a regex here, because you are splitting on a fixed string. Apache Commons Lang's StringUtils.split() splits on plain strings.

For high-volume splits, where the splitting itself is the bottleneck rather than, say, file I/O, I've found this to be up to 10 times faster than String.split(). However, I did not test it against a compiled regex.

Guava also has a Splitter, implemented in a more OO way, but I found it significantly slower than StringUtils for high-volume splits.

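
If pulling in a third-party library is not an option, the same idea can be sketched by hand with indexOf()/substring(), which is essentially the "while+SubString" row from the timings in the first answer (the class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PlainSplit {
    // Splits on a fixed single-character delimiter with no regex at all.
    // Unlike String.split(), trailing empty tokens are kept.
    static List<String> split(String s, char delim) {
        List<String> out = new ArrayList<>();
        int start = 0, idx;
        while ((idx = s.indexOf(delim, start)) >= 0) {
            out.add(s.substring(start, idx));
            start = idx + 1;
        }
        out.add(s.substring(start));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("a~b~c", '~'));
    }
}
```
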