Scanner vs. StringTokenizer vs. String.Split

Time: 2022-09-10 22:26:41

I just learned about Java's Scanner class and now I'm wondering how it compares/competes with StringTokenizer and String.split. I know that StringTokenizer and String.split only work on Strings, so why would I want to use Scanner for a String? Is Scanner just intended to be one-stop shopping for splitting?

10 answers

#1


219  

They're essentially horses for courses.

  • Scanner is designed for cases where you need to parse a string, pulling out data of different types. It's very flexible, but arguably doesn't give you the simplest API for simply getting an array of strings delimited by a particular expression.
  • String.split() and Pattern.split() give you an easy syntax for doing the latter, but that's essentially all that they do. If you want to parse the resulting strings, or change the delimiter halfway through depending on a particular token, they won't help you with that.
  • StringTokenizer is even more restrictive than String.split(), and also a bit fiddlier to use. It is essentially designed for pulling out tokens delimited by a fixed set of delimiter characters. Because of this restriction, it's about twice as fast as String.split(). (See my comparison of String.split() and StringTokenizer.) It also predates the regular expressions API, of which String.split() is a part.

You'll note from my timings that String.split() can still tokenize thousands of strings in a few milliseconds on a typical machine. In addition, it has the advantage over StringTokenizer that it gives you the output as a string array, which is usually what you want. Using an Enumeration, as provided by StringTokenizer, is too "syntactically fussy" most of the time. From this point of view, StringTokenizer is a bit of a waste of space nowadays, and you may as well just use String.split().
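
To make the Scanner point in the list above concrete, here is a minimal sketch of pulling differently typed values out of one string and switching the delimiter partway through. The record layout ("name count price tag,tag,...") and the names `ScannerMixedTypes` and `describe` are made up for illustration; they are not from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerMixedTypes {
    // Parses a hypothetical record such as "widget 3 2.5 red,green,blue".
    static String describe(String record) {
        // Locale.US so that "2.5" is read with '.' as the decimal separator.
        try (Scanner sc = new Scanner(record).useLocale(Locale.US)) {
            String name = sc.next();        // default delimiter is whitespace
            int count = sc.nextInt();
            double price = sc.nextDouble();

            // Switch delimiters midway: the rest is separated by commas and/or spaces.
            sc.useDelimiter("[,\\s]+");
            List<String> tags = new ArrayList<>();
            while (sc.hasNext()) {
                tags.add(sc.next());
            }
            return name + " x" + count + " @" + price + " " + tags;
        }
    }

    public static void main(String[] args) {
        // prints: widget x3 @2.5 [red, green, blue]
        System.out.println(describe("widget 3 2.5 red,green,blue"));
    }
}
```

Doing the same with split() would mean splitting twice and calling Integer.parseInt / Double.parseDouble yourself, which is fine for simple cases but gets awkward as the format grows.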

#2


54  

Let's start by eliminating StringTokenizer. It is getting old and doesn't even support regular expressions. Its documentation states:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

So let's throw it out right away. That leaves split() and Scanner. What's the difference between them?

For one thing, split() simply returns an array, which makes it easy to use a foreach loop:

for (String token : input.split("\\s+")) { ... }

Scanner is built more like a stream:

while (myScanner.hasNext()) {
    String token = myScanner.next();
    ...
}

or

while (myScanner.hasNextDouble()) {
    double token = myScanner.nextDouble();
    ...
}

(It has a rather large API, so don't think that it's always restricted to such simple things.)

This stream-style interface can be useful for parsing simple text files or console input, when you don't have (or can't get) all the input before starting to parse.
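
As a rough sketch of that stream-style use, the method below reads numbers from any Readable as they arrive and stops at the first token that is not a number. A StringReader stands in for a file or console here so the example is self-contained; the class and method names are just placeholders.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Scanner;

class ScannerStreamSum {
    // Sums numeric tokens from the source until a non-numeric token or end of input.
    static double sum(Readable source) {
        double total = 0;
        // Locale.US so that '.' is treated as the decimal separator.
        try (Scanner sc = new Scanner(source).useLocale(Locale.US)) {
            while (sc.hasNextDouble()) {
                total += sc.nextDouble();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // A FileReader or InputStreamReader(System.in) could be passed instead.
        System.out.println(sum(new StringReader("1.5 2 3.25\n4"))); // prints 10.75
    }
}
```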

Personally, the only time I can remember using Scanner is for school projects, when I had to get user input from the command line. It makes that sort of operation easy. But if I have a String that I want to split up, it's almost a no-brainer to go with split().

#3


9  

StringTokenizer was always there. It is the fastest of all, but the enumeration-like idiom might not look as elegant as the others.

split came into existence in JDK 1.4. It is slower than the tokenizer but easier to use, since it is callable from the String class.

Scanner arrived in JDK 1.5. It is the most flexible and fills a long-standing gap in the Java API by providing an equivalent of C's famous scanf function family.

#4


6  

Split is slow, but not as slow as Scanner. StringTokenizer is faster than split. However, I found that I could roughly double the speed again by trading away some flexibility, which I did in JFastParser: https://github.com/hughperkins/jfastparser

Testing on a string containing one million doubles:

Scanner: 10642 ms
Split: 715 ms
StringTokenizer: 544 ms
JFastParser: 290 ms

#5


5  

If you have a String object you want to tokenize, favor using String's split method over a StringTokenizer. If you're parsing text data from a source outside your program, like from a file, or from the user, that's where a Scanner comes in handy.

#6


4  

I recently did some experiments about the bad performance of String.split() in highly performance sensitive situations. You may find this useful.

http://eblog.chrononsystems.com/hidden-evils-of-javas-stringsplit-and-stringr

The gist is that String.split() compiles a Regular Expression pattern each time and can thus slow down your program, compared to if you use a precompiled Pattern object and use it directly to operate on a String.
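
A minimal sketch of the precompiled approach, assuming a whitespace delimiter (the class and method names are only illustrative): the Pattern is compiled once and reused for every call.

```java
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once; Pattern instances are immutable and safe to share between threads.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String[] tokens(String line) {
        // Same result as line.split("\\s+"), without recompiling the regex each time.
        return WHITESPACE.split(line);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"a b  c", "d\te"}) {
            System.out.println(String.join("|", tokens(line)));
        }
    }
}
```

Newer JDKs appear to skip regex compilation in String.split() for simple single-character delimiters, so it is worth measuring on your own runtime before assuming the difference matters.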

#7


3  

String.split seems to be much slower than StringTokenizer. The only advantages of split are that you get an array of the tokens and that you can use any regular expression as the delimiter. org.apache.commons.lang.StringUtils has a split method which works much faster than either StringTokenizer or String.split. But the CPU utilization for all three is nearly the same, so we also need a method which is less CPU intensive, which I have not yet been able to find.

#8


1  

For the default scenarios I would suggest Pattern.split() as well, but if you need maximum performance (especially on Android, where all the solutions I tested are quite slow) and you only need to split on a single char, I now use my own method:

public static ArrayList<String> splitBySingleChar(final char[] s,
        final char splitChar) {
    final ArrayList<String> result = new ArrayList<String>();
    final int length = s.length;
    int offset = 0;
    int count = 0;
    for (int i = 0; i < length; i++) {
        if (s[i] == splitChar) {
            if (count > 0) {
                result.add(new String(s, offset, count));
            }
            offset = i + 1;
            count = 0;
        } else {
            count++;
        }
    }
    if (count > 0) {
        result.add(new String(s, offset, count));
    }
    return result;
}

Use "abc".toCharArray() to get the char array for a String. For example:

String s = "     a bb   ccc  dddd eeeee  ffffff    ggggggg ";
ArrayList<String> result = splitBySingleChar(s.toCharArray(), ' ');

#9


1  

One important difference is that both String.split() and Scanner can produce empty strings, but StringTokenizer never does.

For example:

String str = "ab cd  ef";

StringTokenizer st = new StringTokenizer(str, " ");
for (int i = 0; st.hasMoreTokens(); i++) System.out.println("#" + i + ": " + st.nextToken());

String[] split = str.split(" ");
for (int i = 0; i < split.length; i++) System.out.println("#" + i + ": " + split[i]);

Scanner sc = new Scanner(str).useDelimiter(" ");
for (int i = 0; sc.hasNext(); i++) System.out.println("#" + i + ": " + sc.next());

Output:

//StringTokenizer
#0: ab
#1: cd
#2: ef
//String.split()
#0: ab
#1: cd
#2: 
#3: ef
//Scanner
#0: ab
#1: cd
#2: 
#3: ef

This is because the delimiter for String.split() and Scanner.useDelimiter() is not just a string, but a regular expression. We can replace the delimiter " " with " +" in the example above to make them behave like StringTokenizer.
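
Here is a small sketch of that replacement, using the same input string as the example above. The class and helper names are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class CollapseDelimiters {
    // " +" matches a run of one or more spaces, so no empty token appears between them.
    static List<String> viaSplit(String s) {
        return Arrays.asList(s.split(" +"));
    }

    static List<String> viaScanner(String s) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(s).useDelimiter(" +")) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "ab cd  ef";
        System.out.println(viaSplit(str));   // prints [ab, cd, ef]
        System.out.println(viaScanner(str)); // prints [ab, cd, ef]
    }
}
```

One caveat: if the input starts with a delimiter, split() still returns a leading empty string, whereas Scanner and StringTokenizer skip it, so the three are not identical in every case.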

#10


-6  

String.split() works very well but has its own limits: if you want to split a string like the ones shown below on a single or double pipe (|) symbol, it doesn't work. In this situation you can use StringTokenizer.

ABC|IJK

ABC | IJK
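
For what it's worth, the pipe is a regex metacharacter (alternation), so passing "|" unescaped to split() does not split on the pipe as intended. Escaping it does work. The sketch below assumes you also want to swallow surrounding spaces and treat a double pipe as one separator; the regex and the names are my own choice for illustration, not part of the answer above.

```java
import java.util.Arrays;

class PipeSplit {
    // \\|+ matches one or more literal pipes; \\s* absorbs optional spaces around them.
    static String[] fields(String s) {
        return s.split("\\s*\\|+\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("ABC|IJK")));   // prints [ABC, IJK]
        System.out.println(Arrays.toString(fields("ABC | IJK"))); // prints [ABC, IJK]
        System.out.println(Arrays.toString(fields("ABC||IJK")));  // prints [ABC, IJK]
    }
}
```

If no regex features are needed at all, Pattern.quote("|") produces a safely escaped literal delimiter.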
