Using Java regex, how do I check whether a string contains any of the words in a set?

Time: 2022-12-30 19:03:41

I have a set of words, say: apple, orange, pear, banana, kiwi.

I want to check whether a sentence contains any of the words listed above and, if it does, find out which word matched. How can I accomplish this with a regex?

I am currently calling String.indexOf() for each word in my set. I assume this is less efficient than regex matching?

3 Solutions

#1


47  

TL;DR: For simple substring matching, contains() is best; for matching whole words only, regular expressions are probably the better choice.

The best way to see which method is more efficient is to test it.

You can use String.contains() instead of String.indexOf() to simplify your non-regexp code.
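
As a small illustration of that simplification, the sketch below puts the two checks side by side. The class and method names are made up for this example; only String.indexOf() and String.contains() come from the JDK.

```java
class ContainsDemo {
    // The indexOf() form: a substring is present when its index is not -1.
    static boolean viaIndexOf(String sentence, String word) {
        return sentence.indexOf(word) >= 0;
    }

    // The contains() form: same answer, but the intent is easier to read.
    static boolean viaContains(String sentence, String word) {
        return sentence.contains(word);
    }

    public static void main(String[] args) {
        String sentence = "An apple is nice";
        System.out.println(sentence.indexOf("apple"));       // 3
        System.out.println(viaIndexOf(sentence, "apple"));   // true
        System.out.println(viaContains(sentence, "apple"));  // true
        System.out.println(viaContains(sentence, "kiwi"));   // false
    }
}
```

If you also need the position of the match, indexOf() is still the call to use; contains() only answers yes or no.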

To search for any one of several words, the regular expression looks like this:

apple|orange|pear|banana|kiwi

The | acts as an OR in regular expressions.

My very simple test code looks like this:

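import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestContains {

   // Returns the first word (in the set's iteration order) that occurs in the sentence as a substring, or null.
   private static String containsWord(Set<String> words,String sentence) {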
     for (String word : words) {
       if (sentence.contains(word)) {
         return word;
       }
     }

     return null;
   }

   // Returns the first text the pattern matches in the sentence, or null if there is no match.
   private static String matchesPattern(Pattern p,String sentence) {
     Matcher m = p.matcher(sentence);

     if (m.find()) {
       return m.group();
     }

     return null;
   }

   public static void main(String[] args) {
     Set<String> words = new HashSet<String>();
     words.add("apple");
     words.add("orange");
     words.add("pear");
     words.add("banana");
     words.add("kiwi");

     Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

     String noMatch = "The quick brown fox jumps over the lazy dog.";
     String startMatch = "An apple is nice";
     String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

     long start = System.currentTimeMillis();
     int iterations = 10000000;

     for (int i = 0; i < iterations; i++) {
       containsWord(words, noMatch);
       containsWord(words, startMatch);
       containsWord(words, endMatch);
     }

     System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");
     start = System.currentTimeMillis();

     for (int i = 0; i < iterations; i++) {
       matchesPattern(p,noMatch);
       matchesPattern(p,startMatch);
       matchesPattern(p,endMatch);
     }

     System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
   }
}
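
The timing loop above has no warm-up phase and discards every result, so JIT compilation may distort the numbers. Below is a hedged sketch of a slightly fairer variant: it warms both methods up first and counts the hits so the work cannot be optimised away. The class name, method names and iteration counts are assumptions made for this example, and a dedicated harness such as JMH would still be the more reliable way to measure.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FairerTiming {
    static final List<String> WORDS = Arrays.asList("apple", "orange", "pear", "banana", "kiwi");
    static final Pattern PATTERN = Pattern.compile("apple|orange|pear|banana|kiwi");

    static String viaContains(String sentence) {
        for (String word : WORDS) {
            if (sentence.contains(word)) {
                return word;
            }
        }
        return null;
    }

    static String viaRegex(String sentence) {
        Matcher m = PATTERN.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    // Applies the search to every input, `iterations` times, and returns the
    // number of non-null results so the calls have an observable effect.
    static long run(Function<String, String> search, String[] inputs, int iterations) {
        long hits = 0;
        for (int i = 0; i < iterations; i++) {
            for (String input : inputs) {
                if (search.apply(input) != null) {
                    hits++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "The quick brown fox jumps over the lazy dog.",
            "An apple is nice",
            "This is a longer sentence with the match for our fruit at the end: kiwi"
        };

        // Warm-up: give the JIT a chance to compile both code paths first.
        run(FairerTiming::viaContains, inputs, 100_000);
        run(FairerTiming::viaRegex, inputs, 100_000);

        long t0 = System.nanoTime();
        long containsHits = run(FairerTiming::viaContains, inputs, 1_000_000);
        long t1 = System.nanoTime();
        long regexHits = run(FairerTiming::viaRegex, inputs, 1_000_000);
        long t2 = System.nanoTime();

        System.out.println("Contains: " + (t1 - t0) / 1_000_000 + "ms, hits=" + containsHits);
        System.out.println("Regex:    " + (t2 - t1) / 1_000_000 + "ms, hits=" + regexHits);
    }
}
```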

The results I got were as follows:

Contains took 5962ms
Regular Expression took 63475ms

Timings will obviously vary with the number of words being searched for and the strings being searched, but in this test contains() was roughly 10 times faster than the regular expression for a simple search like this.

Using regular expressions to search for fixed strings inside another string is using a sledgehammer to crack a nut, so it should not be surprising that it is slower. Save regular expressions for cases where the patterns you want to find are more complex.

One case where you may want regular expressions is when indexOf() and contains() won't do the job because you want to match whole words only and not arbitrary substrings, e.g. you want to match pear but not spears. Regular expressions handle this case well because they have the concept of word boundaries.

In this case we'd change our pattern to:

\b(apple|orange|pear|banana|kiwi)\b

The \b matches only at a word boundary (the beginning or end of a word), and the parentheses group the alternatives together.

Note that when defining this pattern in Java code you need to escape each backslash with another backslash:

 Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");
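
To make the pear/spears difference concrete, here is a minimal sketch that uses the word-boundary pattern from above. The class name and the sample sentences are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Same pattern as above: \b pins the alternation to word boundaries.
    private static final Pattern WHOLE_WORD =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    // Returns the first listed word that appears as a whole word, or null.
    static String firstWholeWord(String sentence) {
        Matcher m = WHOLE_WORD.matcher(sentence);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstWholeWord("He threw two spears"));    // null
        System.out.println(firstWholeWord("I ate a pear today"));     // pear
        System.out.println("He threw two spears".contains("pear"));   // true
    }
}
```

Note that with this pattern the plural "pears" does not match either, since the trailing \b requires the word to end right after "pear".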

#2


7  

I don't think a regexp will do a better job in terms of performance, but you can use it as follows:

Pattern p = Pattern.compile("(apple|orange|pear)");
Matcher m = p.matcher(inputString);
while (m.find()) {
   String matched = m.group(1);
   // Do something
}
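
Since the question starts from a set of words, the pattern does not have to be hard-coded. The sketch below builds the alternation from a Set and collects every match. The class and method names are hypothetical, and it assumes a non-empty set; Pattern.quote() is used so that a word containing regex metacharacters is still matched literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordSetMatcher {
    // Joins the words with | after quoting each one, then compiles the result.
    // Assumes the set is non-empty; an empty alternation would match everywhere.
    static Pattern compile(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Collects every occurrence of any of the words, in order of appearance.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> words = new LinkedHashSet<>(Arrays.asList("apple", "orange", "pear"));
        Pattern p = compile(words);
        System.out.println(findAll(p, "apple pie and orange juice")); // [apple, orange]
    }
}
```

Like the snippet above, this matches substrings, not whole words; wrapping the alternation as \b(?:...)\b would restrict it to whole words.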

#3


3  

Here is the simplest solution I found (matching with wildcards):

boolean a = str.matches(".*\\b(wordA|wordB|wordC|wordD|wordE)\\b.*");
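
Two caveats about this one-liner are worth a short sketch. String.matches() has to match the entire string, and by default . does not match line terminators, so multi-line input can fail even when the word is present; it also compiles the pattern again on every call. A precompiled Pattern with find() avoids both. The class and method names below are made up for the example.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    private static final Pattern WORDS =
            Pattern.compile("\\b(wordA|wordB|wordC|wordD|wordE)\\b");

    // The one-liner from above: the whole string must match, hence the .* wildcards.
    static boolean viaMatches(String s) {
        return s.matches(".*\\b(wordA|wordB|wordC|wordD|wordE)\\b.*");
    }

    // find() looks for the pattern anywhere in the input, so no wildcards are needed.
    static boolean viaFind(String s) {
        return WORDS.matcher(s).find();
    }

    public static void main(String[] args) {
        String oneLine = "this has wordB inside";
        String twoLines = "line one\nwordB on line two";

        System.out.println(viaMatches(oneLine));   // true
        System.out.println(viaFind(oneLine));      // true
        System.out.println(viaMatches(twoLines));  // false: . stops at the newline
        System.out.println(viaFind(twoLines));     // true
    }
}
```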
