Regex pattern matching performance in Java for long strings

I have a regex that works great (about 500 nanoseconds) when a match is found, but takes a long time (over 3 seconds) when there is no match. I suspect this could be because of backtracking. I tried some options, like converting .* to (.*)? based on some documentation, but it didn't help.

Input: a very long string - 5k chars in some cases.

Regex to match: .*substring1.*substring2.*

I am pre-compiling the pattern and re-using the matcher, what else can I try?

Here's my code snippet - I will be calling this method with millions of different input strings, but just a handful of regex patterns.

private static HashMap<String, Pattern> patternMap = new HashMap<String, Pattern>();
private static HashMap<String, Matcher> matcherMap = new HashMap<String, Matcher>();

Here's my method:

public static Boolean regex_match(String line, String regex) {
    if (regex == null || line == null) {
      return null;
    }
    if (!patternMap.containsKey(regex)) {
      patternMap.put(regex, Pattern.compile(regex));
      matcherMap.put(regex,patternMap.get(regex).matcher(""));
    }
    return matcherMap.get(regex).reset(line).find(0);
 }

4 solutions

#1

Your regex is subject to a problem known as catastrophic backtracking, as you hinted at. Essentially, the first .* will match the entire string, and then backtrack until substring1 matches. This will repeat with substring2. Because substring2 fails, the second .* will need to find another place where substring2 begins to match, and then it will fail again. Each time substring1 matches, we need to check every single place that substring2 might match.

You are already using Matcher.find(), which looks for a match anywhere in the input, so you can omit the leading and trailing .*. Then, changing the inner .* to .*? could improve performance by turning the greedy quantifier into a lazy one.

This produces: substring1.*?substring2
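
As a rough illustration of the trimmed pattern, here is a minimal sketch that compiles substring1.*?substring2 once and checks inputs with find(). The class and method names (LazyFindSketch, containsInOrder) are made up for this example and are not from the question; as in the original regex, . does not match line terminators unless you compile with Pattern.DOTALL.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustrating the trimmed pattern.
final class LazyFindSketch {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern ORDERED =
            Pattern.compile("substring1.*?substring2");

    // True if substring1 occurs and substring2 occurs somewhere after it.
    static boolean containsInOrder(String line) {
        // find() searches anywhere in the input, so no leading/trailing .* is needed.
        return ORDERED.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(containsInOrder("xx substring1 yy substring2 zz")); // true
        System.out.println(containsInOrder("xx substring2 yy substring1 zz")); // false
    }
}
```

Creating a fresh Matcher per call keeps the sketch safe to call from several threads; reusing one Matcher through reset(), as in the question, avoids that allocation but is only safe from a single thread.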

#2

You can verify that the pattern will match if you use indexOf():

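int pos1 = str.indexOf("substring1");
// Start after the end of substring1 so the two matches cannot overlap, as in the regex.
int pos2 = str.indexOf("substring2", pos1 + "substring1".length());

if (pos1 != -1 && pos2 != -1) {
  // regex
}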

When the regex doesn't match, you will get catastrophic backtracking. In fact, your pattern is likely doing a lot of backtracking even when there is a match. The .* will eat up the entire string, and then needs to go backwards, reluctantly giving back characters.

If your string looks like: substring1 substring2........50000 more characters......, then you will get better performance with the lazy .*?. Please note that (.*)? is NOT the same as .*?.

The performance of the regex will vary depending on what the substrings are, and what they're matched against. If your string looks like: substring1........50000 more characters...... substring2, then you will get better performance with the .* that you have.

#3

String.indexOf() is much faster than a regex, so use it if your case is simple enough. You could recode your problem as:

public static boolean containsStrings(String source, String string1, String string2) {
  // indexOf returns an int; -1 means "not found"
  int pos1 = source.indexOf(string1);
  if (pos1 > -1) {
    // look for string2 only after the end of the first string1 match
    int pos2 = source.indexOf(string2, pos1 + string1.length());
    return pos2 > -1;
  }
  return false;
}

Note that my solution does not deal with the case where string2 is contained in string1; if that can happen, you'll need to add it to the logic.
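
If you ever need more than two substrings, the same indexOf idea extends naturally. The following is only a sketch under that assumption; the class and method names (OrderedContainsSketch, containsInOrder) are hypothetical. It takes the earliest occurrence of each part and continues searching after it, which leaves the most room for the remaining parts.

```java
// Hypothetical generalization of the two-substring check to any number of parts.
final class OrderedContainsSketch {
    // True if every part occurs in source, in the given order, without overlapping.
    static boolean containsInOrder(String source, String... parts) {
        int from = 0;
        for (String part : parts) {
            int pos = source.indexOf(part, from);
            if (pos < 0) {
                return false; // this part does not occur after the previous one
            }
            from = pos + part.length(); // continue after the end of this match
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(containsInOrder("xx substring1 yy substring2", "substring1", "substring2")); // true
        System.out.println(containsInOrder("xx substring2 yy substring1", "substring1", "substring2")); // false
    }
}
```

Each indexOf call scans forward from where the previous one stopped, so the input is traversed at most once per call and there is no backtracking in the no-match case.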

#4

^((?!substring1).)*substring1((?!substring2).)*substring2.*?\Z

This should do it, because a string that contains one of the substrings multiple times, but not both in order, won't backtrack ad nauseam. You can drop the .*?\Z at the end if you don't need the match to extend to the end of the input.
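
To try this pattern with arbitrary substrings, something like the sketch below might work. It is not part of the answer above: the names (TemperedPatternSketch, inOrder) are hypothetical, it assumes the substrings should be treated as literals (hence Pattern.quote), it uses non-capturing groups instead of the capturing ones shown, and it drops the trailing .*?\Z because it is meant to be used with find(). It would be worth timing it on your longest inputs before relying on it.

```java
import java.util.regex.Pattern;

// Hypothetical builder for the "anything that is not the start of X" pattern.
final class TemperedPatternSketch {
    // Builds ^(?:(?!first).)*first(?:(?!second).)*second with both parts quoted as literals.
    static Pattern inOrder(String first, String second) {
        String a = Pattern.quote(first);
        String b = Pattern.quote(second);
        return Pattern.compile(
                "^(?:(?!" + a + ").)*" + a + "(?:(?!" + b + ").)*" + b);
    }

    public static void main(String[] args) {
        Pattern p = inOrder("substring1", "substring2");
        System.out.println(p.matcher("xx substring1 yy substring2 zz").find()); // true
        System.out.println(p.matcher("substring1 substring1 substring1").find()); // false
    }
}
```

Because each repeated group refuses to step over the start of the substring it is looking for, only the first occurrence of each substring is ever tried, which is the same strategy as the indexOf approach.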
