How can I improve the efficiency of this regex?

Time: 2021-11-15 05:46:44

I think the regex pattern I have used could be tidied up and made a little neater, but my knowledge of regular expressions is limited. I would like to scan an input file and match a series of letters and numbers that appear on separate lines.

import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;

public class App {
    public static void main(String[] args) throws Exception {

    if (args.length == 1) {

        String fileName = args[0];
        String fileContent = new Scanner(new File(fileName))
                .useDelimiter("\\Z").next();

        ArrayList<Integer> parsedContent = parseContentFromFileContent(fileContent);

        int firstInt = parsedContent.get(0);
        int secondInt = parsedContent.get(1);
        int thirdInt = parsedContent.get(2);
        int fourthInt = parsedContent.get(3);
        int fifthInt = parsedContent.get(4);

        System.out.println("First: " + firstInt);
        System.out.println("Second: " + secondInt);
        System.out.println("Third: " + thirdInt);
        System.out.println("Fourth: " + fourthInt);
        System.out.println("Fifth: " + fifthInt);

        return;
    }
  }

  public static ArrayList<Integer> parseContentFromFileContent(String fileContent) {

    ArrayList<Integer> parsedInts = new ArrayList<>();

    String pattern = "(.+?).((?:\\d*\\.)?\\d+)?\\n..((?:\\d*\\.)?\\d+)?\\n(.+?).((?:\\d*\\.)?\\d+)";
    Pattern p = Pattern.compile(pattern, Pattern.DOTALL);
    Matcher m = p.matcher(fileContent);

    if (m.matches()) {
        // Group 1: Has to match two letters
        switch (m.group(1)) {
            case "ab":
                parsedInts.add(1);
                break;
            case "cd":
                parsedInts.add(2);
                break;
            case "ef":
                parsedInts.add(3);
                break;
        }

        // Group 2: Has to match a number
        parsedInts.add(Integer.parseInt(m.group(2)));

        // Group 3: Has to match the number after "A="
        parsedInts.add(Integer.parseInt(m.group(3)));

        // Group 4: Has to match a single letter
        switch (m.group(4)) {
            case "a":
                parsedInts.add(1);
                break;
            case "b":
                parsedInts.add(2);
                break;
            case "c":
                parsedInts.add(3);
                break;
        }
        // Group 5: Has to match a number
        parsedInts.add(Integer.parseInt(m.group(5)));
    }
    return parsedInts;
  }

}

Input file:

ab-123 // Group 1 - Two letters a-z and Group 2 - Number
A=1    // Group 3 - Always A= [number]
a-1    // Group 4 - Letter a-z and Group 5 - Number

cd-1234
A=2
b-2

ef-12345
a=4
c-3

gh-123456
a=4
d-4

Is there a better (cleaner) regex pattern I could use to capture the data from the file above?

pattern = (.+?).((?:\\d*\\.)?\\d+)?\\n..((?:\\d*\\.)?\\d+)?\\n(.+?).((?:\\d*\\.)?\\d+)

2 answers

#1


2  

Your pattern at the moment isn't very precise, contrary to the description you gave. It contains a lot of .+?, but your description quite clearly says "two letters" or "always A=", so you could use that in your pattern instead. Your pattern also accounts for decimal numbers, while there are none in the shown input, so you might be able to drop (?:\\d*\\.)?. Furthermore, the first two number groups are optional, but according to your description they shouldn't be.
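
To make the imprecision concrete, here is a small sketch that feeds the pattern from the question an input whose first two lines have no numbers at all. The class and method names (LoosePatternDemo, matchesButLacksNumber) and the sample strings are made up for illustration. The input is still accepted and group 2 comes back as null, which the question's code would then pass to Integer.parseInt, resulting in a NumberFormatException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LoosePatternDemo {
    // The pattern from the question, compiled with the same DOTALL flag.
    static final Pattern LOOSE = Pattern.compile(
            "(.+?).((?:\\d*\\.)?\\d+)?\\n..((?:\\d*\\.)?\\d+)?\\n(.+?).((?:\\d*\\.)?\\d+)",
            Pattern.DOTALL);

    /** True if the whole input matches although the optional group 2 captured nothing. */
    static boolean matchesButLacksNumber(String input) {
        Matcher m = LOOSE.matcher(input);
        return m.matches() && m.group(2) == null;
    }

    public static void main(String[] args) {
        // Hypothetical input: the numbers on the first two lines are missing.
        System.out.println(matchesButLacksNumber("xy-\nA=\nz-1"));
        // A well-formed record captures group 2 as expected.
        System.out.println(matchesButLacksNumber("ab-123\nA=1\na-1"));
    }
}
```

The first call reports true (accepted, but without a number), the second false.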

If one takes your description quite literally, a possible pattern would be

([a-z]{2})-(\\d+)\\n[Aa]=(\\d+)\\n([a-z])-(\\d+)

See https://regex101.com/r/WNxUQa/1
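
As a rough sketch of how this pattern could be used in Java, the snippet below walks over the file content with find() instead of matches(), so that every record in the file is picked up rather than just one. The class name RecordParser, the parse method, and the String[] return shape are assumptions for illustration, not part of the original code. It also assumes the file uses \n line endings and does not contain the explanatory // comments shown in the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordParser {
    // Two letters, '-', number / "A=" or "a=", number / one letter, '-', number.
    private static final Pattern RECORD = Pattern.compile(
            "([a-z]{2})-(\\d+)\\n[Aa]=(\\d+)\\n([a-z])-(\\d+)");

    /** Returns one String[5] per record found: the five captured groups in order. */
    static List<String[]> parse(String content) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(content);
        while (m.find()) { // find() advances to the next record on each call
            records.add(new String[] {
                m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)});
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "ab-123\nA=1\na-1\n\ncd-1234\nA=2\nb-2\n";
        for (String[] r : parse(content)) {
            System.out.println(String.join(" ", r));
        }
        // prints:
        // ab 123 1 a 1
        // cd 1234 2 b 2
    }
}
```

Mapping the letter groups to integers, as the switch statements in the question do, can then be applied to each String[] separately.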

Note that you might have to adjust the pattern a bit (e.g. by anchoring it with ^ and $) if the input might be malformed or malicious.
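
One way the anchoring could look, as a sketch: with Pattern.MULTILINE, ^ and $ match at line boundaries, so a record surrounded by stray characters on the same line is no longer accepted. The names AnchoredRecord and containsRecord, and the noisy sample input, are hypothetical.

```java
import java.util.regex.Pattern;

class AnchoredRecord {
    static final Pattern LOOSE = Pattern.compile(
            "([a-z]{2})-(\\d+)\\n[Aa]=(\\d+)\\n([a-z])-(\\d+)");
    // Same pattern, but the record must start and end at line boundaries.
    static final Pattern ANCHORED = Pattern.compile(
            "^([a-z]{2})-(\\d+)\\n[Aa]=(\\d+)\\n([a-z])-(\\d+)$", Pattern.MULTILINE);

    /** True if the pattern finds at least one record somewhere in the content. */
    static boolean containsRecord(Pattern p, String content) {
        return p.matcher(content).find();
    }

    public static void main(String[] args) {
        String noisy = "xab-123\nA=1\na-1z"; // extra 'x' in front, extra 'z' at the end
        System.out.println(containsRecord(LOOSE, noisy));    // true
        System.out.println(containsRecord(ANCHORED, noisy)); // false
    }
}
```

Note that the anchored variant would also reject lines carrying trailing comments like the ones in the sample input, so this assumes the real file does not contain them.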

#2


0  

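There is really no such thing as optimizing a regular expression, unless it contains backtracking and you can remove it. You can optimize the way it looks, but in a DFA-based engine all regular expressions that do the same thing compile to the same DFA, or equivalent DFAs, and have the same performance. Note that java.util.regex is a backtracking engine rather than a DFA-based one, so there the amount of backtracking is what matters.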
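
As an illustration of what "removing backtracking" can mean in Java, here is a sketch of the pattern from answer #1 with possessive quantifiers (\d++), which tell the engine never to give back digits it has already consumed. The class name PossessiveRecord and the firstNumber helper are made up for this example. For input like the sample file the matches are the same, and no claim is made here about how large the speed difference is in practice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveRecord {
    // Same shape as the pattern from answer #1, but each \d++ is possessive:
    // once the digits are consumed, the engine will not retry shorter runs.
    static final Pattern POSSESSIVE = Pattern.compile(
            "([a-z]{2})-(\\d++)\\n[Aa]=(\\d++)\\n([a-z])-(\\d++)");

    /** Returns the first number of the first record, or null if no record is found. */
    static String firstNumber(String content) {
        Matcher m = POSSESSIVE.matcher(content);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("ab-123\nA=1\na-1")); // prints 123
    }
}
```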
