Java Regex: extract the id string based on the substring repeated in every id

Time: 2022-09-13 07:56:04

I am reading in a log file and extracting certain data contained within the file. I am able to extract the time for each line of the log file.

Now I want to extract the id "ieatrcxb4498-1". All of the ids start with the substring ieatrcxb, so I have tried to match on that prefix and return the full id string.

I have tried many different suggestions from other posts, but I have been unsuccessful with the following patterns:

(?i)\\b("ieatrcxb"(?:.+?)?)\\b
(?i)\\b\\w*"ieatrcxb"\\w*\\b"
^.*ieatrcxb.*$ 

I have also tried to extract the full id based on the string starting with i and ending in 1, as all of them do.

Line of the log file

150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node "ieatrcxb4498-1"

Code

Scanner s = new Scanner(new FileReader(new File("lock-unlock.txt")));
//Record currentRecord = null;
ArrayList<Record> list = new ArrayList<>();

while (s.hasNextLine()) {
    String line = s.nextLine();

    Record newRec = new Record();
    // newRec.time =
    newRec.time = regexChecker("([0-1]?\\d|2[0-3]):([0-5]?\\d):([0-5]?\\d)", line);

    newRec.ID = regexChecker("^.*ieatrcxb.*$", line);

    list.add(newRec);

}


public static String regexChecker(String regEx, String str2Check) {

    Pattern checkRegex = Pattern.compile(regEx);
    Matcher regexMatcher = checkRegex.matcher(str2Check);
    String regMat = "";
    // Keep the last non-empty match found in the line
    while (regexMatcher.find()) {
        if (regexMatcher.group().length() != 0) {
            regMat = regexMatcher.group();
        }
        //System.out.println("Inside the "+ regexMatcher.group().trim());
    }

    return regMat;
}

I need a simple pattern which will do this for me.

3 solutions

#1


1  

Does the ID always have the format "ieatrcxb followed by 4 digits, followed by -, followed by 1 digit"?

If that's the case, you can do:

regexChecker("ieatrcxb\\d{4}-\\d", line);

Note the {4} quantifier, which matches exactly 4 digits (\\d). If the last digit is always 1, you could also use "ieatrcxb\\d{4}-1".

If the number of digits varies, you can use "ieatrcxb\\d+-\\d+", where + means "1 or more".

You can also use the {} quantifier with the minimum and maximum number of occurrences. Example: "ieatrcxb\\d{4,6}-\\d", where {4,6} means "minimum of 4 and maximum of 6 occurrences" (that's just an example, I don't know if that's your case). This is useful if you know exactly how many digits the ID can have.

All of the above work for your case, returning ieatrcxb4498-1. Which one to use will depend on how your input varies.
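
To compare the patterns above side by side, here is a minimal, self-contained sketch. The class name IdPatternDemo and the helper firstMatch are made up for this illustration (the helper returns the first match, whereas regexChecker in the question keeps the last one); the log line is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdPatternDemo {

    // Hypothetical helper: returns the first match of regex in input, or "" if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String line = "150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node \"ieatrcxb4498-1\"";

        String[] patterns = {
            "ieatrcxb\\d{4}-\\d",    // exactly 4 digits, a dash, 1 digit
            "ieatrcxb\\d{4}-1",      // exactly 4 digits, then a literal -1
            "ieatrcxb\\d+-\\d+",     // 1 or more digits on each side of the dash
            "ieatrcxb\\d{4,6}-\\d"   // 4 to 6 digits, a dash, 1 digit
        };

        // Each pattern should print ieatrcxb4498-1 for this line.
        for (String p : patterns) {
            System.out.println(p + " -> " + firstMatch(p, line));
        }
    }
}
```

On a line without an id, firstMatch returns an empty string, the same convention regexChecker uses.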


If you want just the numbers without the ieatrcxb part (4498-1), you can use a lookbehind regex:

regexChecker("(?<=ieatrcxb)\\d{4,6}-\\d", line);

The lookbehind keeps ieatrcxb out of the match, so only 4498-1 is returned.

If you don't want the -1 either and need only 4498, you can combine this with a lookahead:

regexChecker("(?<=ieatrcxb)\\d{4,6}(?=-\\d)", line);

This returns just 4498.
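
As an alternative to lookarounds (a different technique, not the one used above), capturing groups can return the full id and its parts from a single match. This is only a sketch: the class name IdPartsDemo and the method parts are invented for the example, and it assumes ids look like ieatrcxb, digits, a dash, digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdPartsDemo {

    // Group 1 = digits before the dash, group 2 = digits after it.
    private static final Pattern ID = Pattern.compile("ieatrcxb(\\d+)-(\\d+)");

    // Hypothetical helper: returns {full id, number, suffix}, or null when the line has no id.
    static String[] parts(String line) {
        Matcher m = ID.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String line = "150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node \"ieatrcxb4498-1\"";
        String[] p = parts(line);
        System.out.println(p[0]); // ieatrcxb4498-1
        System.out.println(p[1]); // 4498
        System.out.println(p[2]); // 1
    }
}
```

Compiling the Pattern once as a constant also avoids recompiling it for every line of the log.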

#2


1  

public static void main(String[] args) {
    String line = "150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node \"ieatrcxb4498-1\"";
    String regex ="ieatrcxb.*1";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(line);
    while(m.find()){
        System.out.println(m.group());
    }
}

Or, if the ids are always quoted:

 // Note: as written, the result includes the surrounding double quotes
 String id = line.substring(line.indexOf("\""), line.lastIndexOf("\"")+1);
 System.out.println(id);
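
One thing to watch with "ieatrcxb.*1": the .* is greedy, so if more text containing a 1 follows the id, the match runs to the last 1 on the line. The sketch below shows this on a made-up line (the trailing " at 18:02:21" is an assumption, not taken from the log in the question) and compares it with a pattern that stops at the closing quote. GreedyDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {

    // Hypothetical helper: returns the first match of regex in input, or "" if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Invented line: same id as in the question, followed by extra text.
        String line = "Info: Lock VCS on node \"ieatrcxb4498-1\" at 18:02:21";

        // Greedy: runs up to the last 1 on the line.
        System.out.println(firstMatch("ieatrcxb.*1", line));    // ieatrcxb4498-1" at 18:02:21

        // Stops at the first double quote after the prefix.
        System.out.println(firstMatch("ieatrcxb[^\"]*", line)); // ieatrcxb4498-1
    }
}
```

For the exact line in the question both patterns give the same result, since nothing follows the closing quote.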

#3


0  

You are doing this the hard way. If every line of the lock-unlock.txt file has the same format as the snippet you posted, you can do the following:

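import java.io.File;
import java.nio.file.Files;
import java.util.List;
import java.util.stream.Collectors;

File logFile = new File("lock-unlock.txt");

// readAllLines throws IOException, so the enclosing method must handle or declare it
List<String> lines = Files.readAllLines(logFile.toPath());

List<Integer> ids = lines.stream()
                .filter(line -> line.contains("ieatrcxb"))
                .map(line -> line.split( "\"")[1]) //"ieatrcxb4498-1"
                .map(line -> line.replaceAll("\\D+","")) //"44981"
                .map(Integer::parseInt) // 44981
                .collect( Collectors.toList() );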

If you want the full ID rather than just the number, remove or comment out the second and third .map() calls; the result will then be a List of Strings instead of Integers.
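
A minimal sketch of that String variant, using an in-memory list instead of the file so it runs on its own. The class name IdStreamDemo, the method extractIds, and the second and third sample lines are made up for the illustration; only the first line comes from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class IdStreamDemo {

    // Hypothetical helper: keeps lines that mention the prefix and
    // returns the text between the first pair of double quotes.
    static List<String> extractIds(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("ieatrcxb"))
                .map(line -> line.split("\"")[1])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node \"ieatrcxb4498-1\"",
            "151: 2017-06-14 18:02:25 INFO  monitorinfo           :     Info: Unlock VCS on node \"ieatrcxb4499-1\"", // invented
            "152: 2017-06-14 18:02:30 INFO  monitorinfo           :     Info: nothing to do"                          // invented
        );

        System.out.println(extractIds(lines)); // [ieatrcxb4498-1, ieatrcxb4499-1]
    }
}
```

This relies on the id being the first quoted value on the line; a line that contains ieatrcxb without quotes would throw ArrayIndexOutOfBoundsException at the split step.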
