Extracting a substring between two specific words in Java using regex

Time: 2022-09-13 07:52:29

I would like to extract the substring between two given words using Java.

For example:


This is an important example about regex for my work.

I would like to extract everything between "an" and "for".


What I have so far is:

String sentence = "This is an important example about regex for my work and for me";
Pattern pattern = Pattern.compile("(?<=an).*.(?=for)");
Matcher matcher = pattern.matcher(sentence);

boolean found = false;
while (matcher.find()) {
    System.out.println("I found the text: " + matcher.group());
    found = true;
}
if (!found) {
    System.out.println("I didn't find the text");
}

It works well.


But I want to do two additional things:

  1. If the sentence is "This is an important example about regex for my work and for me", I want to extract only up to the first "for", i.e. "important example about regex".

  2. Sometimes I want to limit the number of words between the two delimiters to 3, i.e. "important example about".

Any ideas please?


3 solutions

#1


8  

For your first question, make the quantifier lazy. Put a question mark after the quantifier and it will match as little as possible.

(?<=an).*?(?=for)

I have no idea what the additional . at the end of .*. is good for; it is unnecessary.
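To make the difference between the greedy and the lazy quantifier concrete, here is a minimal sketch that runs both lookaround patterns against the sentence from the question. The class name LazyBetween and the helper firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyBetween {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        // Greedy: .* runs as far as it can, so the match ends before the LAST "for".
        System.out.println("[" + firstMatch("(?<=an).*(?=for)", sentence) + "]");
        // prints "[ important example about regex for my work and ]"
        // Lazy: .*? stops as early as it can, so the match ends before the FIRST "for".
        System.out.println("[" + firstMatch("(?<=an).*?(?=for)", sentence) + "]");
        // prints "[ important example about regex ]"
    }
}
```

Note that both results keep the spaces next to the delimiters, because the lookarounds only assert "an" and "for" and do not consume the whitespace around them.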

For your second question, you have to define what a "word" is. Here I would say it is probably just a sequence of non-whitespace characters followed by a whitespace character. Something like this:

\S+\s

and repeat it 3 times, like this:

(?<=an)\s(\S+\s){3}(?=for)

To ensure that the pattern matches on whole words, use word boundaries:

(?<=\ban\b)\s(\S+\s){1,5}(?=\bfor\b)


{3} matches exactly 3 repetitions. For a minimum of 1 and a maximum of 3, use {1,3}.
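The sketch below shows how the word-count pattern behaves on the sentence from the question. One thing worth noticing: with the lookahead for "for" in place, the pattern only matches when the number of words between "an" and "for" fits inside the {min,max} range; it does not cut a longer run down to 3 words. The second method, firstWords, is my own assumption about what "limit to 3 words" could mean (drop the lookahead and take at most N words after "an"); it is not part of the patterns above. Class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordLimitBetween {
    // Matches only if between 1 and maxWords words separate the words "an" and "for".
    public static String between(String input, int maxWords) {
        String regex = "(?<=\\ban\\b)\\s(\\S+\\s){1," + maxWords + "}(?=\\bfor\\b)";
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group().trim() : null;
    }

    // Assumption, not from the answer: take at most maxWords words after "an",
    // without requiring that "for" follows.
    public static String firstWords(String input, int maxWords) {
        String regex = "(?<=\\ban\\b)\\s(\\S+\\s){1," + maxWords + "}";
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        System.out.println(between(sentence, 5));    // prints "important example about regex"
        System.out.println(between(sentence, 3));    // prints "null": four words lie between
        System.out.println(firstWords(sentence, 3)); // prints "important example about"
    }
}
```

Because firstWords has no lookahead, it would happily include "for" itself when fewer than maxWords words precede it, so treat it as a starting point only.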

Alternative:


As dma_k correctly pointed out, in your case it is not necessary to use lookbehind and lookahead. See the Matcher documentation on groups.

You can use a capturing group instead. Just put the part you want to extract in parentheses and it will be captured as a group.

\ban\b(.*?)\bfor\b


You can then access this group like this:

System.out.println("I found the text: " + matcher.group(1).toString());
                                                        ^

You have only one pair of parentheses, so it's simple: pass 1 to matcher.group(1) to access the first capturing group.
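Putting the capturing-group variant together, a minimal self-contained sketch could look like this. The class GroupBetween and its between method are hypothetical names; Pattern.quote is used so the two delimiter words are treated literally, and the trim() call is my addition to drop the spaces next to the delimiters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupBetween {
    // Returns the text between the whole words before and after, or null if there is no match.
    public static String between(String input, String before, String after) {
        String regex = "\\b" + Pattern.quote(before) + "\\b(.*?)\\b" + Pattern.quote(after) + "\\b";
        Matcher m = Pattern.compile(regex).matcher(input);
        // group(1) is the content of the first (and only) pair of parentheses.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        System.out.println(between(sentence, "an", "for")); // prints "important example about regex"
    }
}
```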

#2


3  

Your regex is "an\\s+(.*?)\\s+for". It extracts all characters between "an" and "for", ignoring the surrounding whitespace (\s+). The question mark makes the quantifier lazy (non-greedy). It is needed to prevent the pattern .* from consuming everything up to the last "for".
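As a quick check of this pattern, the sketch below applies it to the sentence from the question. The class and method names are made up for illustration. Since \s+ consumes the whitespace on both sides, the captured group comes back without leading or trailing spaces. Keep in mind that the pattern has no word boundaries, so an "an" at the end of a longer word that is followed by whitespace would also count as the start delimiter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceBetween {
    private static final Pattern BETWEEN = Pattern.compile("an\\s+(.*?)\\s+for");

    // Returns the lazily captured text between "an" and "for", or null if there is no match.
    public static String extract(String input) {
        Matcher m = BETWEEN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        System.out.println("[" + extract(sentence) + "]"); // prints "[important example about regex]"
    }
}
```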

#3


2  

public class SubStringBetween {


// Returns the text between the last occurrence of "before" and the last occurrence of "after".
public static String subStringBetween(String sentence, String before, String after) {

    int startSub = SubStringBetween.subStringStartIndex(sentence, before);
    int stopSub = SubStringBetween.subStringEndIndex(sentence, after);

    String newWord = sentence.substring(startSub, stopSub);
    return newWord;
}

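// Returns the index just past the last occurrence of delimiterBeforeWord, or 0 if it does not occur.
public static int subStringStartIndex(String sentence, String delimiterBeforeWord) {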

    int startIndex = 0;
    String newWord = "";
    int x = 0;

    for (int i = 0; i < sentence.length(); i++) {
        newWord = "";

        if (sentence.charAt(i) == delimiterBeforeWord.charAt(0)) {
            startIndex = i;
            for (int j = 0; j < delimiterBeforeWord.length(); j++) {
                try {
                    if (sentence.charAt(startIndex) == delimiterBeforeWord.charAt(j)) {
                        newWord = newWord + sentence.charAt(startIndex);
                    }
                    startIndex++;
                } catch (Exception e) {
                    // Index ran past the end of the sentence; this candidate cannot match.
                }

            }
            if (newWord.equals(delimiterBeforeWord)) {
                x = startIndex;
            }
        }
    }
    return x;
}

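// Returns the start index of the last occurrence of delimiterAfterWord, or 0 if it does not occur.
public static int subStringEndIndex(String sentence, String delimiterAfterWord) {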

    int startIndex = 0;
    String newWord = "";
    int x = 0;

    for (int i = 0; i < sentence.length(); i++) {
        newWord = "";

        if (sentence.charAt(i) == delimiterAfterWord.charAt(0)) {
            startIndex = i;
            for (int j = 0; j < delimiterAfterWord.length(); j++) {
                try {
                    if (sentence.charAt(startIndex) == delimiterAfterWord.charAt(j)) {
                        newWord = newWord + sentence.charAt(startIndex);
                    }
                    startIndex++;
                } catch (Exception e) {
                    // Index ran past the end of the sentence; this candidate cannot match.
                }

            }
            if (newWord.equals(delimiterAfterWord)) {
                x = startIndex;
                x = x - delimiterAfterWord.length();
            }
        }
    }
    return x;
}

}
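For comparison, the same regex-free idea can be sketched more compactly with String.indexOf. This is not the code above rewritten line by line: it assumes you want the first occurrence of the start delimiter and the next occurrence of the end delimiter after it, and the class name IndexOfBetween is hypothetical.

```java
public class IndexOfBetween {
    // Returns the text between the first occurrence of before and the next
    // occurrence of after that follows it, or null if either is missing.
    public static String between(String sentence, String before, String after) {
        int start = sentence.indexOf(before);
        if (start < 0) {
            return null;
        }
        start += before.length();
        int end = sentence.indexOf(after, start);
        if (end < 0) {
            return null;
        }
        return sentence.substring(start, end);
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        // Plain substrings know nothing about word boundaries, so the spaces are
        // included in the delimiters to avoid hitting the "an" inside "important".
        System.out.println(between(sentence, " an ", " for ")); // prints "important example about regex"
    }
}
```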
