
Time: 2022-09-13 23:36:33

I need a regular expression to select all the text between two outer brackets.


Example: some text(text here(possible text)text(possible text(more text)))end text


Result: (text here(possible text)text(possible text(more text)))


I've been trying for hours, mind you my regular expression knowledge isn't what I'd like it to be :-) so any help will be gratefully received.


15 solutions



Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.


But there is a simple algorithm to do this, which I described in this answer to a previous question.
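The linked answer is not reproduced on this page, so what follows is only a minimal sketch of one such simple algorithm, assuming it amounts to a depth counter. The function name outer_groups and its parameters are hypothetical and do not come from the original answer.

```python
def outer_groups(text, open_ch="(", close_ch=")"):
    """Return every outermost balanced open_ch...close_ch substring of text."""
    groups = []
    depth = 0       # current nesting depth
    start = None    # index of the opening delimiter of the current outer group
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # Back at depth 0: an outermost group just closed.
                groups.append(text[start:i + 1])
    return groups


example = "some text(text here(possible text)text(possible text(more text)))end text"
print(outer_groups(example))
```

A closing delimiter seen at depth 0 is simply ignored here, and a group that never closes is not reported; both choices are assumptions that a real parser might replace with an error.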




You can use regex recursion:





I want to add this answer for quick reference. Feel free to update.


.NET Regex using balancing groups.



Where c is used as the depth counter.


Demo at Regexstorm.com


PCRE using a recursive pattern.



Demo at regex101; Or without alternation:



Demo at regex101; Or unrolled for performance:



Demo at regex101; The pattern is pasted at (?R) which represents (?0).


Perl, PHP, Notepad++, R: perl=TRUE, Python: Regex package with (?V1) for Perl behaviour.


Ruby using subexpression calls.


With Ruby 2.0, \g<0> can be used to call the full pattern.


Demo at Rubular; Ruby 1.9 only supports capturing group recursion:



Demo at Rubular (atomic grouping since Ruby 1.9.3)

JavaScript API :: XRegExp.matchRecursive

XRegExp.matchRecursive(str, '\\(', '\\)', 'g');

JS, Java and other regex flavors without recursion up to 2 levels of nesting:



Demo at regex101. Deeper nesting has to be added to the pattern by hand.
To fail faster on unbalanced parentheses, drop the + quantifier.
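The pattern itself is missing from this page, so the following is only a sketch of the fixed-depth idea using Python's standard re module, which has no recursion. The pattern TWO_LEVELS and the helper match_up_to_two_levels are assumptions written for illustration, not the answer's original expression.

```python
import re

# Hypothetical fixed-depth pattern (an assumption, not quoted from the answer).
# Each nested \( ... \) layer admits one more level of parentheses, so this
# handles an outer pair plus up to two levels nested inside it.
TWO_LEVELS = re.compile(
    r"\((?:[^()]+|\((?:[^()]+|\([^()]*\))*\))*\)"
)


def match_up_to_two_levels(text):
    """Return the first parenthesised group nested at most two levels deep."""
    m = TWO_LEVELS.search(text)
    return m.group(0) if m else None


example = "some text(text here(possible text)text(possible text(more text)))end text"
print(match_up_to_two_levels(example))
```

When the input nests deeper than the pattern allows, the search falls through to an inner group that does fit, which is why deeper nesting has to be written into the pattern explicitly.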


Java: An interesting idea using forward references by @jaytea.


Reference - What does this regex mean?





[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
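To make the behaviour concrete, here is a small sketch that joins the three pieces described above into one expression and runs it with Python's re module. The choice of Python, the name GREEDY and the helper first_open_to_last_close are assumptions for illustration; the answer does not say which language it targeted.

```python
import re

# The three parts described above, joined in order. Group 1 is the
# parenthesised substring; .* is greedy, so it runs to the LAST ")".
GREEDY = re.compile(r"[^\(]*(\(.*\))[^\)]*")


def first_open_to_last_close(text):
    """Return the text from the first "(" to the last ")", or None."""
    m = GREEDY.fullmatch(text)
    return m.group(1) if m else None


print(first_open_to_last_close(
    "some text(text here(possible text)text(possible text(more text)))end text"))
# Unbalanced input is accepted as-is, because nothing is being counted:
print(first_open_to_last_close("abc ( 123 ( foobar ) def ) xyz ) ghij"))
```

The second call returns a span containing a stray closing bracket, which illustrates why the answer recommends a parser when the brackets must actually balance.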





If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).


This regex just returns the text between the first opening and the last closing parentheses in your string.


(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.




It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.


You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.


Angle brackets <> were used because they do not require escaping.


The regular expression looks like this:


<[^<>]*(
    (
        (?<Open><)
        [^<>]*
    )+
    (
        (?<Close-Open>>)
        [^<>]*
    )+
)*(?(Open)(?!))>



This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.


Regular expressions can not do this.


Regular expressions are based on a computing model known as a finite state automaton (FSA). As the name indicates, an FSA can remember only its current state; it has no information about previous states.



In the above diagram, S1 and S2 are two states, where S1 is both the start state and the final state. If we try the string 0110, the transitions go as follows:


      0     1     1     0
-> S1 -> S2 -> S2 -> S2 -> S1

In the above steps, when we are at the second S2, i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01, as it can only remember the current state and the next input symbol.


In the above problem, we need to know the number of opening parentheses; this means the count has to be stored somewhere. Since FSAs cannot do that, a regular expression cannot be written.


However, an algorithm can be written to achieve the goal. Such algorithms generally fall under pushdown automata (PDA). A PDA is one level above an FSA: it has an additional stack in which to store something. PDAs can solve the above problem, because we can 'push' each opening parenthesis onto the stack and 'pop' one whenever we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
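The push and pop procedure described above can be sketched in a few lines of Python. This is an illustration only; the function name is_balanced and its parameters are assumptions, not part of the original answer.

```python
def is_balanced(text, open_ch="(", close_ch=")"):
    """Check whether every open_ch in text is closed by a later close_ch."""
    stack = []
    for ch in text:
        if ch == open_ch:
            stack.append(ch)      # 'push' the opening parenthesis
        elif ch == close_ch:
            if not stack:
                return False      # a closer with nothing left to match
            stack.pop()           # 'pop' the matching opener
    return not stack              # balanced only if the stack is empty


print(is_balanced("some text(text here(possible text))end"))
print(is_balanced("abc ( 123 ( foobar ) def ) xyz ) ghij"))
```

With a single kind of bracket the stack could be replaced by an integer counter; the stack form is kept here because it mirrors the PDA description and extends naturally to several bracket types.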


A detailed discussion can be found here.




This is the definitive regex:


\((?<arguments> (
    ([^\(\)']*) |
    (\([^\(\)']*\)) |
    '(.*?)'
)*)\)


input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'

Note that '(pip' is correctly handled as a string. (Tried in Regulator: http://sourceforge.net/projects/regulator/)




I have written a small JavaScript library called balanced to help with this task. You can accomplish this with:


balanced.matches({
    source: source,
    open: '(',
    close: ')'
});

You can even do replacements:


balanced.replacements({
    source: source,
    open: '(',
    close: ')',
    replace: function (source, head, tail) {
        return head + source + tail;
    }
});

Here's a more complex and interactive example: JSFiddle




The regular expression using Ruby (version 1.9.3 or above):



Demo on rubular




So you need the first and last parentheses. Use something like this:

str.indexOf('('); - gives you the first occurrence
str.lastIndexOf(')'); - gives you the last one

So you need the string between them (the + 1 keeps the closing parenthesis, since the end index of substring is exclusive):

String searchedString = str.substring(str.indexOf('('), str.lastIndexOf(')') + 1);



Here is a customizable solution allowing single character literal delimiters in Java:


import java.util.ArrayList;
import java.util.List;

public static List<String> getBalancedSubstrings(String s, Character markStart,
                                  Character markEnd, Boolean includeMarkers) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;                // current nesting depth
    int lastOpenDelimiter = -1;   // start index of the current outermost group
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenDelimiter = (includeMarkers ? i : i + 1);
            }
        }
        else if (c == markEnd) {
            if (level == 1) {
                // Closing the outermost group: record it.
                subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}

Sample usage:

String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]



The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.


If you need to match balanced nested brackets, then you need something more than regular expressions; see @dehmann's answer.

If it's just the first open to the last close, see @Zach's answer.


Decide what you want to happen with:


abc ( 123 ( foobar ) def ) xyz ) ghij

You need to decide what your code needs to match in this case.




"""Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.

This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns.  This is where the re package greatly
assists in parsing.
"""

import re

# The pattern below recognises a sequence consisting of:
#    1. Any characters not in the set of open/close strings.
#    2. One of the open/close strings.
#    3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included.  However quotes are not ignored inside
# quotes.  More logic is needed for that....

pat = re.compile(r"""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
                           \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)

# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.

matching = { "(" : ")",
             "[" : "]",
             "{" : "}",
             "<" : ">",
             '"' : '"',
             "'" : "'",
             "BEGIN" : "END" }

# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.

def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)

        if m.group(1) != "":
            lst.append(m.group(1))

        if m.group(2) == term:
            return lst, m.group(3)

        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))

# Unit test.

if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(e)
    print("done")



This one also worked


re.findall(r'\(.+\)', s)


