Matching nested brackets with C# Regex

Time: 2022-09-13 07:52:53

I want to match a particular set of nested brackets in the output of a grammatical parser (the Stanford Parser), as shown below.

(ROOT (S (NP (PRP He)) (VP (VBD gave) (NP (PRP me)) (NP (DT a) (NN pen))) (. .)))
(ROOT (S (NP (PRP He)) (VP (VBD said) (SBAR (IN that) (S (NP (PRP he)) (VP (VBD was) (ADJP (JJ hungry)))))) (. .)))
(ROOT (S (NP (PRP I)) (VP (VBD wrote) (NP (PRP him)) (NP (DT a) (JJ long) (NN letter))) (. .)))
(ROOT (S (NP (PRP He)) (VP (VBD provided) (NP (DT the) (JJ old) (NN bagger)) (NP (NP (DT a) (NN lot)) (PP (IN of) (NP (NN food))))) (. .)))

I want to match everything inside a (VP ...) node, subject to two conditions: (1) It must contain one (VBD ...) followed by two (NP ...) nodes. The VBD is not a problem. (2) The two NP nodes are the problem. The internal structure of an NP is not predictable; the only predictable part is the NP label followed by nested brackets, as in (NP bla bla bla). So I want to capture each NP as a whole, including all of its nested brackets. The half-finished regex below matches what I want (in this example at least), but it does not yet define the (NP bla bla bla) part, i.e. an NP together with all of its recursive bracket sub-nodes.

\(VP\s+\(V\w+([^()]+|(?<Level>\()|(?<-Level>\)))+(?(Level)(?!))\)

There is some material on Balancing Group Definitions here that explains nested brackets, but it does not offer a solution to my problem.

2 solutions

#1


Well, I'm not sure I really understood what exactly you wanted, but I'll give it a try. :)

\(VP.*\(V(\w{1,2}).*\(NP.*\){2}\)

This matches 4 times with your given example and the one special case you wanted.

You might want to check out regexpal.com to check for yourself.

Edit: I used . (dot) a lot, you might want to be a little stricter.
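
One way to be stricter, as a rough sketch rather than a definitive answer, is to describe each (NP ...) with .NET balancing groups, reusing the Level idea from the half-finished pattern in the question. The names VpMatcher, NpNode, Find and the groups verb, np1 and np2 are made up for this illustration, and the pattern assumes the verb node is exactly (VBD word) and that the two NP nodes follow it directly.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class VpMatcher
{
    // One balanced "(NP ...)" node: Level is pushed on every "(" and popped on
    // every ")". A ")" with an empty Level stack cannot be consumed by the loop,
    // so the loop stops at the NP's own closing bracket. The conditional
    // (?(Level)(?!)) rejects the match if any "(" is still open.
    private const string NpNode =
        @"\(NP\b(?>[^()]+|(?<Level>\()|(?<-Level>\)))*(?(Level)(?!))\)";

    // "(VP", one "(VBD word)", two balanced NP nodes, then the VP's ")".
    private static readonly Regex VpPattern = new Regex(
        @"\(VP\s+(?<verb>\(VBD\s+[^()\s]+\))\s*" +
        @"(?<np1>" + NpNode + @")\s*" +
        @"(?<np2>" + NpNode + @")\s*\)");

    // Returns one { verb, np1, np2 } triple per matching VP in the input.
    public static List<string[]> Find(string parse)
    {
        var results = new List<string[]>();
        foreach (Match m in VpPattern.Matches(parse))
        {
            results.Add(new[]
            {
                m.Groups["verb"].Value,
                m.Groups["np1"].Value,
                m.Groups["np2"].Value
            });
        }
        return results;
    }
}
```

Calling VpMatcher.Find on each sample line should give one triple for the "gave", "wrote" and "provided" sentences and none for the "said" sentence, where the VBD is followed by an SBAR rather than an NP. The atomic group (?>...) is there only to keep backtracking in check on inputs that do not match.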

#2


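Nope, sorry. Regex is incredibly useful, but you're asking for something that regular expressions in the classical sense can't do. A classical regular expression is equivalent to a "deterministic finite automaton", which has no ability to count and so cannot track arbitrarily deep nesting: https://en.wikipedia.org/wiki/Deterministic_finite_automaton (.NET's balancing groups, which your pattern already uses, go beyond that model, but patterns built on them get hard to read and maintain quickly.)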

So, what you probably want is a simple recursive descent parser that will let you match up parentheses recursively. It's probably less work than what you've expended trying to make regex work, especially for as simple a grammar as you have. For a description and example you can start here: https://en.wikipedia.org/wiki/Recursive_descent_parser
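
To make that concrete, here is a minimal sketch of such a parser for the bracketed output shown in the question. Node, TreeParser, Parse and FindVps are names invented for this example, and it assumes every node looks like (LABEL child child ...) or (LABEL word), which holds for the four sample lines but is not checked against the full Stanford Parser format.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree type for this sketch.
public sealed class Node
{
    public string Label = "";
    public string Word;   // set only on leaves such as (PRP He)
    public List<Node> Children = new List<Node>();

    // Renders the node back into bracket notation.
    public override string ToString()
    {
        var parts = new List<string> { Label };
        if (Word != null) parts.Add(Word);
        foreach (var child in Children) parts.Add(child.ToString());
        return "(" + string.Join(" ", parts) + ")";
    }
}

public static class TreeParser
{
    public static Node Parse(string text)
    {
        int pos = 0;
        return ParseNode(text, ref pos);
    }

    // node := "(" label ( node | word )* ")"
    private static Node ParseNode(string s, ref int pos)
    {
        SkipSpaces(s, ref pos);
        if (pos >= s.Length || s[pos] != '(')
            throw new FormatException("Expected '(' at position " + pos);
        pos++;
        var node = new Node { Label = ReadToken(s, ref pos) };
        while (true)
        {
            SkipSpaces(s, ref pos);
            if (pos >= s.Length) throw new FormatException("Unbalanced brackets");
            if (s[pos] == ')') { pos++; return node; }
            if (s[pos] == '(') node.Children.Add(ParseNode(s, ref pos)); // recurse
            else node.Word = ReadToken(s, ref pos);
        }
    }

    private static void SkipSpaces(string s, ref int pos)
    {
        while (pos < s.Length && char.IsWhiteSpace(s[pos])) pos++;
    }

    // Reads up to the next whitespace or bracket.
    private static string ReadToken(string s, ref int pos)
    {
        int start = pos;
        while (pos < s.Length && !char.IsWhiteSpace(s[pos]) && s[pos] != '(' && s[pos] != ')')
            pos++;
        return s.Substring(start, pos - start);
    }

    // Yields every VP whose children are exactly VBD, NP, NP.
    public static IEnumerable<Node> FindVps(Node node)
    {
        if (node.Label == "VP" && node.Children.Count == 3 &&
            node.Children[0].Label == "VBD" &&
            node.Children[1].Label == "NP" &&
            node.Children[2].Label == "NP")
        {
            yield return node;
        }
        foreach (var child in node.Children)
            foreach (var hit in FindVps(child))
                yield return hit;
    }
}
```

With the tree in hand, each NP is just hit.Children[1] and hit.Children[2] for every hit returned by TreeParser.FindVps(TreeParser.Parse(line)), however deeply it is nested, and the "one VBD plus two NPs" condition becomes an ordinary if statement that is easy to change later.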

(Hey, what do you know! Those computer science classes turned out to be useful!)
