Regular expressions and optional text blocks

Time: 2022-05-22 11:12:57

I'm using a regex to parse structured text like the sample below, where the carets mark what I'm trying to match:

block 1
^^^^^^^
    subblock 1.1
        attrib a=a1
    subblock 1.2
        attrib b=b1
                 ^^
block 2
    subblock 2.1
        attrib a=a2
block 3
^^^^^^^
    subblock 3.1
        attrib a=a3
    subblock 3.2
        attrib b=b3
                 ^^

A subblock may or may not appear inside a block; for example, there is no subblock 2.2.

The expected match is [(block1,b1), (block3,b3)].

/(capture block#)[\s\S]*?attrib\sb=(capture b#)/gm

But this ends up matching [(block1, b1), (block2, b3)].

Where is my regex going wrong?

2 answers

#1


2  

You can use

(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)

See the regex demo

The regex is based on the unroll-the-loop technique. Here is an explanation:

  • (?m) - multiline modifier that makes ^ match at the beginning of each line

  • (^block\s*\d+) - matches and captures the word block, optional whitespace, and 1+ digits at the start of a line (Group 1)

  • .* - matches the rest of the line (the DOTALL option must be off, so that . does not match a newline)

  • (?:\n(?!block\s*\d).*)* - matches zero or more following lines that do not start with the word block followed by optional whitespace and a digit (this sets the block boundary)

  • \battrib\s*b=(\w+) - matches the whole word attrib followed by 0+ whitespace characters and a literal b=, then matches and captures 1+ alphanumeric or underscore characters with (\w+) (Group 2; adjust this to fit your real data)
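
The breakdown above can also be written as a commented pattern. This is only a sketch: it assumes Python's re.VERBOSE flag is acceptable for your use, and the names BLOCK_B and sample are made up for illustration. The pattern itself is unchanged apart from the added whitespace and comments.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# BLOCK_B and sample are illustrative names, not part of the original answer.
BLOCK_B = re.compile(r"""
    (^block\s*\d+)      # Group 1: a block header at the start of a line
    .*                  # the rest of the header line
    (?:                 # zero or more following lines
        \n
        (?!block\s*\d)  # that do not start a new block
        .*
    )*
    \battrib\s*b=       # the attribute we are looking for
    (\w+)               # Group 2: its value
""", re.MULTILINE | re.VERBOSE)

sample = (
    "block 1\n"
    "    subblock 1.2\n"
    "        attrib b=b1\n"
    "block 2\n"
    "    subblock 2.1\n"
    "        attrib a=a2\n"
)

# Block 2 has no "attrib b=", so only block 1 is reported.
matches = BLOCK_B.findall(sample)
print(matches)
```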

Python demo:

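import re
# Same pattern as above; (?m) makes ^ match at the start of every line
p = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')
s = "block 1\n    subblock 1.1\n        attrib a=a1\n    subblock 1.2\n        attrib b=b1\nblock 2\n    subblock 2.1\n        attrib a=a2\nblock 3\n    subblock 3.1\n        attrib a=a3\n    subblock 3.2\n        attrib b=b3"
# findall returns one (block, value) tuple per match
print(p.findall(s))  # [('block 1', 'b1'), ('block 3', 'b3')]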
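
To see where the original attempt goes wrong, it may help to run both patterns side by side. The sketch below fills in the question's "(capture block#)" and "(capture b#)" placeholders with block\s*\d+ and \w+, which is my guess at what was meant; the names naive, bounded and text are made up. The lazy [\s\S]*? is free to run past the next block header, so block 2 ends up paired with b3.

```python
import re

text = (
    "block 1\n    subblock 1.1\n        attrib a=a1\n"
    "    subblock 1.2\n        attrib b=b1\n"
    "block 2\n    subblock 2.1\n        attrib a=a2\n"
    "block 3\n    subblock 3.1\n        attrib a=a3\n"
    "    subblock 3.2\n        attrib b=b3"
)

# Assumed reconstruction of the question's pattern: nothing stops
# [\s\S]*? from crossing into the next block.
naive = re.compile(r"(block\s*\d+)[\s\S]*?attrib\sb=(\w+)")

# The answer's pattern: the negative lookahead refuses to consume
# a line that starts a new block.
bounded = re.compile(
    r"(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)"
)

naive_result = naive.findall(text)
bounded_result = bounded.findall(text)
print(naive_result)    # [('block 1', 'b1'), ('block 2', 'b3')]
print(bounded_result)  # [('block 1', 'b1'), ('block 3', 'b3')]
```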

#2


0  

What about this regex? https://regex101.com/r/yZ4fL9/1

这个正则表达式怎么样? https://regex101.com/r/yZ4fL9/1

block (\d).*?attrib b=b(\1)
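
A rough way to try this one in Python is sketched below. I am assuming the dot has to match newlines (re.DOTALL here), since the pattern as written cannot span lines otherwise; the names pattern, text and pairs are made up. Note that the backreference \1 only works because, in the sample, the value b1 carries the same single digit as block 1, so treat it as specific to data shaped like this.

```python
import re

text = (
    "block 1\n    subblock 1.1\n        attrib a=a1\n"
    "    subblock 1.2\n        attrib b=b1\n"
    "block 2\n    subblock 2.1\n        attrib a=a2\n"
    "block 3\n    subblock 3.1\n        attrib a=a3\n"
    "    subblock 3.2\n        attrib b=b3"
)

# re.DOTALL is an assumption: it lets .*? run across line breaks.
# \1 requires the digit after "b=b" to equal the block's digit.
pattern = re.compile(r"block (\d).*?attrib b=b(\1)", re.DOTALL)

# Rebuild (block, value) pairs from the two captured digits.
pairs = [
    (f"block {m.group(1)}", f"b{m.group(2)}")
    for m in pattern.finditer(text)
]
print(pairs)  # [('block 1', 'b1'), ('block 3', 'b3')]
```

Unlike the first answer, nothing here stops the match from crossing a block boundary, so a block without "attrib b=" could still pair with a same-numbered value that appears later in the text.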
