How to use inline regex modifiers in Python

Date: 2022-09-02 10:01:35

I have a regex:

(.*\n)+DOCUMENTATION.*(\"\"\"|''')\n-*\n?((.*\n)+?)(\2)(?s:.*)

with which I'm trying to process files like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

# <GNU license here>

DOCUMENTATION = """
module: foo
short_description: baz
<some more here>    
"""

<rest of the python code>

I need to get the DOCUMENTATION part from it.

It works quite well, but not in Python. The problem is with the inline modifier (?s:.*), which I used to capture the rest of the file (any character, including newlines, zero or more times). It seems to behave differently in Python.

Here is the example at regex101. It shows an error when I switch the flavor to Python.

NOTE: I can't set modifiers globally (I can only pass the regex rule to a Python module).

1 solution

#1


Inline Modifiers in the re module

Python implements inline (embedded) modifiers, such as (?s), (?i) or (?aiLmsux). Before Python 3.6, however, they could not be used as part of a non-capturing group, which is what you were trying to do.
(?smi:subpattern) works in Perl and PCRE, and in Python's re only from version 3.6 onwards.

Moreover, a global inline modifier such as (?s) applies to the whole pattern and can't be turned off, even when it is written in the middle of the pattern. Python 3.6 deprecated that mid-pattern placement, and Python 3.11 rejects it with an error.

From regular-expressions.info:
In Python, putting a modifier in the middle of the regex affects the whole regex. So in Python, (?i)caseless and caseless(?i) are both case insensitive.


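Example (written for Python versions before 3.11, which still accept flags in the middle of a pattern):

import re

text = "A\nB"
print("Text: '%s'\n---" % text)
# Patterns with and without an inline flag; where present, the flag is
# deliberately placed after the start of the pattern.
patterns = ["a", "a(?i)", "A.*B", "A(?s).*B", "A.*(?s)B"]

for p in patterns:
    match = re.search(p, text)
    # Print the matched span, or None when the pattern does not match.
    print("Pattern: '%s'    \tMatch: %s" % (p, match.span() if match else None))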

Output:

Text: 'A
B'
---
Pattern: 'a'            Match: None
Pattern: 'a(?i)'        Match: (0, 1)
Pattern: 'A.*B'         Match: None
Pattern: 'A(?s).*B'     Match: (0, 3)
Pattern: 'A.*(?s)B'     Match: (0, 3)

ideone Demo
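
For comparison, Python 3.6 and later accept the scoped form (?s:...) in the re module, where the flag is limited to its own group. The following is a minimal sketch that assumes such an interpreter; the sample text and variable names are made up for illustration.

```python
import re

text = "A\nB\nC"

# Scoped flag: DOTALL applies only inside the (?s:...) group,
# so the dot in the group can match the newline after "A".
scoped = re.search(r"A(?s:.)B", text)
print(scoped.span() if scoped else None)

# The dot outside the group still refuses to match a newline,
# so this pattern cannot bridge the gap between "B" and "C".
outside = re.search(r"A(?s:.)B.C", text)
print(outside)
```

If the interpreter is older than 3.6, the scoped form is not available and the alternatives described under Solution still apply.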


Solution

(?s) (aka singleline or re.DOTALL) makes . also match newlines. Since you're trying to apply it to only part of the pattern, there are two alternatives:

  1. Match anything except newlines:
    Set (?s) for the whole pattern (either passed as a flag or inline), and use [^\n]* instead of a dot wherever you need to match any characters except newlines.

  2. Match everything including newlines:
    Use [\S\s]* instead of a dot, to match any characters including newlines. The character class includes all whitespace characters and all non-whitespace characters (thus, all characters).
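
The two alternatives can be compared side by side. This is only a sketch: the sample text and the variable names alt1 and alt2 are assumptions chosen for illustration.

```python
import re

text = "first line\nsecond line\nthird line"

# Alternative 1: DOTALL is switched on for the whole pattern,
# so [^\n]* stands in for a dot that stops at newlines.
alt1 = re.search(r"(?s)second[^\n]*", text)

# Alternative 2: no flag at all; [\s\S]* matches every character,
# newlines included, so it runs to the end of the text.
alt2 = re.search(r"second[\s\S]*", text)

print(repr(alt1.group()))
print(repr(alt2.group()))
```

The first search stops at the end of the second line, while the second one continues through the newline to the end of the text.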


For the specific case you presented, you can use the following expression:

(?m)^DOCUMENTATION.*(\"{3}|'{3})\n-*\n?([\s\S]+?)^\1[\s\S]*

regex101 Demo
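
As a rough check, the expression can be run against a shortened version of the file from the question. The sample content and the names source, pattern and doc_body are assumptions made for this sketch; only the expression itself comes from the answer.

```python
import re

# Shortened sample modelled on the file shown in the question.
source = "\n".join([
    "#!/usr/bin/python",
    "# -*- coding: utf-8 -*-",
    "",
    'DOCUMENTATION = """',
    "module: foo",
    "short_description: baz",
    '"""',
    "",
    "print('rest of the code')",
])

# (?m) sits at the very start, so it is accepted by current Python versions.
pattern = r"(?m)^DOCUMENTATION.*(\"{3}|'{3})\n-*\n?([\s\S]+?)^\1[\s\S]*"

match = re.search(pattern, source)
# Group 1 holds the quote style, group 2 the documentation body.
doc_body = match.group(2) if match else None
print(doc_body)
```

Group 2 carries the text between the opening and closing triple quotes, and the trailing [\s\S]* consumes the rest of the file without needing (?s).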


Note: This post covers inline modifiers in the re module, whereas Matthew Barnett's regex module does in fact implement inline modifiers (scoped flags) with the same behaviour observed in PCRE and Perl.
