Python regular expressions: finding a sequence of characters in a string

Time: 2022-09-13 16:28:19

I'm using Python and regex (new to both) to find a sequence of characters in a string as follows: grab the first instance of p followed by a number (it will always have the form p_ _, where each _ is a digit). Then find either an 's' or a 'go', then all integers up to the end of the string. For example:

ascjksdcvyp12nbvnzxcmgonbmbh12hjg23

should yield p12 go 12 23.

ascjksdcvyp12nbvnzxcmsnbmbh12hjg23

should yield p12 s 12 23.

I've only managed to get the p12 part of the string and this is what I've tried so far to extract the 'go' or 's':

decoded = (re.findall(r'([p][0-9]*)',myStr))
print(decoded)  # prints ['p12']

I know by doing something like

re.findall(r'[s]|[go]',myStr)

will give me all occurrences of s and g and o, but something like that is not what I'm looking for. And I'm not sure how I'd combine these regexes to get the desired output.
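
To make the difference concrete, here is a small sketch comparing the character class [go] with the alternation s|go. The name my_str is assumed to hold the first sample string from above.

```python
import re

my_str = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'  # first sample string (assumed input)

# [go] is a character class: it matches a single g OR a single o.
print(re.findall(r'[s]|[go]', my_str))

# s|go is an alternation: it matches the letter s or the two-letter sequence go.
print(re.findall(r's|go', my_str))
```

The first call returns every individual s, g and o, while the second returns only s and the full sequence go.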

2 Answers

#1


2  

Use re.findall with pattern grouping:

>>> import re
>>> string = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 'go', '12', '23')]

>>> string = 'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 's', '12', '23')]
  • When the pattern contains capturing groups (), re.findall returns only what those groups matched
  • p\d{2} matches p followed by exactly two digits
  • After that, .* matches any run of characters
  • Then s|go matches either s or go
  • \D* matches any number of non-digits (including none)
  • \d+ matches one or more digits
  • (?:) is a non-capturing group, i.e. the match inside won't show up in the output; it is only there to group tokens
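
As an illustration of the breakdown above, the same pattern can be laid out with re.VERBOSE so that each token carries its own comment. This is only a sketch; the name PATTERN is made up for the example.

```python
import re

# Same pattern as above, split over several lines with re.VERBOSE.
# PATTERN is a hypothetical name used only for this illustration.
PATTERN = re.compile(r"""
    (p\d{2})        # group 1: p followed by exactly two digits
    .*              # any run of characters (greedy)
    (s|go)          # group 2: either s or go
    \D*             # zero or more non-digits
    (\d+)           # group 3: the first run of digits
    (?:\D*(\d+))*   # more non-digits plus digits; group 4 keeps the last run
""", re.VERBOSE)

for text in ('ascjksdcvyp12nbvnzxcmgonbmbh12hjg23',
             'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'):
    print(PATTERN.findall(text))
```

In verbose mode whitespace inside the pattern is ignored and # starts a comment, so the compiled pattern is equivalent to the one-line version.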

Note:

>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+?', string)
[('p12', 's', '12')]

>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+', string)
[('p12', 's', '23')]

I would have liked to use one of the two patterns above, since matching the later digits is a repetitive task, but both the non-greedy and the greedy version have a problem: a repeated capturing group keeps only one of its matches (the first with +?, the last with +). Hence the digits after s or go have to be matched explicitly.
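
If the number of trailing integers is not known in advance, one possible workaround is to capture the whole tail once and extract the numbers from it in a second step. The sketch below assumes that the first s or go after the p token is the one wanted (hence the lazy .*?); the variable names are only for illustration.

```python
import re

text = 'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'

# Repeated capturing group: only the final iteration survives in group 3.
repeated = re.search(r'(p\d{2}).*(s|go)(?:\D*(\d+))+', text)
print(repeated.groups())

# Capture the whole tail once, then pull every number out of it.
# Assumption: the first s or go after the p token is the one wanted (lazy .*?).
tail = re.search(r'(p\d{2}).*?(s|go)(.*)', text)
numbers = re.findall(r'\d+', tail.group(3))
print(tail.group(1), tail.group(2), numbers)
```

The second search yields all trailing numbers regardless of how many there are, at the cost of a second pass over the tail.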

#2


0  

First, try to match your line with a minimal pattern, as a test. Use capturing (...) and non-capturing (?:...) parentheses to capture the interesting parts and skip the uninteresting ones. Store away what you care about, then chop off the remainder of the string and search it for numbers as a second step.

import re

line = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'  # sample input from the question

# Lazy .*? so that the first p followed by two digits is the one captured
simple_test = r'^.*?p(\d{2}).*?(?:s|go).*?(\d+)'
m = re.match(simple_test, line)
if m is not None:
    p_num = m.group(1)
    trailing_numbers = [m.group(2)]

    remainder = line[m.end():]             # text after the first trailing number
    trailing_numbers.extend(               # extend list by appending
        num.group(1)                       # group(1) from the match
        for num in re.finditer(r"(\d+)", remainder)  # of each number in string
    )

    print("P:", p_num, "Numbers:", trailing_numbers)  # P: 12 Numbers: ['12', '23']
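
The same two-step idea could also be wrapped in a small helper that keeps the s/go token, since the question asks for it in the output. This is a sketch rather than a definitive solution: extract_tokens and HEAD are hypothetical names, and it assumes the first s or go after the p token is the one wanted.

```python
import re
from typing import List, Optional

# Hypothetical helper, not part of the answer above.
# Step 1 pattern: first p + two digits, then the first s or go after it.
HEAD = re.compile(r'(p\d{2}).*?(s|go)')

def extract_tokens(line: str) -> Optional[List[str]]:
    """Return [p-token, s-or-go, number, ...] or None if the line does not match."""
    m = HEAD.search(line)
    if m is None:
        return None
    # Step 2: every run of digits in whatever follows the s/go token.
    numbers = re.findall(r'\d+', line[m.end():])
    return [m.group(1), m.group(2), *numbers]

for sample in ('ascjksdcvyp12nbvnzxcmgonbmbh12hjg23',
               'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'):
    print(' '.join(extract_tokens(sample)))
```

For the two sample strings this prints p12 go 12 23 and p12 s 12 23, the output requested in the question.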
