How to use boolean OR in a regex

Time: 2022-09-19 17:57:29

I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.

For example, an re.findall on

"ATGTCAGGTAAGCTTAGGGCTTTAGGATT"

should give me:

['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']

I have tried all of the following without success:

import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)

What am I missing here?

2 answers

#1


3  

This is not super easy, because a) you want overlapping matches, and b) you want the greedy match, the non-greedy match, and everything in between.
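To make that concrete, here is a small sketch (not part of the original answer) of what plain re.findall returns on the question's string. The names greedy, lazy and grouped are only illustrative, and the non-capturing group (?:...) is my substitution so that findall returns whole matches.

```python
import re

s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"

# Greedy: .* runs as far right as it can, so only the longest match is reported.
greedy = re.findall(r'ATG.*(?:TAA|TAG)', s)

# Non-greedy: .*? stops at the first stop substring, so only the shortest is reported.
lazy = re.findall(r'ATG.*?(?:TAA|TAG)', s)

# With a capturing group, findall returns the group's text, not the whole match.
grouped = re.findall(r'ATG.*(TAA|TAG)', s)

print(greedy)   # ['ATGTCAGGTAAGCTTAGGGCTTTAG']
print(lazy)     # ['ATGTCAGGTAA']
print(grouped)  # ['TAG']
```

Neither quantifier yields the middle match, and findall never returns matches that overlap, which is why none of the single-regex attempts can produce all three strings.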

As long as the strings are fairly short, you can check every substring:

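import re
s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
# The trailing $ anchors each match to endpos, the end of the candidate substring
p = re.compile(r'ATG.*TA[GA]$')

for start in range(len(s) - 5):  # a match is at least 6 letters long
    for end in range(start + 6, len(s) + 1):  # + 1 so a match may end at the end of s
        if p.match(s, pos=start, endpos=end):
            print(s[start:end])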

This prints:

ATGTCAGGTAA
ATGTCAGGTAAGCTTAG
ATGTCAGGTAAGCTTAGGGCTTTAG
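If you would rather get a list back, as re.findall would give you, the same brute-force idea can be wrapped in a function. This is only a sketch: the name find_all_substrings and the min_len parameter are hypothetical, and the pattern is still expected to end in $ as above.

```python
import re

def find_all_substrings(pattern, s, min_len=6):
    """Return every substring of s that the pattern matches from start to end.

    The pattern must end in $ so that each match is anchored to endpos.
    min_len is the shortest possible match (6 for ATG + TAA/TAG).
    """
    p = re.compile(pattern)
    found = []
    for start in range(len(s) - min_len + 1):
        # len(s) + 1 so that a match ending at the very end of s is included
        for end in range(start + min_len, len(s) + 1):
            if p.match(s, pos=start, endpos=end):
                found.append(s[start:end])
    return found

print(find_all_substrings(r'ATG.*TA[GA]$', "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"))
# ['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
```

It still tries O(N^2) index pairs, so it is only reasonable for fairly short strings.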

Since you appear to be working with DNA sequences or something similar, be sure to check out Biopython, too.

#2


0  

I like the accepted answer just fine :-) That is, I'm adding this for info, not looking for points.

If you need this heavily, trying a match on O(N^2) pairs of indices may soon become unbearably slow. One improvement is to use the .search() method to "leap" directly to the only starting indices that can possibly pay off. The code below does that.

It also uses the .fullmatch() method so that you don't have to artificially change the "natural" regexp (e.g., in your example, no need to add a trailing $ to the regexp - and, indeed, in the following code doing so would no longer work as intended). Note that .fullmatch() was added in Python 3.4, so this code also requires Python 3!

Finally, this intends to generalize the re module's finditer() function/method. While you don't need match objects (you just want strings), they're far more generally applicable, and returning a generator is often friendlier than returning a list too.

So, no, this doesn't do exactly what you want, but does things from which you can get what you want, in Python 3, faster:

def finditer_overlap(regexp, string):
    start = 0
    n = len(string)
    while start <= n:
        # don't know whether regexp will find shortest or
        # longest match, but _will_ find leftmost match
        m = regexp.search(string, start)
        if m is None:
            return
        start = m.start()
        for finish in range(start, n+1):
            m = regexp.fullmatch(string, start, finish)
            if m is not None:
                yield m
        start += 1

Then, e.g.,

import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
pat = re.compile("ATG.*(TAA|TAG)")
for match in finditer_overlap(pat, string2):
    print(match.group())

prints what you wanted in your example. Most of the other regexps you tried should also work; the exception is 'ATG.*(TAA)|(TAG)', where the alternation splits the pattern into 'ATG.*(TAA)' or a bare '(TAG)'. In this example it's faster because the second time around the outer loop, start is 1, and regexp.search(string, 1) fails to find another match, so the generator exits at once (and so skips checking O(N^2) other index pairs).
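Since the generator yields match objects, getting the plain list of strings (or the spans) is one comprehension away. The sketch below repeats finditer_overlap so that it runs on its own; the variable names matches, spans and odd are just illustrative, and the last pattern is one of the attempts from the question, included to show how the position of the | changes what is matched.

```python
import re

def finditer_overlap(regexp, string):
    start = 0
    n = len(string)
    while start <= n:
        # search() jumps to the leftmost position where a match can start
        m = regexp.search(string, start)
        if m is None:
            return
        start = m.start()
        # try every possible end position for this start
        for finish in range(start, n + 1):
            m = regexp.fullmatch(string, start, finish)
            if m is not None:
                yield m
        start += 1

string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
pat = re.compile("ATG.*(TAA|TAG)")

matches = [m.group() for m in finditer_overlap(pat, string2)]
spans = [m.span() for m in finditer_overlap(pat, string2)]
print(matches)  # ['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
print(spans)    # [(0, 11), (0, 17), (0, 25)]

# Here | separates 'ATG.*(TAA)' from a bare '(TAG)', so lone TAGs match too.
odd = [m.group() for m in finditer_overlap(re.compile("ATG.*(TAA)|(TAG)"), string2)]
print(odd)      # ['ATGTCAGGTAA', 'TAG', 'TAG']
```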
