Why isn't re.findall specifically finding triplet items in a string? (Python)

Date: 2023-01-09 22:33:33

So I have four lines of code


import re

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'

OR_0 = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)', seq)

Let me explain what I am attempting to do first. I'm sorry if this is confusing, but I will try my best to explain it.

So I'm looking for sequences that START with 'ATG', followed by units of 3 of any word character [e.g. 'GGG', 'GTT', 'TTA', etc.], until it encounters a 'TAA', 'TAG' or 'TGA'. I also want them to be at least 30 characters long, hence the {9,}?

This works to some degree, but notice that seq, read in units of 3, is ATG GAA GTT GGA TGA AAG TGG AGG TAA AGA GAA GAC GTT TGA

So in this case, it should find 'ATGGAAGTTGGATGA' if it starts with the first 'ATG' and goes until the next 'TAA', 'TAG' or 'TGA'.

HOWEVER, when you run the OR_0 line of code, it spits back out the entire seq string. I don't know how to make it only consider the first 'TAA', 'TAG' or 'TGA' that follows the first 'ATG'.

If an 'ATG' is followed by another 'ATG' when read in units of 3, that is alright and it should NOT start over, but if it encounters a 'TAA', 'TAG' or 'TGA' when read in units of 3, it should stop.

My question: why is re.findall finding the longest sequence of 'ATG'-xxx-xxx-['TAA', 'TAG' or 'TGA'] instead of the first occurrence of 'TAA', 'TAG' or 'TGA' after an 'ATG', separated by word characters in units of 3?

Once again, I apologize if this is confusing, but it's messing with multiple data sets that I have based on this initial line of text, and I'm trying to find out why.

4 Answers

#1


2  

If you want your regex to stop matching at the first TAA|TAG|TGA, but still only succeed if there are at least nine three-letter chunks, the following may help:

>>> import re
>>> regexp = r'ATG(?:(?!TAA|TAG|TGA)...){9,}?(?:TAA|TAG|TGA)'
>>> re.findall(regexp, 'ATGAAAAAAAAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAAAAAAAAAAAAAAAAAAAAAAAAATAG']
>>> re.findall(regexp, 'ATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAG']
>>> re.findall(regexp, 'ATGAAATAGAAAAAAAAAAAAAAAAAAAAATAG')
[]

This uses a negative lookahead, (?!TAA|TAG|TGA), to check that the next three-character chunk is not TAA, TAG or TGA before that chunk is consumed.
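As a rough side-by-side sketch of what the lookahead changes, here is the pattern from the question next to the guarded one, both run on the seq from the question. The names plain and guarded are made up for this illustration.

```python
import re

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'

# Pattern from the question: each "..." chunk may consume anything,
# including an in-frame stop codon.
plain = re.compile(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)')

# Guarded pattern: the negative lookahead refuses to consume a chunk
# that is itself TAA, TAG or TGA.
guarded = re.compile(r'ATG(?:(?!TAA|TAG|TGA)...){9,}?(?:TAA|TAG|TGA)')

plain_result = plain.findall(seq)
guarded_result = guarded.findall(seq)

print(plain_result)    # the whole of seq
print(guarded_result)  # [] because the first in-frame stop comes too early
```

With the guard in place, seq yields no match at all: the first in-frame stop codon arrives after only three chunks, so the "at least nine chunks" requirement can never be met from that ATG.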

Note, though, that a TAA|TAG|TGA that does not fall on a three-character boundary does not end the match, so a string containing one will still match successfully:

>>> re.findall(regexp, 'ATGAAAATAGAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAATAGAAAAAAAAAAAAAAAAAAAATAG']
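To make the frame issue visible, one option is to split a match into chunks of three and check each chunk. This is only a sketch: the helper name codons and the test string s are invented here (s is built chunk by chunk so its length is easy to verify), and s deliberately contains an out-of-frame TAG.

```python
import re

STOPS = {'TAA', 'TAG', 'TGA'}
regexp = r'ATG(?:(?!TAA|TAG|TGA)...){9,}?(?:TAA|TAG|TGA)'

def codons(s):
    """Split s into consecutive three-character chunks."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]

# ATG, then nine chunks, then TAG. The chunks 'ATA' + 'GAA' hide a
# 'TAG' that straddles a chunk boundary (out of frame).
s = 'ATG' + 'AAA' + 'ATA' + 'GAA' + 'AAA' * 6 + 'TAG'

found = re.findall(regexp, s)
inner = codons(s)[1:-1]          # chunks between the ATG and the final stop

print(found == [s])              # the whole string matches
print('TAG' in s[:-3])           # a TAG does occur before the final one
print(any(c in STOPS for c in inner))  # but never as a whole chunk
```

The lookahead is only evaluated at chunk boundaries, so a stop codon that straddles two chunks is invisible to it. For reading frames that is usually the desired behaviour.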

#2


1  

If the length is not a requirement then it's pretty easy:


>>> import re
>>> seq= 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
>>> regex = re.compile(r'ATG(?:...)*?(?:TAA|TAG|TGA)')
>>> regex.findall(seq)
['ATGGAAGTTGGATGA']

Anyway, I believe that, according to your explanation, your previous regex is actually doing what you asked for: searching for matches of at least 30 characters that start with ATG and end with a stop codon (here TGA).

In your question you first state that you need matches of at least 30 characters, hence the {9,}?, but after that you expect it to return a match of any length. You cannot have both; choose one. If length is important, then keep the regex you already have: the result you are getting is correct.
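To see why the original pattern skips the first stop codon, it can help to list every in-frame stop codon after the first ATG and count the chunks in between. The function name in_frame_stops and the constant MIN_CHUNKS are made up for this sketch; it is not part of the regex solution, just a way to check the arithmetic.

```python
seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
STOPS = {'TAA', 'TAG', 'TGA'}
MIN_CHUNKS = 9  # mirrors the {9,} in the original regex

def in_frame_stops(sequence, min_chunks=MIN_CHUNKS):
    """Return (codon, index, chunks_between, long_enough) for each
    in-frame stop codon after the first ATG."""
    start = sequence.find('ATG')
    rows = []
    for i in range(start + 3, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if codon in STOPS:
            between = (i - start - 3) // 3  # chunks between ATG and this stop
            rows.append((codon, i, between, between >= min_chunks))
    return rows

rows = in_frame_stops(seq)
for row in rows:
    print(row)
# ('TGA', 12, 3, False)
# ('TAA', 24, 7, False)
# ('TGA', 39, 12, True)
```

Only the last stop codon has nine or more chunks in front of it, so a pattern that demands {9,} chunks has no choice but to run to the end of seq. The lazy ? makes the quantifier take as few chunks as possible, but never fewer than nine.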

#3


0  

You don't need regular expressions.


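def chunks(l, n):
    """ Yield successive n-sized chunks from l.
    from: http://*.com/a/312464/1561176
    """
    for i in range(0, len(l), n):  # range replaces Python 2's xrange
        yield l[i:i+n]

def method(sequence, start=('ATG',), stop=('TAA', 'TAG', 'TGA'), min_len=30):
    response = ''
    started = False
    for x in chunks(sequence, 3):
        if x in start:
            # A start codon opens a sequence; inside one it is just another chunk.
            started = True
            response += x
        elif x in stop and started:
            if len(response) >= min_len:
                # Long enough: emit the sequence including this stop codon.
                yield response + x
                response = ''
                started = False
            else:
                # Too short so far: keep the stop codon and continue.
                response += x
        elif started:
            response += x
    if response:
        # Leftover chunks that never reached a qualifying stop codon.
        yield response

for result in method('ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'):
    print(result)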

With a min_len of 30, the output is:

ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA

With a min_len of 0, the output is:

ATGGAAGTTGGATGA

#4


0  

Try this:


import re

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
OR_0 = re.findall(r'ATG(?:.{3})*?(?:TAA|TAG|TGA)', seq)
