Cutting within a pattern using Python regular expressions

Date: 2022-09-13 12:12:09

Objective: I am trying to perform a cut in Python RegEx where split doesn't quite do what I want. I need to cut within a pattern, but between characters.

What I am looking for:

I need to recognize the pattern below in a string, and split the string at the location of the pipe. The pipe isn't actually in the string, it just shows where I want to split.

Pattern: CDE|FG

String: ABCDEFGHIJKLMNOCDEFGZYPE

Results: ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

What I have tried:

It seems like using split with parentheses is close, but it doesn't keep the search pattern attached to the results the way I need it to.

re.split('CDE()FG', 'ABCDEFGHIJKLMNOCDEFGZYPE')

Gives:

['AB', '', 'HIJKLMNO', '', 'ZYPE']

When what I actually need is:

['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

Motivation:

I am practicing with RegEx and wanted to see whether I could use it to write a script that predicts the fragments produced when a protein is digested with specific proteases.

4 Solutions

#1


7  

A non-regex way would be to replace each occurrence of the pattern with the piped version and then split on the pipe. This only works if the string itself never contains a '|'.

>>> pattern = 'CDE|FG'
>>> s = 'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> s.replace('CDEFG',pattern).split('|')
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
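
The same insert-then-split idea can also be written with re.sub, which may help when the two sides of the cut are regular expressions rather than fixed text. This is only a small sketch: it assumes the character '|' never occurs in the input string, and the variable names marked and fragments are illustrative.

```python
import re

s = "ABCDEFGHIJKLMNOCDEFGZYPE"

# Capture both sides of the cut and put a pipe between them,
# then split on that pipe. Assumes "|" is not already present in s.
marked = re.sub(r"(CDE)(FG)", r"\1|\2", s)
fragments = marked.split("|")
print(fragments)
```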

#2


5  

You can solve it with re.split() and positive lookarounds:

>>> re.split(r"(?<=CDE)(\w+)(?=FG)", s)
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

Note that if one side of the cut sequence is an empty string, the resulting list will contain an empty string. You can handle that manually, as in this sample (I admit it is not that pretty):

import re

s = "ABCDEFGHIJKLMNOCDEFGZYPE"

cut_sequences = [
    ["CDE", "FG"],
    ["FGHI", ""],
    ["", "FGHI"]
]

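for left, right in cut_sequences:
    # Split on the text that sits between a "left" match and a "right" match;
    # the capturing group keeps that text in the result list.
    items = re.split(r"(?<={left})(\w+)(?={right})".format(left=left, right=right), s)

    # An empty left side makes the match start at position 0,
    # which leaves an empty first item: drop it.
    if not left:
        items = items[1:]

    # An empty right side lets the match run to the end of the string,
    # which leaves an empty last item: drop it.
    if not right:
        items = items[:-1]

    print(items)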

Prints:

['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
['ABCDEFGHI', 'JKLMNOCDEFGZYPE']
['ABCDE', 'FGHIJKLMNOCDEFGZYPE']
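
The `(\w+)` version above produces the expected list for this input, but the greedy group runs up to the last cut site, so a string with three or more cut sites would not be split at the middle ones. On Python 3.7 and later, re.split() also splits on patterns that match an empty string, so the two lookarounds alone may be enough. A minimal sketch under that assumption; the helper name cut_between and the use of re.escape (treating both sides as literal text) are illustrative choices, not part of the answer above:

```python
import re

def cut_between(s, left, right):
    """Split s at every position preceded by `left` and followed by `right`."""
    pattern = ""
    if left:
        # Lookbehind: the cut must come right after `left`.
        pattern += "(?<=" + re.escape(left) + ")"
    if right:
        # Lookahead: the cut must come right before `right`.
        pattern += "(?=" + re.escape(right) + ")"
    # Splitting on a zero-width match requires Python 3.7 or later.
    return re.split(pattern, s)

s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(cut_between(s, "CDE", "FG"))
print(cut_between(s, "FGHI", ""))
print(cut_between(s, "", "FGHI"))

# A string with three cut sites is split at all of them.
print(cut_between("ABCDEFGHCDEFGXCDEFGZ", "CDE", "FG"))
```

Because nothing is consumed or captured, no empty strings need to be trimmed afterwards in these cases.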

#3


2  

To keep the splitting pattern (or parts of it) in the output of re.split, enclose it in parentheses.

>>> data
'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> pieces = re.split(r"(CDE)(FG)", data)
>>> pieces
['AB', 'CDE', 'FG', 'HIJKLMNO', 'CDE', 'FG', 'ZYPE']

Easy enough. All the parts are there, but as you can see they have been separated. So we need to reassemble them. That's the trickier part. Look carefully and you'll see you need to join the first two pieces, the last two pieces, and the rest in triples. I simplify the code by padding the list, but you could do it with the original list (and a bit of extra code) if performance is a problem.

>>> pieces = [""] + pieces
>>> [ "".join(pieces[i:i+3]) for i in range(0,len(pieces), 3) ]
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

re.split() guarantees a piece for every capturing (parenthesized) group, plus a piece for what's between. With more complex regular expressions that need their own grouping, use non-capturing groups to keep the format of the returned data the same. (Otherwise you'll need to adapt the reassembly step.)
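
To illustrate the point about non-capturing groups, here is a small sketch with a made-up variation of the input in which the left side of the cut may be either CDE or CXE. The pattern and the sample string are assumptions chosen for the example only.

```python
import re

data = "ABCDEFGHIJKLMNOCXEFGZYPE"

# (?:D|X) groups the alternation without capturing, so each match still
# contributes exactly two captured pieces: the left part and the right part.
pieces = re.split(r"(C(?:D|X)E)(FG)", data)
print(pieces)

# Same reassembly as before: pad the front, then join in triples.
padded = [""] + pieces
fragments = ["".join(padded[i:i + 3]) for i in range(0, len(padded), 3)]
print(fragments)
```

Had the alternation been written as a capturing group, (D|X), every match would add a third captured piece and the join-in-triples step would no longer line up.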

PS. I also like Bhargav Rao's suggestion to insert a separator character in the string. If performance is not an issue, I guess it's a matter of taste.

Edit: Here's a (less transparent) way to do it without adding an empty string to the list:

pieces = re.split(r"(CDE)(FG)", data)
result = [ "".join(pieces[max(i-3,0):i]) for i in range(2,len(pieces)+2, 3) ]

#4


1  

A safer non-regex solution could be this:

def split(string, pattern):
    """Split the given string in the place indicated by a pipe (|) in the pattern"""
    safe_splitter = "#@#@SPLIT_HERE@#@#"
    safe_pattern = pattern.replace("|", safe_splitter)
    string = string.replace(pattern.replace("|", ""), safe_pattern)
    return string.split(safe_splitter)

s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(split(s, "CDE|FG"))
print(split(s, "|FG"))
print(split(s, "FGH|"))

https://repl.it/C448
