Python - Regex matching a variable that contains double brackets

Time: 2022-09-13 07:47:45

I have a DataFrame 'df' and a list of strings 'lst'. I want to iterate through the list and find the rows of the DataFrame that match strings from the list. The following code works fine if the list elements contain no brackets. It seems the regex is not defined properly, so the parentheses are not being matched.

import pandas as pd
import re

d = {'col1': ['100-(abc)','qwe-100-(abc)', '100-(abc)1', 
              'xyz', 'xyz2', 'zzz'], 
     'col2': ['100', '1001','200', '300', '400', '500']}

df = pd.DataFrame(d)

lst = ['100-(abc)', 'xyz']

for l in lst:
    print("======================")
    pattern = re.compile(r"(" + l + ")$")
    print(df[df.col1.str.contains(pattern, regex=True)])

Result:

======================
Empty DataFrame
Columns: [col1, col2]
Index: []
======================
  col1 col2
3  xyz  300

Expected result:

======================
            col1  col2
0      100-(abc)   100
1  qwe-100-(abc)  1001
======================
  col1 col2
3  xyz  300

3 solutions

#1


2  

You need to understand the following:

Regex reserves certain characters for special use. The opening parenthesis ( and the closing parenthesis ) are among them.

If you want to use any of these characters as a literal in a regex, you need to escape it with a backslash. To match 1+1=2, the correct regex is 1\+1=2; otherwise, the plus sign has a special meaning. The same goes for parentheses: to match (abc) you have to write \(abc\).
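To make the escaping rule concrete, here is a minimal sketch using only the standard re module. The sample strings are taken from the question, except 'qwe-100-abc', which is an illustrative value added to show what the unescaped pattern really looks for.

```python
import re

# Unescaped: the inner parentheses form a capture group, so this pattern
# looks for the text "100-abc" at the end of the string.
unescaped = re.compile(r"(100-(abc))$")

# Escaped: \( and \) match literal parenthesis characters.
escaped = re.compile(r"(100-\(abc\))$")

print(unescaped.search("100-(abc)"))      # no match, prints None
print(unescaped.search("qwe-100-abc"))    # matches "100-abc"
print(escaped.search("qwe-100-(abc)"))    # matches "100-(abc)"
print(escaped.search("100-(abc)1"))       # no match, "$" requires the end
```

This is why the first iteration in the question returns an empty DataFrame: no value in col1 ends with the text 100-abc.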

import pandas as pd
import re

d = {'col1': ['100-(abc)','qwe-100-(abc)', '100-(abc)1',
              'xyz', 'xyz2', 'zzz'],
     'col2': ['100', '1001','200', '300', '400', '500']}

df = pd.DataFrame(d)

lst = ['100-(abc)', 'xyz']


for l in lst:
    print("======================")
    if '(' in l:
        # Raw strings avoid invalid-escape warnings for '\(' and '\)'
        match = l.replace('(', r'\(').replace(')', r'\)')
        pattern = r"(" + match + ")$"
        print(df[df.col1.str.contains(pattern, regex=True)])
    else:
        pattern = r"(" + l + ")$"
        print(df[df.col1.str.contains(pattern, regex=True)])

Output:

======================
            col1  col2
0      100-(abc)   100
1  qwe-100-(abc)  1001
======================
  col1 col2
3  xyz  300
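The manual replace calls only handle parentheses. As a possible generalization, the standard library's re.escape escapes every regex metacharacter in a string, so the if/else branch is not needed. The sketch below reuses the data from the question; the helper name rows_ending_with is hypothetical and only for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "col1": ["100-(abc)", "qwe-100-(abc)", "100-(abc)1", "xyz", "xyz2", "zzz"],
    "col2": ["100", "1001", "200", "300", "400", "500"],
})
lst = ["100-(abc)", "xyz"]


def rows_ending_with(frame, literal):
    """Hypothetical helper: rows whose col1 ends with the literal text."""
    # re.escape turns "100-(abc)" into a pattern that matches it literally.
    pattern = re.escape(literal) + "$"
    return frame[frame["col1"].str.contains(pattern, regex=True)]


for item in lst:
    print("======================")
    print(rows_ending_with(df, item))
```

Leaving out the outer capturing parentheses also keeps pandas from warning about match groups in str.contains.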

#2


1  

Simply use isin:

df[df.col1.isin(lst)]


    col1        col2
0   100-(abc)   100
3   xyz         300

Edit: Add a regex pattern along with isin:

df[(df.col1.isin(lst)) | (df.col1.str.contains(r'\d+-\(.*\)$', regex=True))]

You get:

    col1            col2
0   100-(abc)       100
1   qwe-100-(abc)   1001
3   xyz             300
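Since isin only matches whole values and the extra regex above is hard-coded for one shape of string, another option may be worth considering: the question's pattern just anchors a fixed string at the end of each value, so a plain suffix test could do the job without any regex. A minimal sketch, assuming the same df and lst as in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["100-(abc)", "qwe-100-(abc)", "100-(abc)1", "xyz", "xyz2", "zzz"],
    "col2": ["100", "1001", "200", "300", "400", "500"],
})
lst = ["100-(abc)", "xyz"]

# str.endswith compares literal text, so parentheses need no escaping.
results = {item: df[df["col1"].str.endswith(item)] for item in lst}

for item, rows in results.items():
    print("======================")
    print(rows)
```

This trades flexibility for simplicity: it only works while the requirement stays "value ends with this exact text".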

#3


0  

Try this; it will work for your case.

I have edited the code. Check it; it gives exactly the expected output.

import pandas as pd
import re

d = {'col1': ['100-(abc)','qwe-100-(abc)', '100-(abc)1', 
              'xyz', 'xyz2', 'zzz'], 
     'col2': ['100', '1001','200', '300', '400', '500']}

df = pd.DataFrame(d)

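#lst = ['100-(abc)', 'xyz']
# Escape the parentheses so they match literally; [(abc)] would be a
# character class matching any single one of the characters ( a b c ).
lst2 = [r'\w.*\(abc\)$', r'xyz$']

for l in lst2:
    print("======================")
    pattern = re.compile(l)
    print(df[df.col1.str.contains(pattern, regex=True)])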

======================
            col1  col2
0      100-(abc)   100
1  qwe-100-(abc)  1001
======================
  col1 col2
3  xyz  300

This is what you wanted, right?
