Python: How do I delete rows that end with specific characters?

Time: 2022-10-10 22:20:18

I have a large data file and I need to delete rows that end in certain letters.

Here is an example of the file I'm using:

User Name     DN
MB212DA       CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net
MB423DA       CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB424PL       CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net
MBDA423       CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4       CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net

Code I am using:

import pandas as pd
f = pd.read_csv('test1.csv', sep=',', encoding='latin1')
df = f.loc[~(~pd.isnull(f['User Name']) & f['User Name'].str.contains("DA|PL"))]

How do I use regular expression syntax to delete the rows whose user name ends in "DA" or "PL", while making sure I do not delete other rows just because they contain "DA" or "PL" somewhere inside the name?

It should delete the rows and I end up with a file like this:

User Name     DN
MBDA423       CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4       CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net

The first 3 rows are deleted because they end in DA or PL.

3 solutions

#1


8  

You could use this expression:

df = df[~df['User Name'].str.contains('(?:DA|PL)$')]

It will return all rows that don't end in either DA or PL.

The ?: makes the group non-capturing, so the parentheses only group the two alternatives. Without it, pandas emits the following (harmless) warning:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
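
To see why the `$` anchor matters, here is a minimal sketch run against the sample user names from the question. The DataFrame is rebuilt by hand (an assumption; the real data would come from the CSV file), and the names `unanchored`, `anchored` and `kept` are only for illustration.

```python
import pandas as pd

# Sample user names copied from the question; building the frame by hand
# is an assumption made so the example is self-contained.
df = pd.DataFrame({
    "User Name": ["MB212DA", "MB423DA", "MB424PL", "MBDA423", "MB2ADA4"],
})

# Unanchored: matches "DA" or "PL" anywhere, so every sample row matches.
unanchored = df["User Name"].str.contains(r"DA|PL")

# Anchored with $: matches only when the name ends in "DA" or "PL".
anchored = df["User Name"].str.contains(r"(?:DA|PL)$")

kept = df[~anchored]
print(kept["User Name"].tolist())
```

Only the anchored pattern leaves MBDA423 and MB2ADA4 in place, because their "DA" is not at the end of the name.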

Alternatively, using endswith() and without regular expressions, the same filtering could be achieved by using the following expression:

df = df[~df['User Name'].str.endswith(('DA', 'PL'))]
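
The code in the question also guarded against missing user names. A minimal sketch of the same guard with `endswith()`, assuming the `na` parameter of `Series.str.endswith` is used to treat missing values as non-matches. The sample frame with a `None` entry is made up for illustration, and passing a tuple of suffixes is documented for pandas 1.5 and later.

```python
import pandas as pd

# Hypothetical sample with one missing user name.
df = pd.DataFrame({
    "User Name": ["MB212DA", None, "MB424PL", "MBDA423", "MB2ADA4"],
})

# na=False makes a missing name count as "does not end with DA/PL",
# so the mask is purely boolean and can be negated with ~.
mask = df["User Name"].str.endswith(("DA", "PL"), na=False)

kept = df[~mask]
print(kept)
```

With this choice, rows with a missing name are kept; passing `na=True` instead would drop them.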

As expected, the version without regular expressions is faster. Here is a simple test on big_df, which consists of 10001 copies of your original df:

# Create a larger DF to get better timing results
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
big_df = pd.concat([df] * 10001)

print(big_df.shape)

>> (50005, 2)

# Without regular expressions
%%timeit
big_df[~big_df['User Name'].str.endswith(('DA', 'PL'))]

>> 10 loops, best of 3: 22.3 ms per loop

# With regular expressions
%%timeit
big_df[~big_df['User Name'].str.contains('(?:DA|PL)$')]

>> 10 loops, best of 3: 61.8 ms per loop
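
Outside IPython the `%%timeit` magic is not available. A rough equivalent using the standard-library `timeit` module might look like the sketch below. The hand-built sample frame, the repeat count of 10, and the helper names `with_endswith` and `with_regex` are assumptions, and absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample user names from the question, rebuilt by hand for a standalone run.
df = pd.DataFrame({
    "User Name": ["MB212DA", "MB423DA", "MB424PL", "MBDA423", "MB2ADA4"],
})
# 10001 copies of the 5 sample rows, as in the test above.
big_df = pd.concat([df] * 10001, ignore_index=True)


def with_endswith():
    return big_df[~big_df["User Name"].str.endswith(("DA", "PL"))]


def with_regex():
    return big_df[~big_df["User Name"].str.contains(r"(?:DA|PL)$")]


# Both approaches should select exactly the same rows.
assert with_endswith().equals(with_regex())

for fn in (with_endswith, with_regex):
    seconds = timeit.timeit(fn, number=10)
    print(f"{fn.__name__}: {seconds / 10 * 1000:.1f} ms per call")
```

Checking that both filters return identical frames before timing them avoids comparing the speed of two different results.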

#2


2  

You can use a boolean mask that checks whether the last two characters of User_Name are not (~) in a set of two-character endings:

>>> df[~df.User_Name.str[-2:].isin(['DA', 'PL'])]
  User_Name                                                 DN
3   MBDA423      CN=MBDA423, OU=DNA, DC=prod, DC=trovp, DC=net
4   MB2ADA4  CN=MB2ADA4, OU=DNA, DC=prod, DC=trovp, DC=nete...

#3


0  

Instead of regular expressions, you can use the endswith() method to check if a string ends with a specific pattern.

For example:

kept = []
for row in rows:  # rows: an iterable of strings, e.g. the user names
    # str.endswith accepts a tuple of suffixes
    if not row.endswith(('DA', 'PL')):
        kept.append(row)

You should create another df from the filtered data, and then call its to_csv() method (DataFrame.to_csv) to save a clean version of your file.
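
A minimal sketch of that filter-then-save step, under stated assumptions: the input is stood in for by an in-memory CSV whose DN values are quoted (they contain commas), and the output goes to an in-memory buffer rather than a real file. The names `raw`, `clean` and `out` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real input file; the DN
# values are quoted because they contain commas.
raw = io.StringIO(
    'User Name,DN\n'
    'MB212DA,"CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net"\n'
    'MBDA423,"CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net"\n'
)
df = pd.read_csv(raw)

# Keep only the rows whose user name does not end in DA or PL.
clean = df[~df["User Name"].str.endswith(("DA", "PL"))]

# index=False keeps the row index out of the output.
out = io.StringIO()  # a file path such as "clean.csv" could be passed instead
clean.to_csv(out, index=False)
print(out.getvalue())
```

Writing to a new file rather than overwriting the original keeps the unfiltered data available if the suffix list turns out to be wrong.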
