How do I remove rows with duplicate column values in a pandas data frame?

Date: 2021-02-11 02:24:22

I have a pandas data frame that looks like this:

'Column1'  'Column2'  'Column3'
'cat'      'bat'      'xyz'
'toy'      'flower'   'abc'
'cat'      'bat'      'lmn'

I want to detect that the values 'cat' and 'bat' are repeated in Column1 and Column2, remove the repeated record, and keep only the first one. The resulting data frame should contain only:

'Column1'  'Column2'  'Column3'
'cat'      'bat'      'xyz'
'toy'      'flower'   'abc'

3 solutions

#1

Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first row of each group of duplicates.

If the data frame is:

import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
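
To inspect which rows would be dropped before actually removing them, a boolean mask from DataFrame.duplicated may help. This is only a minimal sketch and not part of the original answer: it reuses the question's sample data (without the extra quote characters), and the names mask and dupes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Column1': ['cat', 'toy', 'cat'],
                   'Column2': ['bat', 'flower', 'bat'],
                   'Column3': ['xyz', 'abc', 'lmn']})

# True for every row whose (Column1, Column2) pair already appeared earlier
mask = df.duplicated(subset=['Column1', 'Column2'], keep='first')

dupes = df[mask]        # rows that drop_duplicates would remove
result_df = df[~mask]   # same rows as drop_duplicates(subset=..., keep='first')

print(dupes)
print(result_df)
```

Filtering with ~mask should give the same rows as the drop_duplicates call above, while also leaving the removed rows available for review.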

#2

import pandas as pd

df = pd.DataFrame({"Column1": ["cat", "dog", "cat"],
                   "Column2": [1, 1, 1],
                   "Column3": ["C", "A", "B"]})

# Rows count as duplicates when Column1 repeats; the first occurrence is kept
df = df.drop_duplicates(subset=['Column1'], keep='first')
print(df)

#3

The drop_duplicates() method of a DataFrame accepts a list of column names (the subset parameter) that it uses to identify and remove duplicate records from your data.

The following tested code does this:

import pandas as pd

# Build the data frame one column at a time
df = pd.DataFrame()
df.insert(loc=0, column='Column1', value=['cat', 'toy', 'cat'])
df.insert(loc=1, column='Column2', value=['bat', 'flower', 'bat'])
df.insert(loc=2, column='Column3', value=['xyz', 'abc', 'lmn'])

# Drop rows that repeat an earlier (Column1, Column2) pair
df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(df)

You can list other column names in the subset parameter as well; by default, all columns of your data are considered. The keep parameter accepts the following values:

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates, keeping none of the repeated rows.
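
The three keep values can be compared side by side on the question's data. The following is a small sketch under that assumption, not code from the original answer; the variable names cols, keep_first, keep_last and keep_none are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Column1': ['cat', 'toy', 'cat'],
                   'Column2': ['bat', 'flower', 'bat'],
                   'Column3': ['xyz', 'abc', 'lmn']})
cols = ['Column1', 'Column2']

keep_first = df.drop_duplicates(subset=cols, keep='first')  # keeps rows 0 and 1
keep_last = df.drop_duplicates(subset=cols, keep='last')    # keeps rows 1 and 2
keep_none = df.drop_duplicates(subset=cols, keep=False)     # keeps row 1 only

# Show which original index labels survive for each option
for name, frame in [('first', keep_first), ('last', keep_last), ('False', keep_none)]:
    print(name, frame.index.tolist())
```

Note that the surviving rows keep their original index labels; if a fresh consecutive index is wanted, calling reset_index(drop=True) on the result is one option.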
