python pandas: remove duplicates in column A, keeping the row with the highest value in column B

Time: 2022-12-20 04:45:15

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

Wes has added some nice functionality to drop duplicates: http://wesmckinney.com/blog/?p=340. But AFAICT, it's designed for exact duplicates, so there's no mention of criteria for selecting which rows get kept.

I'm guessing there's probably an easy way to do this---maybe as easy as sorting the dataframe before dropping duplicates---but I don't know groupby's internal logic well enough to figure it out. Any suggestions?

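For reference, a minimal sketch that builds the sample frame above (my setup, not part of the original question; it assumes pandas is imported as pd):

import pandas as pd

# sample data from the question: duplicate keys in A, values in B
df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [10, 20, 30, 40, 10]})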

8 Answers

#1 (score: 92)

This takes the last row in each group, though not necessarily the one with the maximum B:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
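
A small follow-up (mine, not part of the original answer): the group key ends up as the index of that result, so if you prefer a plain integer index you can drop it. A sketch, assuming the same df:

df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()]).reset_index(drop=True)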

#2 (score: 15)

The top answer does too much work and looks very slow for larger data sets. apply is slow and should be avoided if possible, and ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need:

df.groupby('A', as_index=False).max()

#3 (score: 7)

Try this:

df.groupby(['A']).max()
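
Note (mine, not the answer's): this leaves A as the index of the result. If you want A back as a regular column, a reset_index() sketch like this should do it, assuming the same df:

df.groupby(['A']).max().reset_index()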

#4 (score: 2)

You can also try this:

df.drop_duplicates(subset='A', keep='last')

I took this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
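
A caveat worth adding (mine, not part of this answer): keep='last' only picks the highest-B row if the rows happen to be ordered that way within each A group, as they are in the sample data. Sorting on B first makes that explicit; a sketch under that assumption:

df.sort_values('B').drop_duplicates(subset='A', keep='last')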

#5 (score: 1)

I don't think you really need a groupby in your case. I would sort column B in descending order, then drop duplicates on column A; if you want, you can also get a nice clean new index like this:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

#6 (score: 0)

This also works:

a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})

#7 (score: 0)

The posts already given answer the question. I just made a small change, naming the column on which max() is applied, for better readability:

df.groupby('A', as_index=False)['B'].max()
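
A related variant (mine, not from this answer): if the frame has more columns and you want the whole winning row back rather than just A and B, the per-group idxmax from answer #1 can also be used as a row selector on the original frame. A sketch, assuming the same df and unique index labels:

df.loc[df.groupby('A')['B'].idxmax()]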

#8 (score: -4)

I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
