Pandas: how do I select the first row in each group?

Time: 2022-04-04 01:22:16

Basically the same as "Select first row in each GROUP BY group?", only in pandas.

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'B': ['3', '1', '2', '4', '2', '4', '1', '3']})

Sorting looks promising:

df.sort_values('B')

     A  B
1  foo  1
6  bar  1
2  foo  2
4  bar  2
0  foo  3
7  bar  3
3  foo  4
5  bar  4

But then first doesn't give the desired result:

df.groupby('A').first()

     B
A     
bar  2
foo  3

3 Solutions

#1

Generally, if you want the rows within each group ordered by a column that is not one of the grouping columns, it is better to sort the df before performing the groupby:

In [5]:
df.sort_values('B').groupby('A').first()

Out[5]:
     B
A     
bar  1
foo  1
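
As a minimal sketch of the sort-then-group idea above, using the question's sample frame: first() aggregates each group into one row indexed by A, while head(1) is an alternative that returns the full original rows and keeps their index labels. The variable names first_values and first_rows are only illustrative. Note that B holds strings here, so the sort is lexicographic rather than numeric.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
})

# sort_values returns a sorted copy; chaining avoids relying on df being modified.
# first() then picks the first value of B in each group and indexes the result by A.
first_values = df.sort_values('B').groupby('A').first()

# head(1) keeps whole rows (both A and B) with their original index labels.
first_rows = df.sort_values('B').groupby('A').head(1)

print(first_values)
print(first_rows)
```

If B should be compared as numbers, converting it first (for example with df['B'].astype(int)) avoids surprises such as '10' sorting before '2'.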

#2

The pandas groupby function could be used for what you want, but it's really meant for aggregation. This is a simple 'take the first' operation.

What you actually want is the pandas drop_duplicates method, which by default keeps the first row of each set of duplicates. Pass what you would normally use as the groupby key as the subset= argument:

df.drop_duplicates(subset='A')

Should do what you want.

Also, df.sort_values('A') (df.sort('A') in old pandas versions) does not sort the DataFrame df in place; it returns a sorted copy. To sort df itself, pass inplace=True or assign the result back to df:

df.sort_values('A', inplace=True)
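
The two points above can be combined in a short sketch, again on the question's sample frame. It assumes the goal is the smallest B per group, so it sorts by B before dropping duplicates; the names sorted_df, smallest_per_group, and first_in_original_order are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
})

# sort_values returns a new, sorted DataFrame and leaves df untouched.
sorted_df = df.sort_values('B')

# keep='first' is the default: the first row seen for each value of A survives.
smallest_per_group = sorted_df.drop_duplicates(subset='A')

# Without sorting first, the "first" row is simply the first in the original order.
first_in_original_order = df.drop_duplicates(subset='A')

print(smallest_per_group)
print(first_in_original_order)
```

Unlike groupby('A').first(), this keeps every column and the original index labels of the selected rows.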

#3

Here's an alternative approach using groupby().rank():

df[ df.groupby('A')['B'].rank() == 1 ]

     A  B
1  foo  1
6  bar  1

This gives you the same answer as @EdChum's for the OP's sample dataframe, but it could give a different answer if there are ties in the sort column, for example with data like this:

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'], 
                   'B': ['2', '1', '1', '1'] })

In this case you have some options using the optional method argument, depending on how you wish to handle sorting ties:

df[ df.groupby('A')['B'].rank(method='average') == 1 ]   # the default
df[ df.groupby('A')['B'].rank(method='min')     == 1 ]
df[ df.groupby('A')['B'].rank(method='first')   == 1 ]   # doesn't work, not sure why
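
To make the effect of each tie-breaking method concrete, here is a hedged sketch on the tied sample data. It assumes B contains numeric strings and converts it to integers before ranking. One possible explanation for the method='first' failure noted above, offered as an assumption rather than a confirmed diagnosis, is that B holds strings and some pandas versions reject method='first' for non-numeric data. The names b_numeric, rows_avg, rows_min, and rows_first are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['2', '1', '1', '1']})

# Assumption: B holds numeric strings, so it can be ranked as integers.
b_numeric = df['B'].astype(int)

# Rank B within each group of A under three tie-breaking rules.
ranks_avg = b_numeric.groupby(df['A']).rank(method='average')  # ties share the mean rank
ranks_min = b_numeric.groupby(df['A']).rank(method='min')      # ties share the lowest rank
ranks_first = b_numeric.groupby(df['A']).rank(method='first')  # ties ranked by order of appearance

rows_avg = df[ranks_avg == 1]      # tied 'bar' rows both rank 1.5, so none is selected
rows_min = df[ranks_min == 1]      # both tied 'bar' rows are selected
rows_first = df[ranks_first == 1]  # exactly one row per group

print(rows_avg)
print(rows_min)
print(rows_first)
```

Only method='first' guarantees exactly one row per group when ties are present, which is usually what "select the first row" means.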
