Pandas: get the top n records within each group

Time: 2022-07-06 22:58:28

Suppose I have a pandas DataFrame like this:

>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

I want to get a new DataFrame with the top 2 records for each id, like this:

   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

I can do it by numbering the records within each group after a groupby:

>>> dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
>>> dfN
   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
>>> dfN[dfN['level_1'] <= 1][['id', 'value']]
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

But is there a more efficient/elegant approach to do this? Also, is there a more elegant way to number the records within each group (like the SQL window function row_number())?

2 solutions

#1


104  

Did you try df.groupby('id').head(2)?

Output generated:

>>> df.groupby('id').head(2)
       id  value
id             
1  0   1      1
   1   1      2 
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to sort the data first, depending on which rows should count as the "top" ones.)
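
As a minimal sketch of that sorting step, assuming "top" means the largest value within each id (the question itself only asks for the first two rows), one could sort before grouping. The variable name top2 is just an illustration:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Put the largest values first, then keep the first 2 rows of each id.
top2 = df.sort_values('value', ascending=False).groupby('id').head(2)

# Optional: restore a readable order (by id, largest value first).
top2 = top2.sort_values(['id', 'value'], ascending=[True, False])
print(top2)
```

The result keeps the original row labels; append .reset_index(drop=True) if a fresh 0..n-1 index is wanted.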

EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the MultiIndex and flatten the results.

>>> df.groupby('id').head(2).reset_index(drop=True)
    id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
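
For the row_number() part of the question, one possible approach is GroupBy.cumcount(), which numbers the rows within each group starting from 0. The sketch below adds 1 to mimic SQL's 1-based numbering; the column name row_number is an assumption chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# cumcount() yields 0, 1, 2, ... within each id; +1 makes it 1-based
# like SQL's row_number().
df['row_number'] = df.groupby('id').cumcount() + 1

# Keep the first two rows of every group.
top2 = df[df['row_number'] <= 2]
print(top2)
```

The numbering follows the current row order, so sort the frame first if the numbering should follow some other order.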

#2


84  

Since 0.14.1, you can use nlargest and nsmallest on a groupby object:

In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]: 
id   
1   2    3
    1    2
2   6    4
    5    3
3   7    1
4   8    1
dtype: int64

There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.

If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
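
A small sketch of that step, using the same example data; the names s and flat are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Result is indexed by (id, original row label).
s = df.groupby('id')['value'].nlargest(2)

# Drop the inner level (the original row labels), keeping only id.
flat = s.reset_index(level=1, drop=True)
print(flat)
```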

(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
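
One way the retained original index can be put to use, sketched here under the assumption that the DataFrame's index is unique, is to look the full rows back up in df. This returns every column while still going through the SeriesGroupBy method:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# The second index level of the nlargest result holds the original row labels.
labels = df.groupby('id')['value'].nlargest(2).index.get_level_values(1)

# Select the matching full rows (assumes df.index has no duplicates).
top2 = df.loc[labels]
print(top2)
```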
