Pandas DataFrame: get the first row of each group

Time: 2022-11-22 21:22:15

I have a pandas DataFrame like the following.

import pandas as pd

df = pd.DataFrame({'id'    : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                   'value' : ["first","second","second","first",
                              "second","first","third","fourth",
                              "fifth","second","fifth","first",
                              "first","second","third","fourth","fifth"]})

I want to group this by ["id","value"] and get the first row of each group.

        id   value
0        1   first
1        1  second
2        1  second
3        2   first
4        2  second
5        3   first
6        3   third
7        3  fourth
8        3   fifth
9        4  second
10       4   fifth
11       5   first
12       6   first
13       6  second
14       6   third
15       7  fourth
16       7   fifth

Expected outcome:

    id   value
     1   first
     2   first
     3   first
     4  second
     5   first
     6   first
     7  fourth

I tried the following, which only gives the first row of the DataFrame. Any help with this is appreciated.

In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

4 solutions

#1


128  

>>> df.groupby('id').first()
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as a column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get the first n records of each group, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth
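
For anyone who wants to check this end to end, here is a minimal self-contained sketch that rebuilds the question's DataFrame and runs the calls above. The variable names (first_rows, first_two) are my own and purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# First row per id, with id restored as an ordinary column.
first_rows = df.groupby('id').first().reset_index()

# First two rows per id; head() keeps the original rows and their index,
# so the index is reset to get a clean 0..n-1 range.
first_two = df.groupby('id').head(2).reset_index(drop=True)

print(first_rows)
print(first_two)
```

Groups with fewer than n rows (id 5 here) simply contribute the rows they have.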

#2


29  

This will give you the second row of each group (nth is zero-indexed, so nth(0) selects the first row, much like first()):

df.groupby('id').nth(1) 

Documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group
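
A minimal sketch of nth(1) on the question's data. The shape of the result depends on the pandas version (older releases index it by id, and I believe pandas 2.0 and later keep the original row index), so this example only reads the value column, which is present either way. The names second_rows and second_values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# nth(1) picks the row at position 1 (the second row) of every group.
second_rows = df.groupby('id').nth(1)

# Read only the 'value' column so the code does not depend on how the
# result is indexed in a given pandas version.
second_values = second_rows['value'].tolist()
print(second_values)
```

Groups with a single row, such as id 5, have no second row and are left out of the result.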


#3


5  

I'd suggest using .nth(0) rather than .first() if you need the first row.

The difference between them is how they handle NaNs: .nth(0) returns the first row of each group regardless of the values in that row, while .first() returns the first non-NaN value in each column.

For example, if your dataset is:

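import numpy as np
import pandas as pd

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
            'value'  : ["first","second","third", np.nan,
                        "second","first","second","third",
                        "fourth","first","second"]})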

>>> df.groupby('id').nth(0)
    value
id        
1    first
2    NaN
3    first
4    first

And

>>> df.groupby('id').first()
    value
id        
1    first
2    second
3    first
4    first
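
The difference can be checked with a short script. This is only a sketch: it reads the value column as a list so that it does not depend on how a given pandas version indexes the nth() result, and it adds head(1) (from the first answer, not this one) as another way to take the literal first row. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    'value': ["first", "second", "third", np.nan, "second", "first",
              "second", "third", "fourth", "first", "second"],
})

# first() skips NaN within each column, so group 2 reports "second".
skipping = df.groupby('id').first()['value'].tolist()

# nth(0) and head(1) take the literal first row, NaN included.
literal_nth = df.groupby('id').nth(0)['value'].tolist()
literal_head = df.groupby('id').head(1)['value'].tolist()

print(skipping)
print(literal_nth)
print(literal_head)
```

With several columns, first() can therefore combine values from different rows of the same group, which is usually not what "first row" means.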

#4


4  

Maybe this is what you want:

import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'],   ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)
                pop
state1 county1   12
       county2   15
       county3   65
       county4   42
state2 county1   78
       county2   67
       county3   55
       county4   31
df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

> Out[29]: 
                pop
state1 county3   65
       county4   42
       county2   15
state2 county1   78
       county2   67
       county3   55
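
As a possible alternative to the apply call above (my own variation, not part of the answer), the frame can be sorted once and then cut with groupby().head(). The sketch below assumes that it is acceptable for the states to come out interleaved; top3 and top3_by_state are illustrative names.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['state1', 'state2'], ['county1', 'county2', 'county3', 'county4']]
)
df = pd.DataFrame({'pop': [12, 15, 65, 42, 78, 67, 55, 31]}, index=idx)

# Sort the whole frame once, then keep the first 3 rows seen for each state.
# groupby().head() preserves the current row order, so each state's rows
# stay in descending 'pop' order (rows of different states are interleaved).
top3 = df.sort_values('pop', ascending=False).groupby(level=0).head(3)

# Collect the populations per state for a quick check.
top3_by_state = top3.groupby(level=0)['pop'].apply(list).to_dict()
print(top3_by_state)
```

This avoids a Python-level lambda per group, at the cost of a result that is ordered by pop rather than by state.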
