Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)

Date: 2023-02-07 15:25:37

I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done parsing a file, the DataFrame looks something like this:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date                                                                                      
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

The problem I'm having is that sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file. Simple example of such a case is illustrated below:

import pandas
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
index = pandas.date_range(start=startdate, end=enddate, freq='H')  # hourly DatetimeIndex
data1 = {'A': range(6), 'B': range(6)}
data2 = {'A': [20, -30, 40], 'B': [-50, 60, -70]}
df1 = pandas.DataFrame(data=data1, index=index)
df2 = pandas.DataFrame(data=data2, index=index[:3])
df3 = pandas.concat([df2, df1])  # DataFrame.append was removed in recent pandas
df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select out the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the group_by or pivot (or ???) statements to make that work.
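
For reference, a minimal sketch of that row-number idea, assuming the df3 built above (the answers below give cleaner approaches):

df3['rownum'] = range(df3.shape[0])
# within each index value, keep only the row with the largest rownum (the bottom-most one)
keep_last = df3.groupby(level=0)['rownum'].transform('max') == df3['rownum']
df4 = df3[keep_last].drop(columns='rownum').sort_index()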

4 Answers

#1 (230 votes)

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')]

While all the other methods work, the currently accepted answer is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop

Note that you can keep the last element by changing the keep argument.
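
For instance, on the question's df3, where the rows that should win were appended last, keeping the last occurrence and sorting reproduces the desired frame (a quick sketch):

df3 = df3[~df3.index.duplicated(keep='last')].sort_index()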

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop

#2 (93 votes)

Note: there is a better answer (#1 above, using index.duplicated) based on more recent Pandas.

This should be the accepted answer.

My original answer, which is now outdated, is kept below for reference.

A simple solution is to use drop_duplicates:

df4 = df3.drop_duplicates(subset='rownum', keep='last')

For me, this operated quickly on large data sets.

This requires that 'rownum' be the column with duplicates. In the modified example, 'rownum' has no duplicates, so nothing gets eliminated. What we really want is to point the subset argument at the index itself, but I've not found a way to tell drop_duplicates to consider only the index.

Here is a solution that adds the index as a dataframe column, drops duplicates on that, then removes the new column:

df3 = df3.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')

And if you want things back in the proper order, just call sort_index on the dataframe.

df3 = df3.sort_index()

#3 (55 votes)

Oh my. This is actually so simple!

grouped = df3.groupby(level=0)  # group rows by the (duplicated) DatetimeIndex
df4 = grouped.last()            # keep the bottom-most row from each group
df4
                      A   B  rownum

2001-01-01 00:00:00   0   0       6
2001-01-01 01:00:00   1   1       7
2001-01-01 02:00:00   2   2       8
2001-01-01 03:00:00   3   3       3
2001-01-01 04:00:00   4   4       4
2001-01-01 05:00:00   5   5       5
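
If the helper rownum column is no longer needed after the de-duplication, it can be dropped (a small, hypothetical cleanup step):

df4 = df4.drop(columns='rownum')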

Follow-up edit 2013-10-29: In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:

import numpy as np
import pandas

# fake index
idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])

# random data + naming the index levels
df1 = pandas.DataFrame(np.random.normal(size=(5,2)), index=idx, columns=['colA', 'colB'])
df1.index.names = ['iA', 'iB']

# artificially append some duplicate data
# (DataFrame.append and DataFrame.select were removed in recent pandas)
dups = df1.loc[df1.index.get_level_values('iB').isin(['c', 'e'])]
df1 = pandas.concat([df1, dups])
df1
#           colA      colB
#iA iB                    
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
#   c   0.275806 -0.078871  # <--- dup 1
#   e  -0.066680  0.607233  # <--- dup 2

And here's the important part:

# group the data, using df1.index.names tells pandas to look at the entire index
groups = df1.groupby(level=df1.index.names)  
groups.last() # or .first()
#           colA      colB
#iA iB                    
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233

#4 (4 votes)

Unfortunately, I don't think Pandas allows one to drop dups off the indices. I would suggest the following:

df3 = df3.reset_index()  # makes the date column part of your data
df3.columns = ['timestamp', 'A', 'B', 'rownum']  # set names
df3 = df3.drop_duplicates('timestamp', keep='last').set_index('timestamp')  # done! (keep='last' replaces the old take_last=True)
