How do I resample a DataFrame with a datetime index into exactly n equal-size periods?

Posted: 2022-09-23 15:48:26

I've got a large dataframe with a datetime index and need to resample the data to exactly 10 equally sized periods.

So far, I've tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to determine the size of each period, and then resampling using that number of days, e.g.:

first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last - first).days // 10) + 'D'   # floor division: whole days only

df.resample(periodsize).sum()

This doesn't guarantee exactly 10 periods in the df after resampling, since the period size is an integer number of days, rounded down. Using a float doesn't work in the resampling. It seems that either there's something simple I'm missing here, or I'm attacking the problem all wrong.
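
For illustration, here is a minimal, self-contained reproduction of the problem (my example, with a hypothetical 43-day index): the floored period size leaves a remainder, so the resample spills into an eleventh group.

import numpy as np
import pandas as pd

# hypothetical data: 43 daily rows, so 43 // 10 floors to a 4-day period
index = pd.date_range('2000-01-01', periods=43, freq='D')
df = pd.DataFrame(np.ones(43), index=index)

periodsize = str(43 // 10) + 'D'        # '4D'
result = df.resample(periodsize).sum()
print(len(result))                      # 11 -- one group too many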

2 Answers

#1


import numpy as np
import pandas as pd

n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
#             0
# 2000-01-01  1
# 2000-01-02  1
# ...
# 2000-02-01  1
# 2000-02-02  1

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')   # pad one day so the last row lands inside the final bin
secs = int((last - first).total_seconds() // n)
periodsize = '{:d}s'.format(secs)            # a seconds-based frequency string, here '285120s'

result = df.resample(periodsize).sum()
print('\n{}'.format(result))
assert len(result) == n

yields

                     0
2000-01-01 00:00:00  4
2000-01-04 07:12:00  3
2000-01-07 14:24:00  3
2000-01-10 21:36:00  4
2000-01-14 04:48:00  3
2000-01-17 12:00:00  3
2000-01-20 19:12:00  4
2000-01-24 02:24:00  3
2000-01-27 09:36:00  3
2000-01-30 16:48:00  3

The values in the 0 column indicate the number of rows that were aggregated, since the original DataFrame was filled with 1's. The pattern of 4's and 3's is about as even as you can get, since 33 rows cannot be split evenly into 10 groups.
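
A quick arithmetic check (mine, not part of the original answer) confirms the split:

# 33 rows over 10 groups: every group gets 3 rows, and the 3 leftover
# rows account for the three 4's in the output above
q, r = divmod(33, 10)                     # q == 3, r == 3
counts = [4, 3, 3, 4, 3, 3, 4, 3, 3, 3]  # the column of sums above
assert sum(counts) == 33 and len(counts) == 10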


Explanation: Consider this simpler DataFrame:

n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
#             0
# 2000-01-01  1
# 2000-01-02  1
# 2000-01-03  1
# 2000-01-04  1
# 2000-01-05  1

Using df.resample('2D').sum() gives the wrong number of groups:

In [366]: df.resample('2D').sum()
Out[366]: 
            0
2000-01-01  2
2000-01-03  2
2000-01-05  1

Using df.resample('3D').sum() gives the right number of groups, but the second group starts at 2000-01-04, which does not divide the DataFrame into two equally-spaced groups:

In [367]: df.resample('3D').sum()
Out[367]: 
            0
2000-01-01  3
2000-01-04  2

To do better, we need to work at a finer time resolution than days. Since Timedeltas have a total_seconds method, let's work in seconds. For the example above, the desired frequency string would be

In [374]: df.resample('216000s').sum()
Out[374]: 
                     0
2000-01-01 00:00:00  3
2000-01-03 12:00:00  2

since there are 216000*2 seconds in 5 days:

In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1s'))/2
Out[373]: 216000.0

Okay, so now all we need is a way to generalize this. We'll want the minimum and maximum dates in the index:

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')

We add an extra day because it makes the difference in days come out right. In the example above, there are only 4 days between the Timestamps 2000-01-01 and 2000-01-05:

In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[377]: 4

But as we can see in the worked example, the DataFrame has 5 rows representing 5 days, so it makes sense that we need to add an extra day.
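
To see the effect of the pad (a quick check of my own): without the extra day the span is 4 days, which halves into 2-day bins that end before the last row and push it into a third bin; with it, each bin is 2.5 days, matching the 216000-second frequency above.

span_short = pd.Timestamp('2000-01-05') - pd.Timestamp('2000-01-01')   # 4 days
span_padded = span_short + pd.Timedelta('1D')                          # 5 days
print(span_short.total_seconds() / 2)    # 172800.0 -> 2-day bins, too short
print(span_padded.total_seconds() / 2)   # 216000.0 -> 2.5-day bins, correct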

Now we can compute the correct number of seconds in each equally-spaced group with:

secs = int((last-first).total_seconds()//n)
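
Putting the pieces together, here is a reusable sketch (my wrapper, not from the original answer; it uses the modern .resample(...).sum() spelling, and assumes pad matches the index resolution, e.g. '1D' for daily data):

import pandas as pd

def resample_into_n_periods(df, n, pad='1D'):
    """Resample a datetime-indexed df into exactly n equal-length periods."""
    first = df.index.min()
    # pad past the last timestamp so the final row lands inside the last bin
    last = df.index.max() + pd.Timedelta(pad)
    secs = int((last - first).total_seconds() // n)
    return df.resample('{:d}s'.format(secs)).sum()

# usage with the 33-row example above
result = resample_into_n_periods(df, 10)
assert len(result) == 10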

#2


Here is one way to ensure equal-size sub-periods: compute equally spaced cutoff points with np.linspace() on the index's time deltas, then classify each observation into a bin with pd.cut.

import pandas as pd
import numpy as np

# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'],
                  index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8h'))

Out[87]: 
                          A       B
2015-01-01 00:00:00  1.7641  0.4002
2015-01-01 08:00:00  0.9787  2.2409
2015-01-01 16:00:00  1.8676 -0.9773
2015-01-02 00:00:00  0.9501 -0.1514
2015-01-02 08:00:00 -0.1032  0.4106
2015-01-02 16:00:00  0.1440  1.4543
2015-01-03 00:00:00  0.7610  0.1217
2015-01-03 08:00:00  0.4439  0.3337
2015-01-03 16:00:00  1.4941 -0.2052
2015-01-04 00:00:00  0.3131 -0.8541
2015-01-04 08:00:00 -2.5530  0.6536
2015-01-04 16:00:00  0.8644 -0.7422
2015-01-05 00:00:00  2.2698 -1.4544
2015-01-05 08:00:00  0.0458 -0.1872
2015-01-05 16:00:00  1.5328  1.4694
...                     ...     ...
2015-01-29 08:00:00  0.9209  0.3187
2015-01-29 16:00:00  0.8568 -0.6510
2015-01-30 00:00:00 -1.0342  0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555  0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00  0.6252 -1.6021
2015-02-01 00:00:00 -1.1044  0.0522
2015-02-01 08:00:00 -0.7396  1.5430
2015-02-01 16:00:00 -1.2929  0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00  0.5233 -0.1715
2015-02-02 16:00:00  0.7718  0.8235
2015-02-03 00:00:00  2.1632  1.3365

[100 rows x 2 columns]


# cutoff points: 10 equal-size groups require 11 points,
# measured as timedeltas in hours
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels: the corresponding time index
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])

# create categorical reference variables
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[:-1])
# for clarity, also label each row with its end-period index
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[1:])

Out[89]: 
                          A       B    start_time_index      end_time_index
2015-01-01 00:00:00  1.7641  0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00  0.9787  2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00  1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00  0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032  0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00  0.1440  1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00  0.7610  0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00  0.4439  0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00  1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00  0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530  0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00  0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00  2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00  0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00  1.5328  1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
...                     ...     ...                 ...                 ...
2015-01-29 08:00:00  0.9209  0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00  0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342  0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555  0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00  0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044  0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396  1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929  0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00  0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00  0.7718  0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00  2.1632  1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00

[100 rows x 4 columns]

df.groupby('start_time_index').agg('sum')

Out[90]: 
                          A       B
start_time_index                   
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -0.0902 -2.5178

Another, potentially shorter, way to do this is to pass the computed time delta as the sampling frequency. The problem, as shown below, is that it delivers 11 sub-samples instead of 10. I believe the reason is that resample implements a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) sub-sampling scheme, so the very last observation at '2015-02-03 00:00:00' is treated as a separate group. If we use pd.cut to do it ourselves, we can specify include_lowest=True so that it gives us exactly 10 sub-samples rather than 11.

n = 10
time_delta_str = str((df.index[-1] - df.index[0]) / (pd.Timedelta('1s') * n)) + 's'
# select A and B first: the categorical label columns added above can't be summed
df[['A', 'B']].resample(pd.Timedelta(time_delta_str)).sum()

Out[114]: 
                          A       B
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00  2.1632  1.3365
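
Here is a sketch of that pd.cut fix (my adaptation, reusing ts_cutoff and time_index from above: pass the explicit bin edges instead of bins=10, and set include_lowest=True so the first observation falls inside the first bin):

labels = pd.cut(time_delta_in_hours, bins=ts_cutoff,
                labels=time_index[:-1], include_lowest=True)
grouped = df[['A', 'B']].groupby(labels, observed=False).sum()
assert len(grouped) == 10   # exactly 10 sub-samples, not 11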
