Pandas DataFrame: how to get the minimum value across different rows and columns

Time: 2022-12-13 22:58:37

I have a Pandas DataFrame that looks similar to this but with 10,000 rows and 500 columns.



For each row, I would like to find the minimum value between 3 days ago at 15:00 and today at 13:30.


Is there some native numpy way to do this quickly? My goal is to be able to get the minimum value for each row by saying something like "what is the minimum value from 3 days ago at 15:00 to 0 days ago (a.k.a. today) at 13:30?"

For this particular example the answers for the last two rows would be:


2011-01-09 2481.22
2011-01-10 2481.22

My current way is this:


1. Get the earliest row (only the values after the start time)
2. Get the middle rows 
3. Get the last row (only the values before the end time)
4. Concat (1), (2), and (3)
5. Get the minimum of (4)

But this takes a very long time on a large DataFrame (a rough sketch of this approach follows).
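A minimal sketch of the slow approach, using the df constructed below; the function name and arguments are illustrative, not part of the original code:

import datetime
import pandas

def naive_window_min(df, start_date, start_t, end_date, end_t):
    first = df.loc[start_date, start_t:]             # (1) first day, values after the start time
    middle = df.loc[start_date:end_date].iloc[1:-1]  # (2) the full days in between
    last = df.loc[end_date, :end_t]                  # (3) last day, values before the end time
    combined = pandas.concat([first, middle.stack(), last])  # (4) concat the three pieces
    return combined.min()                            # (5) min of (4)

naive_window_min(df, '2011-01-07', datetime.time(15, 0),
                 '2011-01-10', datetime.time(13, 30))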


The following code will generate a similar DF:


import numpy
import pandas
import datetime

numpy.random.seed(0)

random_numbers = (numpy.random.rand(10, 8)*100 + 2000)
columns        = [datetime.time(13,0) , datetime.time(13,30), datetime.time(14,0), datetime.time(14,30) , datetime.time(15,0), datetime.time(15,30) ,datetime.time(16,0), datetime.time(16,30)] 
index          = pandas.date_range('2011/1/1', '2011/1/10')
df             = pandas.DataFrame(data = random_numbers, columns=columns, index = index).astype(int)

print(df)

Here is the JSON version of the DataFrame:

'{"13:00:00":{"1293840000000":2085,"1293926400000":2062,"1294012800000":2035,"1294099200000":2086,"1294185600000":2006,"1294272000000":2097,"1294358400000":2078,"1294444800000":2055,"1294531200000":2023,"1294617600000":2024},"13:30:00":{"1293840000000":2045,"1293926400000":2039,"1294012800000":2035,"1294099200000":2045,"1294185600000":2025,"1294272000000":2099,"1294358400000":2028,"1294444800000":2028,"1294531200000":2034,"1294617600000":2010},"14:00:00":{"1293840000000":2095,"1293926400000":2006,"1294012800000":2001,"1294099200000":2032,"1294185600000":2022,"1294272000000":2040,"1294358400000":2024,"1294444800000":2070,"1294531200000":2081,"1294617600000":2095},"14:30:00":{"1293840000000":2057,"1293926400000":2042,"1294012800000":2018,"1294099200000":2023,"1294185600000":2025,"1294272000000":2016,"1294358400000":2066,"1294444800000":2041,"1294531200000":2098,"1294617600000":2023},"15:00:00":{"1293840000000":2082,"1293926400000":2025,"1294012800000":2040,"1294099200000":2061,"1294185600000":2013,"1294272000000":2063,"1294358400000":2024,"1294444800000":2036,"1294531200000":2096,"1294617600000":2068},"15:30:00":{"1293840000000":2090,"1293926400000":2084,"1294012800000":2092,"1294099200000":2003,"1294185600000":2001,"1294272000000":2049,"1294358400000":2066,"1294444800000":2082,"1294531200000":2090,"1294617600000":2005},"16:00:00":{"1293840000000":2081,"1293926400000":2003,"1294012800000":2009,"1294099200000":2001,"1294185600000":2011,"1294272000000":2098,"1294358400000":2051,"1294444800000":2092,"1294531200000":2029,"1294617600000":2073},"16:30:00":{"1293840000000":2015,"1293926400000":2095,"1294012800000":2094,"1294099200000":2042,"1294185600000":2061,"1294272000000":2006,"1294358400000":2042,"1294444800000":2004,"1294531200000":2099,"1294617600000":2088}}'


4 solutions

#1


9  

You can first stack the DataFrame to create a series and then index slice it as required and take the min. For example:


first, last = ('2011-01-07', datetime.time(15)), ('2011-01-10', datetime.time(13, 30))
df.stack().loc[first: last].min()

The result of df.stack is a Series with a MultiIndex where the inner level is composed of the original columns. We then slice using tuple pairs with the start and end date and times. If you're going to be doing lots of such operations then you should consider assigning df.stack() to some variable. You might then consider changing the index to a proper DatetimeIndex. Then you can work with both the time series and the grid format as required.

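A minimal sketch of the "stack once and convert the index" suggestion; the variable s and the use of datetime.datetime.combine are illustrative, not from the original answer:

import datetime
import pandas as pd

s = df.stack()  # Series with a (date, time) MultiIndex
# combine each (date, time) pair into a single timestamp
s.index = pd.DatetimeIndex(
    [datetime.datetime.combine(d.date(), t) for d, t in s.index]
)
# every window query is now a single label slice
s['2011-01-07 15:00:00':'2011-01-10 13:30:00'].min()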

Here's another method which avoids stacking and is a lot faster on DataFrames of the size you're actually working with, at least as a one-off (slicing the stacked DataFrame is a lot faster once it's stacked, so if you're doing many of these operations you should stack and convert the index).
It's less general, as it works with min and max but not with, say, mean. It takes the min of the relevant part of the first row, the min of the relevant part of the last row, and the min of the rows in between (if any), then takes the min of these three candidates.

first_row = df.index.get_loc(first[0])
last_row = df.index.get_loc(last[0])
if first_row == last_row:
    # the window starts and ends on the same day
    result = df.loc[first[0], first[1]: last[1]].min()
elif first_row < last_row:
    first_row_min = df.loc[first[0], first[1]:].min()  # first day, from the start time on
    last_row_min = df.loc[last[0], :last[1]].min()     # last day, up to the end time
    middle_min = df.iloc[first_row + 1:last_row].min().min()  # full days in between
    result = min(first_row_min, last_row_min, middle_min)
else:
    raise ValueError('first row must be <= last row')

Note that if first_row + 1 == last_row then middle_min is nan, but the result is still correct as long as middle_min doesn't come first in the call to min. For example, min(2.0, float('nan')) returns 2.0 while min(float('nan'), 2.0) returns nan, because min only replaces its current candidate when a later item compares strictly less, and every comparison with nan is False.

#2


6  

Take the following example; it is easier to understand.

|            | 13:00:00 | 13:30:00 | 14:00:00 | 14:30:00 | 15:00:00 | 15:30:00 | 16:00:00 | 16:30:00 | 
|------------|----------|----------|----------|----------|----------|----------|----------|----------| 
| 2011-01-01 | 2054     | 2071     | 2060     | 2054     | 2042     | 2064     | 2043     | 2089     | 
| 2011-01-02 | 2096     | 2038     | 2079     | 2052     | 2056     | 2092     | 2007     | 2008     | 
| 2011-01-03 | 2002     | 2083     | 2077     | 2087     | 2097     | 2079     | 2046     | 2078     | 
| 2011-01-04 | 2011     | 2063     | 2014     | 2094     | 2052     | 2041     | 2026     | 2077     | 
| 2011-01-05 | 2045     | 2056     | 2001     | 2061     | 2061     | 2061     | 2094     | 2068     | 
| 2011-01-06 | 2035     | 2043     | 2069     | 2006     | 2066     | 2067     | 2021     | 2012     | 
| 2011-01-07 | 2031     | 2036     | 2057     | 2043     | 2098     | 2010     | 2020     | 2016     | 
| 2011-01-08 | 2065     | 2025     | 2046     | 2024     | 2015     | 2011     | 2065     | 2013     | 
| 2011-01-09 | 2019     | 2036     | 2082     | 2009     | 2083     | 2009     | 2097     | 2046     | 
| 2011-01-10 | 2097     | 2060     | 2073     | 2003     | 2028     | 2012     | 2029     | 2011     | 

Let's say we want to find, for each row, the min of the window running from after 15:00 on 2011-01-07 to before 13:30 on 2011-01-10.

We can just fill the undesired data in the first and the last row with np.inf.

import numpy as np

df.loc["2011-01-07", :datetime.time(15, 0)] = np.inf
df.loc["2011-01-10", datetime.time(13, 30):] = np.inf
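(Note that this overwrites df in place; if the original values are still needed afterwards, presumably you would apply the masking to a copy, e.g. masked = df.copy().)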

you get


|            | 13:00:00 | 13:30:00 | 14:00:00 | 14:30:00 | 15:00:00 | 15:30:00 | 16:00:00 | 16:30:00 | 
|------------|----------|----------|----------|----------|----------|----------|----------|----------| 
| 2011-01-01 | 2054.0   | 2071.0   | 2060.0   | 2054.0   | 2042.0   | 2064.0   | 2043.0   | 2089.0   | 
| 2011-01-02 | 2096.0   | 2038.0   | 2079.0   | 2052.0   | 2056.0   | 2092.0   | 2007.0   | 2008.0   | 
| 2011-01-03 | 2002.0   | 2083.0   | 2077.0   | 2087.0   | 2097.0   | 2079.0   | 2046.0   | 2078.0   | 
| 2011-01-04 | 2011.0   | 2063.0   | 2014.0   | 2094.0   | 2052.0   | 2041.0   | 2026.0   | 2077.0   | 
| 2011-01-05 | 2045.0   | 2056.0   | 2001.0   | 2061.0   | 2061.0   | 2061.0   | 2094.0   | 2068.0   | 
| 2011-01-06 | 2035.0   | 2043.0   | 2069.0   | 2006.0   | 2066.0   | 2067.0   | 2021.0   | 2012.0   | 
| 2011-01-07 | inf      | inf      | inf      | inf      | inf      | 2010.0   | 2020.0   | 2016.0   | 
| 2011-01-08 | 2065.0   | 2025.0   | 2046.0   | 2024.0   | 2015.0   | 2011.0   | 2065.0   | 2013.0   | 
| 2011-01-09 | 2019.0   | 2036.0   | 2082.0   | 2009.0   | 2083.0   | 2009.0   | 2097.0   | 2046.0   | 
| 2011-01-10 | 2097.0   | inf      | inf      | inf      | inf      | inf      | inf      | inf      | 

In order to get the result:


df.loc["2011-01-07": "2011-01-10", :].idxmin(axis=1)

2011-01-07    15:30:00
2011-01-08    15:30:00
2011-01-09    14:30:00
2011-01-10    13:00:00
Freq: D, dtype: object
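If you want the minimum values themselves rather than the column labels, presumably min(axis=1) in place of idxmin(axis=1) on the same masked frame returns them, e.g. df.loc["2011-01-07":"2011-01-10", :].min(axis=1).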

#3


6  

A hacky way, but one that should be fast, is to concat the shifted DataFrames:

In [11]: df.shift(1)
Out[11]:
            13:00:00  13:30:00  14:00:00  14:30:00  15:00:00  15:30:00  16:00:00  16:30:00
2011-01-01       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
2011-01-02      2054      2071      2060      2054      2042      2064      2043      2089
2011-01-03      2096      2038      2079      2052      2056      2092      2007      2008
2011-01-04      2002      2083      2077      2087      2097      2079      2046      2078
2011-01-05      2011      2063      2014      2094      2052      2041      2026      2077
2011-01-06      2045      2056      2001      2061      2061      2061      2094      2068
2011-01-07      2035      2043      2069      2006      2066      2067      2021      2012
2011-01-08      2031      2036      2057      2043      2098      2010      2020      2016
2011-01-09      2065      2025      2046      2024      2015      2011      2065      2013
2011-01-10      2019      2036      2082      2009      2083      2009      2097      2046

In [12]: df.shift(2).iloc[:, 4:]
Out[12]:
            15:00:00  15:30:00  16:00:00  16:30:00
2011-01-01       NaN       NaN       NaN       NaN
2011-01-02       NaN       NaN       NaN       NaN
2011-01-03      2042      2064      2043      2089
2011-01-04      2056      2092      2007      2008
2011-01-05      2097      2079      2046      2078
2011-01-06      2052      2041      2026      2077
2011-01-07      2061      2061      2094      2068
2011-01-08      2066      2067      2021      2012
2011-01-09      2098      2010      2020      2016
2011-01-10      2015      2011      2065      2013

In [13]: pd.concat([df.iloc[:, :1], df.shift(1), df.shift(2).iloc[:, 4:]], axis=1)
Out[13]:
            13:00:00  13:00:00  13:30:00  14:00:00  14:30:00  15:00:00  15:30:00  16:00:00  16:30:00  15:00:00  15:30:00  16:00:00  16:30:00
2011-01-01      2054       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
2011-01-02      2096      2054      2071      2060      2054      2042      2064      2043      2089       NaN       NaN       NaN       NaN
2011-01-03      2002      2096      2038      2079      2052      2056      2092      2007      2008      2042      2064      2043      2089
2011-01-04      2011      2002      2083      2077      2087      2097      2079      2046      2078      2056      2092      2007      2008
2011-01-05      2045      2011      2063      2014      2094      2052      2041      2026      2077      2097      2079      2046      2078
2011-01-06      2035      2045      2056      2001      2061      2061      2061      2094      2068      2052      2041      2026      2077
2011-01-07      2031      2035      2043      2069      2006      2066      2067      2021      2012      2061      2061      2094      2068
2011-01-08      2065      2031      2036      2057      2043      2098      2010      2020      2016      2066      2067      2021      2012
2011-01-09      2019      2065      2025      2046      2024      2015      2011      2065      2013      2098      2010      2020      2016
2011-01-10      2097      2019      2036      2082      2009      2083      2009      2097      2046      2015      2011      2065      2013

and take the minimum across the columns (making sure you discard the columns which are too early or too late on a given day):

In [14]: pd.concat([df.iloc[:, :1], df.shift(1), df.shift(2).iloc[:, 4:]], axis=1).min(1)
Out[14]:
2011-01-01    2054
2011-01-02    2042
2011-01-03    2002
2011-01-04    2002
2011-01-05    2011
2011-01-06    2001
2011-01-07    2006
2011-01-08    2010
2011-01-09    2010
2011-01-10    2009
Freq: D, dtype: float64

You can do this more efficiently, but more noisily, by taking the minimum of each shifted DataFrame before concatting:


In [21]: pd.concat([df.iloc[:, :1].min(1),
                    df.shift(1).min(1),
                    df.shift(2).iloc[:, 4:].min(1)],
                   axis=1).min(1)
Out[21]:
2011-01-01    2054
2011-01-02    2042
2011-01-03    2002
2011-01-04    2002
2011-01-05    2011
2011-01-06    2001
2011-01-07    2006
2011-01-08    2010
2011-01-09    2010
2011-01-10    2009
Freq: D, dtype: float64

Either will be significantly faster than looping through days.

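The In-blocks above reach back two days; for the question's exact window (3 days ago at 15:00 through today at 13:30), the same pattern presumably extends with one more shift. A sketch, assuming the eight half-hour columns of the sample df (so iloc[:, :2] is 13:00 through 13:30 and iloc[:, 4:] is 15:00 onward):

import pandas as pd

window_min = pd.concat(
    [df.iloc[:, :2].min(1),            # today, 13:00 through 13:30
     df.shift(1).min(1),               # yesterday, all columns
     df.shift(2).min(1),               # two days ago, all columns
     df.shift(3).iloc[:, 4:].min(1)],  # three days ago, 15:00 onward
    axis=1,
).min(1)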

#4


5  

I used pandas' stack() method and a time series object to build the result from the sample data. This approach generalizes well to any arbitrary time range with a few adjustments, and uses pandas' built-in functionality to build the result.

import pandas as pd
import datetime as dt
# import df from json
df = pd.read_json('''{"13:00:00":     {"1293840000000":2085,"1293926400000":2062,"1294012800000":2035,"1294099200000":2086,"1294185600000":2006,"1294272000000":2097,"1294358400000":2078,"1294444800000":2055,"1294531200000":2023,"1294617600000":2024},
                      "13:30:00":{"1293840000000":2045,"1293926400000":2039,"1294012800000":2035,"1294099200000":2045,"1294185600000":2025,"1294272000000":2099,"1294358400000":2028,"1294444800000":2028,"1294531200000":2034,"1294617600000":2010},
                      "14:00:00":{"1293840000000":2095,"1293926400000":2006,"1294012800000":2001,"1294099200000":2032,"1294185600000":2022,"1294272000000":2040,"1294358400000":2024,"1294444800000":2070,"1294531200000":2081,"1294617600000":2095},
                      "14:30:00":{"1293840000000":2057,"1293926400000":2042,"1294012800000":2018,"1294099200000":2023,"1294185600000":2025,"1294272000000":2016,"1294358400000":2066,"1294444800000":2041,"1294531200000":2098,"1294617600000":2023},
                      "15:00:00":{"1293840000000":2082,"1293926400000":2025,"1294012800000":2040,"1294099200000":2061,"1294185600000":2013,"1294272000000":2063,"1294358400000":2024,"1294444800000":2036,"1294531200000":2096,"1294617600000":2068},
                      "15:30:00":{"1293840000000":2090,"1293926400000":2084,"1294012800000":2092,"1294099200000":2003,"1294185600000":2001,"1294272000000":2049,"1294358400000":2066,"1294444800000":2082,"1294531200000":2090,"1294617600000":2005},
                      "16:00:00":{"1293840000000":2081,"1293926400000":2003,"1294012800000":2009,"1294099200000":2001,"1294185600000":2011,"1294272000000":2098,"1294358400000":2051,"1294444800000":2092,"1294531200000":2029,"1294617600000":2073},
                      "16:30:00":{"1293840000000":2015,"1293926400000":2095,"1294012800000":2094,"1294099200000":2042,"1294185600000":2061,"1294272000000":2006,"1294358400000":2042,"1294444800000":2004,"1294531200000":2099,"1294617600000":2088}}
                   '''#,convert_axes=False
                    )
date_idx = df.index
# stack the data
stacked = df.stack()
# merge the (date, time) MultiIndex into a single datetime index
idx = [dt.datetime(day.year, day.month, day.day, time.hour, time.minute)
       for day, time in stacked.index.tolist()]
# make a time series to simplify slicing
timeseries = pd.Series(stacked.values, index=idx)
# get the results for each date
for i in range(2, len(date_idx)):
    # get the min values for each day in the sample data
    start_time = '%s 15:00:00' % date_idx[i - 2]
    end_time = '%s 13:30:00' % date_idx[i]
    slice_idx = (timeseries.index >= start_time) & (timeseries.index <= end_time)
    print("%s %s" % (date_idx[i].date(), timeseries[slice_idx].min()))

output:


2011-01-03 2003
2011-01-04 2001
2011-01-05 2001
2011-01-06 2001
2011-01-07 2001
2011-01-08 2006
2011-01-09 2004
2011-01-10 2004
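To reuse the loop for a different window, presumably only the day offset and the two clock times need to change; days_back, start_clock, and end_clock below are names introduced for illustration (days_back = 2 reproduces the loop above):

days_back, start_clock, end_clock = 2, '15:00:00', '13:30:00'
for i in range(days_back, len(date_idx)):
    start_time = '%s %s' % (date_idx[i - days_back], start_clock)
    end_time = '%s %s' % (date_idx[i], end_clock)
    mask = (timeseries.index >= start_time) & (timeseries.index <= end_time)
    print("%s %s" % (date_idx[i].date(), timeseries[mask].min()))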
