Pandas DataFrame performance when applying shift

Time: 2022-08-25 19:33:54

I'm trying to optimize my code a bit. A single call is pretty fast, but since it is called very often it becomes a problem.


My input data looks like this:


import numpy as np
import pandas as pd

# 30 random values on a daily DatetimeIndex; the first 20 rows are set to NaN
df = pd.DataFrame(data=np.random.randn(30),
                  index=pd.date_range("2016-01-01", periods=30))
df.iloc[:20] = np.nan

Now I just want to apply a simple function. Here is the part I want to optimize:


# compare each value with the previous row, then map True/False to 1/0
s = df >= df.shift(1)
s = s.applymap(lambda x: 1 if x else 0)

Right now I'm getting 1000 loops, best of 3: 1.36 ms per loop. I guess it should be possible to do this much faster. I'm not sure whether I should vectorize, work only with NumPy, or maybe use Cython. Any ideas for the best approach? I struggle a bit with the shift operator.

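For reference (not part of the original question), here is a minimal illustration of what shift(1) does: it pushes every value down one row and puts NaN in the first row, so df >= df.shift(1) compares each value with the one before it. The small frame named example below is just a toy, not the df above:

example = pd.DataFrame({"x": [1.0, 3.0, 2.0]})
# shift(1): row i of the shifted column holds the value from row i-1; row 0 becomes NaN
print(example["x"].shift(1).tolist())                    # [nan, 1.0, 3.0]
print((example["x"] >= example["x"].shift(1)).tolist())  # [False, True, False]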

3 Answers

#1 (score: 1)

You can cast the result of your comparison directly from bool to int:


(df >= df.shift(1)).astype(int)
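As a quick sanity check (not part of the original answer), the cast produces the same 0/1 frame as the question's applymap version; NaN rows compare as False and therefore become 0 in both:

a = (df >= df.shift(1)).astype(int)
b = (df >= df.shift(1)).applymap(lambda x: 1 if x else 0)
assert a.equals(b)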

#2 (score: 0)

@Paul H's answer is good, performant and what I'd generally recommend.


That said, if you want to squeeze out every last bit of performance, this is a decent candidate for numba, which you can use to compute the answer in a single pass over the data.


import numpy as np
from numba import njit

@njit
def do_calc(arr):
    # single pass: 1 where the current value is >= the previous one, else 0
    N = arr.shape[0]
    ans = np.empty(N, dtype=np.int_)
    ans[0] = 0  # first row has no predecessor, matching the NaN behaviour of shift
    for i in range(1, N):
        ans[i] = 1 if arr[i] >= arr[i-1] else 0
    return ans

a = (df >= df.shift(1)).astype(int)
b = pd.DataFrame(pd.Series(do_calc(df[0].values), df[0].index))

from pandas.testing import assert_frame_equal
assert_frame_equal(a, b)

Here are the timings:


In [45]: %timeit b = pd.DataFrame(pd.Series(do_calc(df[0].values), df[0].index))
135 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [46]: %timeit a = (df >= df.shift(1)).astype(int)
762 µs ± 22.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#3 (score: 0)

That's my current best solution:


# compare each value with the previous one directly on the underlying NumPy arrays
values = df.values[1:] >= df.values[:-1]
data = np.array(values, dtype=int)
s = pd.DataFrame(data, df.index[1:])

I'm getting 10000 loops, best of 3: 125 µs per loop, a 10x improvement. But I think it could be done even faster.


PS: this solution isn't exactly correct, since the first zero / NaN row is missing. PPS: that can be corrected with pd.DataFrame(np.append([[0]], data), df.index).

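Putting the PPS correction together, here is a sketch of the complete fixed version (assuming the df from the question); prepending a 0 for the first row, which has no predecessor, makes the result cover all of df.index and match (df >= df.shift(1)).astype(int):

values = df.values[1:] >= df.values[:-1]
data = np.array(values, dtype=int)
# prepend a 0 for the first row so the frame lines up with df.index
s = pd.DataFrame(np.append([[0]], data), df.index)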
