I'm trying to optimize my code a bit. A single call is pretty fast, but since it is called very often, it adds up and becomes a bottleneck.
My input data looks like this:
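import numpy as np
import pandas as pd
# pd.datetime has been removed from recent pandas; a date string works instead
df = pd.DataFrame(data=np.random.randn(30),
                  index=pd.date_range("2016-01-01", periods=30))
df.iloc[:20] = np.nan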
Now I just want to apply a simple function. This is the part I want to optimize:
s = df >= df.shift(1)
s = s.applymap(lambda x: 1 if x else 0)
Right now I'm getting 1000 loops, best of 3: 1.36 ms per loop. I guess it should be possible to do this much faster. I'm not sure whether I should vectorize, work only with NumPy, or maybe use Cython. Any ideas for the best approach? I struggle a bit with the shift operation.
3 answers
#1
1
You can cast the result of your comparison directly from bool to int:
(df >= df.shift(1)).astype(int)
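To check that the cast really gives the same flags as the element-wise lambda from the question, here is a minimal sketch. The seeded generator and the names `flags` and `expected` are my own assumptions for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

# Same shape of input as in the question, but seeded so the run is repeatable
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal(30),
                  index=pd.date_range("2016-01-01", periods=30))
df.iloc[:20] = np.nan

# Vectorized: compare each row with the previous one, then cast bool -> int
flags = (df >= df.shift(1)).astype(int)

# Plain-Python reference: the first row has no predecessor, so it is 0;
# any comparison involving NaN evaluates to False and therefore also gives 0
vals = df[0].to_numpy()
expected = [0] + [int(cur >= prev) for prev, cur in zip(vals[:-1], vals[1:])]

print(flags[0].tolist() == expected)
```

Because comparisons with NaN are always False, the leading NaN rows (and the first valid row, whose predecessor is NaN) all come out as 0.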
#2
0
@Paul H's answer is good, performant, and what I'd generally recommend.
That said, if you want to squeeze out every last bit of performance, this is a decent candidate for numba, which you can use to compute the answer in a single pass over the data.
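import numpy as np
from numba import njit

@njit
def do_calc(arr):
    N = arr.shape[0]
    ans = np.empty(N, dtype=np.int_)
    # The first element has no predecessor, so its flag is 0
    ans[0] = 0
    for i in range(1, N):
        # Use >= to match the pandas expression (df >= df.shift(1))
        ans[i] = 1 if arr[i] >= arr[i-1] else 0
    return ans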
a = (df >= df.shift(1)).astype(int)
b = pd.DataFrame(pd.Series(do_calc(df[0].values), df[0].index))
from pandas.testing import assert_frame_equal
assert_frame_equal(a, b)
Here are the timings:
In [45]: %timeit b = pd.DataFrame(pd.Series(do_calc(df[0].values), df[0].index))
135 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [46]: %timeit a = (df >= df.shift(1)).astype(int)
762 µs ± 22.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
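If you want to reproduce this kind of comparison outside IPython, something like the following sketch with the standard-library `timeit` module should work. The helper `best_time_us` and the two wrapper functions are hypothetical names I made up for this example, and the absolute numbers will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(30),
                  index=pd.date_range("2016-01-01", periods=30))
df.iloc[:20] = np.nan

def pandas_version():
    # The one-liner from the first answer
    return (df >= df.shift(1)).astype(int)

def numpy_version():
    # Raw array comparison only, without rebuilding a DataFrame
    values = df.to_numpy()
    return (values[1:] >= values[:-1]).astype(int)

def best_time_us(fn, number=100, repeat=3):
    """Best average time per call in microseconds (hypothetical helper)."""
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    return min(totals) / number * 1e6

results = {fn.__name__: best_time_us(fn) for fn in (pandas_version, numpy_version)}
for name, micros in results.items():
    print(f"{name}: {micros:.1f} µs per call")
```

Taking the minimum over several repeats mirrors the "best of 3" style that `%timeit` used to report.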
#3
0
That's my current best solution:
values = df.values[1:] >= df.values[:-1]
data = np.array(values, dtype=int)
s = pd.DataFrame(data, df.index[1:])
I'm getting 10000 loops, best of 3: 125 µs per loop, roughly a 10x improvement. But I think it could be done even faster.
PS: This solution isn't exactly correct, since the first row (the zero / NaN case) is missing. PPS: That can be corrected with pd.DataFrame(np.append([[0]], data), df.index).
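For what it's worth, the missing first row can also be handled by preallocating the output instead of appending to it. This is only a sketch under my own assumptions: the function name `step_up_flags` is hypothetical, and I have not benchmarked it against the versions above.

```python
import numpy as np
import pandas as pd

def step_up_flags(frame):
    """Return 1 where a value is >= the previous row's value, else 0."""
    values = frame.to_numpy()
    # Start with all zeros so the first row (no predecessor) stays 0
    out = np.zeros(values.shape, dtype=int)
    # Compare rows 1..N-1 against rows 0..N-2; NaN comparisons are False
    out[1:] = values[1:] >= values[:-1]
    return pd.DataFrame(out, index=frame.index, columns=frame.columns)

df = pd.DataFrame(np.random.randn(30),
                  index=pd.date_range("2016-01-01", periods=30))
df.iloc[:20] = np.nan

result = step_up_flags(df)
reference = (df >= df.shift(1)).astype(int)
print((result.to_numpy() == reference.to_numpy()).all())
```

Writing into a zero-filled array keeps the full index of the input and also works unchanged if the frame has more than one column.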