Use two columns from the previous row to determine a column value in a pandas DataFrame

I need to calculate a value for each row in a pandas DataFrame by comparing two columns with the values of the same columns in the previous row. I was able to do this using iloc, but it takes a really long time on more than 100K rows.

I tried using lambda, but it seems to return only one row or one column at a time, so I can't use it to compare multiple columns and rows at once.

In this example, I subtract the value of 'b' for the previous row from the value of 'b' for the current row, but only if the value of 'a' is the same for both rows.

This is the code I've been using:

import pandas as pd
df = pd.DataFrame({'a': ['a', 'a', 'b', 'b', 'b'], 'b': [1, 2, 3, 4, 5]})

df['increase'] = 0
for row in range(len(df)):
    if row > 0:
        if df.iloc[row]['a'] == df.iloc[row - 1]['a']:
            df.iloc[row, 2] = df.iloc[row]['b'] - df.iloc[row - 1]['b']

Is there a faster way to do the same calculation?

Thanks.

2 Solutions

#1

IIUC, you can use groupby + diff:

df.groupby('a').b.diff().fillna(0)   
Out[193]: 
0    0.0
1    1.0
2    0.0
3    1.0
4    1.0
Name: b, dtype: float64

Then assign the result back:

df['increase']=df.groupby('a').b.diff().fillna(0)
df
Out[198]: 
   a  b  increase
0  a  1       0.0
1  a  2       1.0
2  b  3       0.0
3  b  4       1.0
4  b  5       1.0
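
A side note, as a sketch rather than part of the answer above: groupby + diff takes differences within each group, so it also pairs up rows of the same group that are not next to each other. The loop in the question only ever compares a row with the one directly before it. If that adjacent-row behavior is what is wanted, one possible alternative is shift + where. The variable name same_as_prev is just illustrative, and the astype(int) at the end is an assumption that an integer column is preferred.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'b', 'b', 'b'], 'b': [1, 2, 3, 4, 5]})

# True where the current row's 'a' equals the 'a' of the row directly above it.
# The first row compares against a missing value, so it comes out False.
same_as_prev = df['a'].eq(df['a'].shift())

# Row-to-row difference of 'b', kept only where 'a' did not change, 0 elsewhere.
# No missing values remain after where(), so the cast to int is safe here.
df['increase'] = df['b'].diff().where(same_as_prev, 0).astype(int)

print(df)
```

On the sample frame this produces the same values as the groupby version, because each group's rows are contiguous there.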

#2

Here is one solution:

df['increase'] = [0] + [(d - c) if a == b else 0 for a, b, c, d in \
                        zip(df.a, df.a[1:], df.b, df.b[1:])]

Some benchmarking against @Wen's pandonic (idiomatic pandas) solution:

df = pd.DataFrame({'a':['a','a','b','b','b']*20000,'b':[1,2,3,4,5]*20000})

%timeit [0] + [(d - c) if a == b else 0 for a, b, c, d in zip(df.a, df.a[1:], df.b, df.b[1:])]
# 51.6 ms ± 898 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.groupby('a').b.diff().fillna(0)
# 37.8 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
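
One caveat worth checking before reading too much into these timings: on this benchmark frame the two expressions do not return the same values. Repeating the five-row pattern 20000 times makes the groups non-contiguous, so the list comprehension (which compares adjacent rows) and groupby + diff (which compares with the previous row of the same group) disagree at each block boundary. A small sketch to see this; the names list_result, groupby_result and mismatch are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'b', 'b', 'b'] * 20000,
                   'b': [1, 2, 3, 4, 5] * 20000})

# Adjacent-row comparison, as in the list-comprehension answer
list_result = [0] + [(d - c) if a == b else 0
                     for a, b, c, d in zip(df.a, df.a[1:], df.b, df.b[1:])]

# Within-group difference, as in the groupby answer
groupby_result = df.groupby('a').b.diff().fillna(0)

# Row 5 is an 'a' row that follows a 'b' row:
# the list version gives 0, the groupby version gives 1 - 2 = -1
print(list_result[5], groupby_result.iloc[5])

# Count how many rows differ between the two approaches
mismatch = groupby_result.ne(pd.Series(list_result, index=df.index))
print(mismatch.sum())
```

Which result is the right one depends on whether rows with the same 'a' are guaranteed to be adjacent in the real data.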
