Python pandas: trying to reduce reliance on loops

Date: 2021-02-06 15:50:22

This is a general question, but I will use an example to help ask it. I have a dataframe (df) where df[col_1] contains only True or False values. In df[col_2], I would like to return True or False depending on whether the prior 5 rows of column 1 ( df[col_1][i-6:i-1] ) contain a match for df[col_1][i].

This is the loop I am using now, but it is one of many, so I suspect they are slowing things down as the data grows.

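# col_1 and col_2 are variables holding the column names; df has a default integer index
for i in df.index:
  if i < 6:
    df.loc[i, col_2] = 0.
  else:
    # .loc replaces the removed .ix indexer and avoids chained assignment
    df.loc[i, col_2] = df.loc[i, col_1] not in tuple(df.loc[i-6:i-1, col_1])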

Should look like this:

.   col_1   col_2
0   TRUE    
1   TRUE    
2   TRUE    
3   TRUE    
4   FALSE   
5   FALSE   FALSE
6   FALSE   FALSE
7   FALSE   FALSE
8   FALSE   FALSE
9   TRUE    TRUE
10  FALSE   FALSE
11  FALSE   FALSE
12  FALSE   FALSE
13  FALSE   FALSE
14  TRUE    FALSE
15  TRUE    FALSE
16  TRUE    FALSE
17  TRUE    FALSE
18  TRUE    FALSE
19  TRUE    FALSE
20  FALSE   TRUE

I am wondering if there is a way to do something clever (or basic) with pandas to make use of vectorization - maybe using shift or an offset function?

I hope I haven't missed an answer that already exists - I wasn't exactly sure how to phrase the question. Thanks in advance.

1 Answer

#1


Here's a simple vectorized solution that should be pretty fast, although there is probably a more elegant way to write it. The first 5 rows do not have a full 5-row history, so you can just ignore them or overwrite them with NaN if you prefer.

import pandas as pd

df = pd.DataFrame({ 'col_1':[True,True,True,True,False,False,False,False,
                             False,True,False,False,False,False,True,True,
                             True,True,True,True,False] })

# True where the current value differs from each of the previous 5 rows
s = df['col_1']
df['col_2'] = ((s!=s.shift(1)) & (s!=s.shift(2)) & (s!=s.shift(3)) &
               (s!=s.shift(4)) & (s!=s.shift(5)))
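
The chained shift comparisons above can be generalized to any window length. The following is a minimal sketch, not part of the original answer: the helper name `differs_from_previous` and the parameter `n` are made up for illustration, and it assumes a boolean Series with no missing values.

```python
from functools import reduce
import operator

import pandas as pd


def differs_from_previous(s: pd.Series, n: int) -> pd.Series:
    """True where s[i] differs from each of the n values before it."""
    # One boolean mask per lag, combined with a logical AND.
    masks = [s != s.shift(k) for k in range(1, n + 1)]
    result = reduce(operator.and_, masks)
    # The first n rows lack a full history (they compare against NaN),
    # so mark them False rather than trusting the comparison.
    result.iloc[:n] = False
    return result


# Same sample data as in the answer: 21 rows, index 0..20.
col_1 = pd.Series([True] * 4 + [False] * 5 + [True] + [False] * 4
                  + [True] * 6 + [False])
flags = differs_from_previous(col_1, 5)
print(flags[flags].index.tolist())
```

With the sample data, only rows 9 and 20 are flagged, which agrees with the expected col_2 in the question.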

If speed really matters, you could do something like the following. It's more than 3x faster than the above and probably about as efficient as you can get here. This just uses the fact that booleans sum as 0/1, so you only need to know whether the rolling sum over the previous 5 rows is 0 or 5.

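# pd.rolling_sum() was removed from later pandas releases; Series.rolling() replaces it
s = df.col_1.astype(int)
# 6-row window sum minus the current row = number of True values in the previous 5 rows
df['rollsum'] = s.rolling(6).sum() - s
df['col_3'] = ( (df.col_1 & (df.rollsum == 0))
              | (~df.col_1 & (df.rollsum == 5)) )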

    col_1  col_2  rollsum  col_3
0    True   True      NaN  False
1    True  False      NaN  False
2    True  False      NaN  False
3    True  False      NaN  False
4   False   True      NaN  False
5   False  False        4  False
6   False  False        3  False
7   False  False        2  False
8   False  False        1  False
9    True   True        0   True
10  False  False        1  False
11  False  False        1  False
12  False  False        1  False
13  False  False        1  False
14   True  False        1  False
15   True  False        1  False
16   True  False        2  False
17   True  False        3  False
18   True  False        4  False
19   True  False        5  False
20  False   True        5   True
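
For reference, the rolling-sum idea can be wrapped in a small function that works for any window length. This is only a sketch under assumptions: the function name `no_match_in_previous` and the parameter `n` are hypothetical, and it uses `Series.shift` plus `Series.rolling(...).sum()` so that the window covers the n rows before the current one instead of subtracting the current row afterwards.

```python
import pandas as pd


def no_match_in_previous(s: pd.Series, n: int) -> pd.Series:
    """True where none of the n preceding values equals the current value."""
    s = s.astype(bool)
    # Number of True values among the n rows before each position.
    # Rows without n predecessors get NaN here.
    prev_true = s.astype(int).shift(1).rolling(n).sum()
    # Current True and no True before it, or current False and all True before it.
    # NaN compares unequal to both 0 and n, so incomplete windows yield False.
    return (s & (prev_true == 0)) | (~s & (prev_true == n))


# Same sample data as in the answer: 21 rows, index 0..20.
col_1 = pd.Series([True] * 4 + [False] * 5 + [True] + [False] * 4
                  + [True] * 6 + [False])
result = no_match_in_previous(col_1, 5)
print(result[result].index.tolist())
```

Shifting before rolling keeps the current row out of the window, so no correction term is needed, and the early rows come out as False just like col_3 in the table above.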
