What is the most efficient way to loop through dataframes with pandas?

Time: 2022-11-11 22:57:10

I want to perform my own complex operations on financial data in dataframes in a sequential manner.

For example, I am using the following MSFT CSV file taken from Yahoo Finance:

Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27

....

I then do the following:

#!/usr/bin/env python
import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')

for i, row in enumerate(df.values):
    date = df.index[i]
    # with Date as the index, six columns remain in each row
    open_, high, low, close, volume, adjclose = row
    # now perform analysis on open/close based on date, etc.

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a way that also retrieves the index (possibly through a generator, to be memory efficient)? df.iteritems unfortunately only iterates column by column.

10 Answers

#1


280  

The newest versions of pandas now include a built-in function for iterating over rows.

for index, row in df.iterrows():
    # do some logic here

Or, if you want it to be faster, use itertuples():

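For example, a minimal sketch (assuming the Close column from the question's CSV; itertuples() yields namedtuples, so named columns are accessed as attributes):

for row in df.itertuples():
    # row.Index is the row's index value; named columns become attributes
    print(row.Index, row.Close)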

But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.

#2


129  

Pandas is based on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array, and you want the day-over-day percent change,

pct_change = close[1:] / close[:-1] - 1

This computes the entire array of percent changes as one statement, instead of

pct_change = []
for row in close:
    pct_change.append(...)

So try to avoid the Python loop for i, row in enumerate(...) entirely, and think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.

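The same idea carries over to pandas itself. A minimal sketch, assuming the Close column from the question's CSV:

import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')

# one vectorized statement over the whole column, no Python-level loop;
# equivalent to close[t] / close[t-1] - 1 for every row
pct_change = df['Close'].pct_change()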

#3


69  

You can loop through the rows by transposing and then calling iteritems:

for date, row in df.T.iteritems():
    # do some logic here

I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:

# Cython (.pyx file); the array types come from numpy's Cython declarations
from numpy cimport ndarray, float64_t

def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)

    for i in range(n):  # compiled down to a plain C loop
        foo = close[i] - open[i]  # will be extremely fast

I would recommend writing the algorithm in pure Python first; make sure it works and see how fast it is. If it's not fast enough, convert things to Cython like this with minimal work to get something that's about as fast as hand-coded C/C++.

#4


64  

Like what has been mentioned before, pandas objects are most efficient when processing the whole array at once. However, for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I ran a short test to see which of the three is the least time-consuming.

import time

import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []

C = []
A = time.time()
for i, r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time() - A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(time.time() - A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time() - A)

print(B)

Result:

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

This is probably not the best way to measure the time consumption but it's quick for me.

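For a more reliable measurement, the standard-library timeit module is the usual tool. A minimal sketch of timing the zip variant this way:

import timeit

setup = ("import pandas as pd; "
         "t = pd.DataFrame({'a': range(10000), 'b': range(10000, 20000)})")

# best of 5 repeats, 10 loops each, in seconds
print(min(timeit.repeat("list(zip(t['a'], t['b']))",
                        setup=setup, repeat=5, number=10)))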

Here are some pros and cons IMHO:

  • .iterrows(): returns the index and the row as separate variables, but is significantly slower
  • .itertuples(): faster than .iterrows(), but returns the index together with the row items; ir[0] is the index
  • zip: quickest, but gives no access to the row index (see the sketch after this list for a workaround)
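
A minimal sketch of working around that last limitation, zipping the index of the test frame t in alongside its columns:

for idx, a, b in zip(t.index, t['a'], t['b']):
    # idx is the row label; a and b are the column values
    pass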

#5


22  

I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1...) tuples.

There's also iterkv, which iterates through (column, series) tuples.

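Note that iterkv has long since been removed from pandas; in current versions the equivalent column-wise iteration is items(). A quick sketch:

for col_name, series in df.items():
    # col_name is the column label; series is the whole column as a Series
    print(col_name, series.dtype)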

#6


17  

Just as a small addition, you can also use apply if you have a complex function to apply to a single column:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html

df['b'] = df['a'].apply(lambda x: x * 2)  # the lambda receives each value of column 'a' in turn

#7


7  

You have three options:

By index (simplest):

>>> for index in df.index:
...     print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))

With iterrows (most used):

>>> for index, row in df.iterrows():
...     print ("df[" + str(index) + "]['B']=" + str(row['B']))

With itertuples (fastest):

>>> for row in df.itertuples():
...     print ("df[" + str(row.Index) + "]['B']=" + str(row.B))

All three options display something like:

df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12

Source: neural-networks.io

#8


3  

As @joris pointed out, iterrows is much slower than itertuples: itertuples is approximately 100 times faster. I tested the speed of both methods on a DataFrame with 5,027,505 records; iterrows ran at 1,200 it/s and itertuples at 120,000 it/s.

If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column, you can refer to the following example code:

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
...                   index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1 0.1
2 0.2

#9


2  

Another suggestion would be to combine groupby with vectorized calculations, if subsets of the rows share characteristics that allow you to do so.

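A minimal sketch of that idea, assuming a hypothetical ticker column to group on:

import pandas as pd

df = pd.DataFrame({'ticker': ['MSFT', 'MSFT', 'AAPL', 'AAPL'],  # hypothetical grouping column
                   'close': [27.13, 27.31, 400.00, 402.50]})

# vectorized within each group: mean close per ticker, no explicit loop
mean_close = df.groupby('ticker')['close'].mean()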

#10


0  

For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray, either via df.values (as you do) or by accessing each column separately via df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.

index = df.index.values
column_of_interest_1 = df.column_name_1.values
...
column_of_interest_k = df.column_name_k.values

for i in range(df.shape[0]):
    index_value = index[i]
    ...
    column_value_k = column_of_interest_k[i]

Not pythonic? Sure. But fast.

If you want to squeeze more juice out of the loop, you will want to look into Cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance, check out Cython's memory views.
