The most efficient and most Pythonic way to create a NumPy array in a loop

Time: 2021-02-22 20:22:20

I'm currently trying to figure out the most efficient way to create a numpy array in a loop. Here are two examples:

import numpy as np
from time import time

# Version 1: preallocate the array, then fill it element by element
tic = time()
my_list = range(1000000)
a = np.zeros((len(my_list),))
for i in my_list:
    a[i] = i
toc = time()
print(toc-tic)

vs

tic = time()
a = []
my_list = range(1000000)
for i in my_list:
    a.append(i)
a = np.array(a)
toc = time()

print(toc-tic)

I expected the second version to be much slower than the first, because the list needs new memory at each step of the for loop. However, the two take roughly the same time, and I was wondering why. This is just out of curiosity, since either approach works for me.
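One likely part of the explanation, assuming the CPython interpreter: a list over-allocates its internal buffer when it grows, so most `append` calls do not request new memory at all. Both loops are then probably dominated by the per-iteration cost of the Python loop itself. The sketch below is only an illustration. It counts how often the list's allocated size (as reported by `sys.getsizeof`) changes during the appends; the smaller `n` is an assumption made to keep it quick.

```python
import sys
import numpy as np

n = 100_000  # smaller than the question's 1,000,000 to keep the sketch quick

# Count how often the list's underlying buffer actually grows during n appends.
values = []
resizes = 0
last_size = sys.getsizeof(values)
for i in range(n):
    values.append(i)
    size = sys.getsizeof(values)
    if size != last_size:  # allocated size changed, so the buffer was regrown
        resizes += 1
        last_size = size

print(f"{n} appends triggered {resizes} reallocations")

# When the values are a plain range, no Python-level loop is needed at all.
a = np.arange(n, dtype=float)
```

The number of reallocations should be far smaller than the number of appends. When the values can be produced by a vectorized call such as `np.arange`, that usually avoids the loop cost entirely.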

What I actually want is to build a simple numpy array from data extracted from a dataframe, and my current code looks quite messy. I was wondering if there is a more pythonic way to do it. I have this dataframe and a list of the labels I need, and the simplest idea would be the following (the value I need is the last one in each column):

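import pandas as pd

vars_outputs = ["x1", "x2", "ratio_x1_x2"]
my_df = pd.read_excel(path)
# .iloc[-1] selects the last row by position; plain [-1] would look up a column labelled -1
outpts = np.array(my_df[vars_outputs].iloc[-1])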

However, this is not possible because some of the labels I want are not directly available in the dataframe: for example, ratio_x1_x2 needs to be computed from the first two columns. So I added a dict that maps each missing label to the columns used to compute it (it is always a ratio):

missing_labels = {"ratio_x1_x2" : ["x1", "x2"]}

Then I check the condition for each label and create the numpy array (hence the earlier question about efficiency):

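outpts = []
for var in vars_outputs:
    if var in missing_labels:
        # derived label: ratio of the last values of its two source columns
        num, den = missing_labels[var]
        outpts.append(my_df[num].iloc[-1] / my_df[den].iloc[-1])
    else:
        # .iloc[-1] takes the last value by position, whatever the index labels are
        outpts.append(my_df[var].iloc[-1])
outpts = np.array(outpts)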

This seems way too complicated to me, but I cannot think of an easier way to do it (especially because I need this specific order in my numpy output array).

The other idea I have is to add columns to the dataframe holding the operations I want. But because there are roughly 8000 labels, I don't know if that is the best approach, because I would have to go through all these labels after this preprocessing step.
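A possible middle ground, sketched here under assumptions: since only the last value of each column is needed, the derived labels could be added to the last row (a Series) rather than to the whole dataframe. The small in-memory frame is a hypothetical stand-in for `pd.read_excel(path)`, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel(path); the values are invented.
my_df = pd.DataFrame({"x1": [1.0, 2.0, 6.0], "x2": [4.0, 5.0, 3.0]})

vars_outputs = ["x1", "x2", "ratio_x1_x2"]
missing_labels = {"ratio_x1_x2": ["x1", "x2"]}

# Take the last row once, as a Series indexed by column label.
last = my_df.iloc[-1].copy()

# Add each derived label to that single row instead of to the full dataframe.
for label, (num, den) in missing_labels.items():
    last[label] = last[num] / last[den]

# Selecting with the list of labels keeps the required output order.
outpts = last[vars_outputs].to_numpy(dtype=float)
print(outpts)
```

Selecting with `vars_outputs` at the end is what preserves the order, so the order of entries in `missing_labels` does not matter.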

Thanks a lot.

2 solutions

#1


0  

Thanks @hpaulj, that might be very useful for me in the future. I wasn't aware of the speed-up from using fromiter().

import timeit

setup = '''
import numpy as np
H, W = 400, 400
# this flat list is built once in setup, so its construction is not timed
it = [(1 + 1 / (i + 0.5)) ** 2 for i in range(W) for j in range(H)]'''

fns = ['''
x = np.array([[(1 + 1 / (i + 0.5)) ** 2 for i in range(W)] for j in range(H)])
''', '''
x = np.fromiter(it, float)  # np.float was removed in NumPy 1.24; use the builtin float
x = x.reshape(H, W)  # reshape returns a new array, so keep the result
''']
for f in fns:
    print(timeit.timeit(f, setup=setup, number=100))
# gives me
# 6.905218548999983
# 0.5763416080008028

EDIT / PS: your for loop could be written as an iterable, for example a list comprehension like this:

it = [my_df[missing_labels[var][0]].iloc[-1]
        / my_df[missing_labels[var][1]].iloc[-1] if var in missing_labels
        else my_df[var].iloc[-1] for var in vars_outputs]

#2


1  

Here is the final code. np.fromiter() does the trick, and using a list comprehension reduces the number of lines.

df = pd.read_excel(path)
print(df.columns)

It outputs ['x1', 'x2'].

vars_outputs = ["x1", "x2", "ratio_x1_x2"]
missing_labels = {"ratio_x1_x2" : ["x1", "x2"]}

it = [df[missing_labels[var][0]].iloc[-1]/df[missing_labels[var][1]].iloc[-1] if var in missing_labels
        else df[var].iloc[-1] for var in vars_outputs]

t = np.fromiter(it, dtype = float)
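As a small variation on the code above, `np.fromiter` accepts any iterable, so a generator expression can be passed directly, and the optional `count` argument lets NumPy allocate the output once. This is only a sketch: the helper `last_value` and the in-memory dataframe (standing in for `pd.read_excel(path)`) are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel(path); the values are invented.
df = pd.DataFrame({"x1": [1.0, 2.0, 6.0], "x2": [4.0, 5.0, 3.0]})

vars_outputs = ["x1", "x2", "ratio_x1_x2"]
missing_labels = {"ratio_x1_x2": ["x1", "x2"]}


def last_value(var):
    """Return the last value for a label, computing the ratio for derived labels."""
    if var in missing_labels:
        num, den = missing_labels[var]
        return df[num].iloc[-1] / df[den].iloc[-1]
    return df[var].iloc[-1]


# A generator expression avoids building an intermediate list;
# count tells fromiter how many items to expect.
t = np.fromiter((last_value(var) for var in vars_outputs),
                dtype=float, count=len(vars_outputs))
print(t)
```

Moving the conditional into a named function is a readability choice; the one-line comprehension in the answer computes the same values.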
