I have an array called start_similarity_results
with size 47000*90000, where each element is a float between 0 and 1. For each row, I need to find the column indices at which the float is greater than a threshold, and from all these qualifying column indices I will randomly pick one. Right now my code looks like:
out_start = np.ones(47000) * -1
cur_row_start = 0
col_list_start = []
for r_start, c_start in zip(*np.nonzero(start_similarity_results >= similarity_threshold)):
    if r_start == cur_row_start:
        col_list_start.append(c_start)
    else:
        random.shuffle(col_list_start)
        if len(col_list_start) != 0:
            out_start[cur_row_start] = col_list_start[0]
        cur_row_start = r_start
        col_list_start = []
        col_list_start.append(c_start)
random.shuffle(col_list_start)
if len(col_list_start) != 0:
    out_start[cur_row_start] = col_list_start[0]
So in the end I get an array called out_start
with size 47000*1, where 47000 is the number of rows, in order, and each entry holds one column index for that row. I will use this array for further processing.
However, when I run my code, I get a memory error at
for r_start, c_start in zip(*(np.nonzero(start_similarity_results>=similarity_threshold)))
It seems my array (47000*90000) is too big for this step, so the program just stops. So I am wondering whether I can split my array into several parts and run them in parallel on multiple cores. The important thing is that I still get the same out_start
as I do now.
1 Answer
#1
Well first of all, multiprocessing or threading is not going to help you with a memory error.
Here's a function that I think should solve your problem, if I'm understanding it correctly. For each row, it gets a random column that is above threshold, or else -1:
import numpy as np
import random

def get_cols(x, thresh):
    out = []
    for row in x:
        above = np.where(row >= thresh)
        # test the number of matches, not above[0].any(): if the only
        # qualifying column is index 0, any() would wrongly report False
        if above[0].size:
            out.append(random.choice(above[0]))
        else:
            out.append(-1)
    return np.array(out)
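As a variant of the same per-row idea (a sketch, not part of the original answer): np.flatnonzero gives the qualifying indices directly, and NumPy's Generator API makes the random pick reproducible via a seed. The name get_cols_rng is mine; it keeps the same contract as get_cols.

```python
import numpy as np

def get_cols_rng(x, thresh, seed=None):
    """For each row, a random column index where x >= thresh, else -1."""
    rng = np.random.default_rng(seed)
    out = np.full(x.shape[0], -1, dtype=np.int64)
    for i, row in enumerate(x):
        above = np.flatnonzero(row >= thresh)  # indices of qualifying columns
        if above.size:
            out[i] = rng.choice(above)
    return out
```

Passing the same seed reproduces the same picks, which is handy when the downstream processing needs to be rerun.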
And here's the example input that you gave and the output:
x = np.array([[0.1, 0.2, 0.3, 0.4], [0.2, 0.1, 0.8, 0.02],
[0.4, 0.4, 0.8, 0.002], [0.5, 0.4, 0.2, 0.6],
[0.4, 0.8, 0.2, 0.65], [0.1, 0.1, 0.1, 0.1]])
print(get_cols(x, 0.3))
# three example runs (the column choice is random):
# [ 3  2  0  0  0 -1]
# [ 3  2  0  1  0 -1]
# [ 3  2  0  3  0 -1]
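To address the memory question itself: since get_cols only ever looks at one row at a time, you can feed it the big matrix in blocks of rows so that only a slice of the 47000*90000 array is resident at once. Below is a hedged sketch of that idea; get_cols_chunked and chunk_rows are my names, and the per-row logic is restated so the sketch is self-contained.

```python
import numpy as np
import random

def get_cols(x, thresh):
    # per-row random qualifying column, or -1 if none (as in the answer above)
    out = []
    for row in x:
        above = np.flatnonzero(row >= thresh)
        out.append(random.choice(above) if above.size else -1)
    return np.array(out)

def get_cols_chunked(x, thresh, chunk_rows=1000):
    """Process x in blocks of rows so only one block is in memory at a time."""
    parts = []
    for start in range(0, x.shape[0], chunk_rows):
        parts.append(get_cols(x[start:start + chunk_rows], thresh))
    return np.concatenate(parts)
```

If the matrix itself does not fit in RAM, the same loop works over an np.memmap opened on a file of the similarity scores, since slicing a memmap only reads the requested rows from disk.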