Working with big data in python and numpy (not enough RAM), how to save partial results on disk?

Time: 2022-03-14 03:09:39

I am trying to implement algorithms for 1000-dimensional data with 200k+ data points in Python. I want to use numpy, scipy, sklearn, networkx and other useful libraries. I want to perform operations such as computing the pairwise distances between all of the points and clustering all of the points. I have implemented working algorithms that do what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do: creating the matrix of pairwise distances for 200k+ points takes a lot of memory.


Here comes the catch: I would really like to do this on crappy computers with low amounts of RAM.


Is there a feasible way for me to make this work without the constraint of low RAM? That it will take a much longer time is really not a problem, as long as the time requirements don't go to infinity!


I would like to be able to put my algorithms to work and then come back an hour or five later and not have them stuck because they ran out of RAM! I would like to implement this in Python and be able to use the numpy, scipy, sklearn and networkx libraries. I would like to be able to calculate the pairwise distances between all my points, etc.


Is this feasible? And how would I go about it? What can I start to read up on?


Best regards // Mesmer


2 Answers

#1 (score: 35)

Using numpy.memmap you can create arrays that are mapped directly onto a file on disk:


import numpy
# 'w+' creates (or overwrites) the backing file on disk; only the pages you touch are held in RAM
a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000))
# here you will see a ~762 MB file created in your working directory (200000 * 1000 * 4 bytes)

You can treat it as a conventional array: a += 1000.

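Because it behaves like an ordinary array, you can already use this to attack the pairwise-distance problem from the question: keep both the data and the (huge) distance matrix on disk and fill the result block by block. A minimal sketch of that idea, assuming the data has already been written into 'test.mymemmap' as above; the output file name and the block size are just placeholders:

import numpy
from scipy.spatial.distance import cdist

n, dim = 200000, 1000
block = 1000  # rows per step; tune this to the RAM you can spare

X = numpy.memmap('test.mymemmap', dtype='float32', mode='r', shape=(n, dim))
# the full n x n distance matrix (roughly 150 GB as float32) also lives on disk, not in RAM
D = numpy.memmap('distances.mymemmap', dtype='float32', mode='w+', shape=(n, n))

for i in range(0, n, block):
    for j in range(0, n, block):
        # only two small blocks and their mutual distances are in memory at any time
        D[i:i+block, j:j+block] = cdist(X[i:i+block], X[j:j+block])
D.flush()  # make sure everything is written back to the file

This will be slow because it is disk-bound, but it never needs more than a couple of block-sized arrays in memory.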

It is even possible to assign more arrays to the same file, controlling it from different sources if needed. But I've experienced some tricky things here. To open the full array again you have to "close" the previous one first, using del:


del a
# re-open the same file as a full-size array (read/write, without truncating it)
a = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))

But opening only some part of the array makes it possible to achieve simultaneous control:


b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000))
b[1,5] = 123456.
print(a[1,5])
#123456.0

Great! a was changed together with b. And the changes are already written on disk.


The other important thing worth mentioning is the offset. Suppose you want not the first 2 rows of b, but rows 150000 and 150001.


b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000),
                 offset=150000*1000*32//8)
b[1,2] = 999999.
print(a[150001,2])
#999999.0

Now you can access and update any part of the array with simultaneous operations. Note the byte size going into the offset calculation: for 'float64' this example would use 150000*1000*64//8.

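To avoid getting the byte arithmetic wrong, the offset can also be derived from the dtype itself. A small sketch of the same windowed access, with the row index and file name taken from the example above:

import numpy

n_cols = 1000
row_start = 150000                      # first row of the window we want
dtype = numpy.dtype('float32')

# bytes to skip = rows_to_skip * values_per_row * bytes_per_value
offset = row_start * n_cols * dtype.itemsize

window = numpy.memmap('test.mymemmap', dtype=dtype, mode='r+',
                      shape=(2, n_cols), offset=offset)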


#2 (score: -2)

You could just ramp up the virtual memory on the OS and use 64-bit Python, provided it's a 64-bit OS.

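If you go this route, it is worth verifying that the interpreter itself is 64-bit, since a 32-bit Python cannot address more than a few GB no matter how much swap you add. A quick convenience sketch (works on any OS):

import struct
import sys

# a 64-bit interpreter reports 64 here; a 32-bit one reports 32
print(struct.calcsize('P') * 8)
print(sys.maxsize > 2**32)   # True on a 64-bit build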
