python - Save a numpy array to a file (as small a size as possible)

Time: 2021-11-15 02:05:15

Right now I have a python program building a fairly large 2D numpy array and saving it as a tab delimited text file using numpy.savetxt. The numpy array contains only floats. I then read the file in one row at a time in a separate C++ program.


What I would like to do is find a way to accomplish this same task, changing my code as little as possible such that I can decrease the size of the file I am passing between the two programs.


I found that I can use numpy.savetxt to save to a compressed .gz file instead of a text file. This lowers the file size from ~2MB to ~100kB.
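
For reference, a minimal sketch of the compressed-text route described above. The array contents and the file names are made up for illustration; `numpy.savetxt` writes gzip-compressed text when the file name ends in `.gz`, and `numpy.loadtxt` reads it back the same way.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: a 2D float array in which roughly 70% of the values are zero.
rng = np.random.default_rng(0)
a = rng.random((200, 200))
a[rng.random(a.shape) < 0.7] = 0.0

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "data.txt")
gz_path = os.path.join(tmpdir, "data.txt.gz")

# Same call in both cases; the ".gz" suffix makes savetxt write gzip-compressed text.
np.savetxt(plain_path, a, delimiter="\t", fmt="%.4f")
np.savetxt(gz_path, a, delimiter="\t", fmt="%.4f")

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print(plain_size, gz_size)

# loadtxt decompresses ".gz" files transparently.
b = np.loadtxt(gz_path, delimiter="\t")
```

The exact ratio depends on the data and on the `fmt` precision, so the sizes printed here will not match the figures quoted in the question.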


Is there a better way to do this? Could I, perhaps, write the numpy array in binary to the file to save space? If so, how would I do this so that I can still read it into the C++ program?


Thank you for the help. I appreciate any guidance I can get.




There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how I can exploit this, though, to generate a tiny file that my C++ program can read in.


5 Answers



Since you have a lot of zeroes, you could write out only the non-zero elements in the form (row, column, value).


Suppose you have an array with a small number of nonzero values:


In [4]: import numpy as np

In [5]: a = np.zeros((10, 10))

In [6]: a
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [7]: a[3,1] = 2.0

In [8]: a[7,4] = 17.0

In [9]: a[9,0] = 1.5

First, isolate the interesting numbers and their indices:


In [11]: x, y = a.nonzero()

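In [12]: list(zip(x.tolist(), y.tolist()))
Out[12]: [(3, 1), (7, 4), (9, 0)]

In [13]: # zip() is a one-shot iterator in Python 3, so keep the pairs in a list
    ...: nonzero = list(zip(x.tolist(), y.tolist()))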

Now you only have a small number of data elements left. The easiest thing is to write them to a text file:


In [17]: with open('numbers.txt', 'w+') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write('{:d} {:d} {:g}\n'.format(r, k, a[r,k]))

In [18]: cat numbers.txt
3 1 2
7 4 17
9 0 1.5

This also gives you an opportunity to eyeball the data. In your C++ program you can read this data with fscanf.
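
To make the reader's job concrete, here is a small sketch of the inverse step, written in Python only to illustrate the logic a C++ reader would follow. It assumes the reader already knows the array shape, because the sparse file does not store it; the `shape` variable is an assumption added for this example.

```python
import numpy as np

# Contents of numbers.txt from the example above: "row column value" per line.
text = "3 1 2\n7 4 17\n9 0 1.5\n"

# The sparse file does not store the array shape, so the reader must know it.
shape = (10, 10)
dense = np.zeros(shape)
for line in text.splitlines():
    r, k, value = line.split()
    dense[int(r), int(k)] = float(value)

print(np.count_nonzero(dense), dense[7, 4])
```

If the shape can vary, one option is to write it as a first line of the file, at the cost of one extra read on the C++ side.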


But you can reduce the size even more by writing binary data using struct:


In [17]: import struct

In [19]: c = struct.Struct('=IId')

In [20]: with open('numbers.bin', 'wb') as outf:  # binary mode: pack() returns bytes
   ....:     for r, k in nonzero:
   ....:         outf.write(c.pack(r, k, a[r,k]))

The argument to the Struct constructor means: use native byte order with standard sizes and no padding ('='). The first and second data elements are 32-bit unsigned integers ('I'), and the third element is a 64-bit double ('d'), so each record occupies 16 bytes.


In your C++ program this data is probably best read as binary data into a packed struct.
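
A quick way to check the on-disk layout before writing the C++ reader is to read the file back in Python. The sketch below is an illustration rather than part of the original answer: it writes the three example records with the same `'=IId'` format to a temporary file and parses them again with `struct.Struct.iter_unpack`. A C++ reader would need a matching 16-byte record: two 32-bit unsigned integers followed by a double, with no padding.

```python
import os
import struct
import tempfile

# The three (row, column, value) records from the example above.
records = [(3, 1, 2.0), (7, 4, 17.0), (9, 0, 1.5)]

# '=': native byte order, standard sizes, no padding.
# 'I': 32-bit unsigned int, 'd': 64-bit double -> 16 bytes per record.
rec = struct.Struct('=IId')

path = os.path.join(tempfile.mkdtemp(), 'numbers.bin')
with open(path, 'wb') as outf:  # binary mode is required for packed bytes
    for r, k, v in records:
        outf.write(rec.pack(r, k, v))

with open(path, 'rb') as inf:
    raw = inf.read()

# The file holds a whole number of fixed-size records.
restored = list(rec.iter_unpack(raw))
print(rec.size, len(raw), restored)
```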


EDIT: Answer updated for a 2D array.




Unless you are sure you don't need to worry about endianness and such, it is best to use numpy.savez, as explained in @unutbu's answer and @jorgeca's comment on the question "numpy's tostring/fromstring: what do I need to specify to restore the array".
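
As a sketch of this suggestion: `numpy.savez` stores arrays in an uncompressed `.npz` archive, and its sibling `numpy.savez_compressed` (not mentioned in the answer, swapped in here because file size is the goal) compresses the archive. The array contents and the key name `a` are illustrative. Note that a C++ reader would then have to parse the zip container and the `.npy` header, which this sketch does not cover.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: about 70% zeros.
rng = np.random.default_rng(0)
a = rng.random((200, 200))
a[rng.random(a.shape) < 0.7] = 0.0

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "raw.npz")
small_path = os.path.join(tmpdir, "small.npz")

np.savez(raw_path, a=a)                # zip archive, no compression
np.savez_compressed(small_path, a=a)   # same layout, compressed

raw_size = os.path.getsize(raw_path)
small_size = os.path.getsize(small_path)
print(raw_size, small_size)

# The .npy header inside the archive records dtype, shape, and byte order,
# so the array is restored exactly.
with np.load(small_path) as data:
    b = data["a"]
```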


If the resulting size is not small enough, there's always zlib (on python's side: import zlib, on the C++ side, I'm sure an implementation exists).
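
On the Python side this could look roughly like the sketch below. The layout is an assumption made for illustration: the raw little-endian float64 bytes of the array are compressed as a single zlib stream, and the array shape is presumed known to the reader. On the C++ side, the zlib C library's `uncompress` or `inflate` functions can decode such a stream.

```python
import zlib

import numpy as np

# Hypothetical data: about 70% zeros.
rng = np.random.default_rng(0)
a = rng.random((200, 200))
a[rng.random(a.shape) < 0.7] = 0.0

# Fix the byte order explicitly so the payload is the same on any machine.
payload = a.astype("<f8").tobytes()
packed = zlib.compress(payload, 9)  # 9 = strongest (slowest) compression level
print(len(payload), len(packed))

# Round trip: decompress and reinterpret the bytes with the known dtype and shape.
b = np.frombuffer(zlib.decompress(packed), dtype="<f8").reshape(a.shape)
```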


An alternative would be to use hdf5 format: while it does not necessarily reduce the on-disk file size, it does make saving/loading faster (this is what the format was designed for, large data arrays). There are both python and C++ readers/writers for hdf5.




numpy.ndarray.tofile and numpy.fromfile are useful for direct binary output/input from Python. std::ostream::write and std::istream::read are useful for binary output/input in C++.


You should be careful about endianness if the data are transferred from one machine to another.
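
A minimal sketch of this approach, assuming the reader knows the array shape in advance (`tofile` writes raw values only, with no header for shape or dtype). Casting to the explicit little-endian dtype `'<f8'` is one way to sidestep the endianness concern; the file name and array values are illustrative.

```python
import os
import tempfile

import numpy as np

a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

path = os.path.join(tempfile.mkdtemp(), "array.bin")

# Write raw little-endian doubles in row-major (C) order: 10 * 10 * 8 bytes.
a.astype("<f8").tofile(path)
size = os.path.getsize(path)

# Reading back requires the same dtype; the shape must be restored by hand.
b = np.fromfile(path, dtype="<f8").reshape(10, 10)
print(size, b[7, 4])
```

On a little-endian machine, a C++ program could read the same file with a single std::istream::read into a buffer of 100 doubles.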




Use an HDF5 file. They are really simple to use through h5py, and you can set a compression flag. Note that HDF5 also has a C++ interface.




If you don't mind installing additional packages (for both Python and C++), you can use BSON (Binary JSON).




