Compressing a series of JSON objects in Python while preserving serial reads?

Time: 2021-07-15 18:11:05

I have a bunch of JSON objects that I need to compress, as they're eating too much disk space: approximately 20 GB for a few million of them.

Ideally, I'd like to compress each object individually and then, when I need to read them, iteratively load and decompress each one. I tried doing this by creating a text file in which each line is a JSON object compressed via zlib, but it fails with a

decompress error due to a truncated stream,

which I believe is caused by the compressed strings containing newlines.

Anyone know of a good method to do this?

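For reference, one way to make the per-object approach described above work is to base64-encode each compressed record, so that the output can never contain raw newline bytes. A minimal sketch, with the file name `objects.b64` and the sample data invented for illustration:

```python
import base64
import json
import zlib

records = [{"id": i, "value": "x" * 50} for i in range(3)]

# writing: compress each object individually, then base64-encode
# so the line never contains a raw newline byte
with open("objects.b64", "w") as f:
    for obj in records:
        compressed = zlib.compress(json.dumps(obj).encode("utf-8"))
        f.write(base64.b64encode(compressed).decode("ascii") + "\n")

# reading: reverse the steps one line at a time
loaded = []
with open("objects.b64") as f:
    for line in f:
        data = zlib.decompress(base64.b64decode(line))
        loaded.append(json.loads(data))

print(loaded == records)  # → True
```

The base64 step costs roughly 33% extra space per record, which is why the answers below favor a single compressed stream instead.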

2 solutions

#1 (24 votes)

Just use a gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line, and read them line by line.


The object handles compression transparently and buffers reads, decompressing chunks as needed.

import gzip
import json

# writing
# writing
with gzip.GzipFile(jsonfilename, 'w') as outfile:
    for obj in objects:
        # GzipFile expects bytes in Python 3, so encode each line
        outfile.write((json.dumps(obj) + '\n').encode('utf-8'))

# reading
with gzip.GzipFile(jsonfilename, 'r') as infile:
    for line in infile:
        obj = json.loads(line)
        # process obj

This has the added advantage that the compression algorithm can exploit repetition across objects, improving the compression ratio.
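As a rough illustration of that advantage, the following sketch (the sample data is invented for the example) compares one shared gzip stream against compressing each line separately with zlib:

```python
import gzip
import json
import zlib

# many objects that share the same keys and structure
objects = [{"name": "user", "index": i, "tags": ["a", "b"]} for i in range(1000)]
lines = [json.dumps(o).encode("utf-8") for o in objects]

# one shared stream: the compressor sees repeated keys across objects
stream_size = len(gzip.compress(b"\n".join(lines)))

# per-object compression: each record pays the header and
# dictionary cost on its own
per_object_size = sum(len(zlib.compress(line)) for line in lines)

print(stream_size < per_object_size)  # → True
```

The gap grows with how repetitive the objects are; for millions of similar JSON records the shared stream typically wins by a wide margin.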

#2 (0 votes)

You might want to try an incremental json parser, such as jsaone.


That is, create a single JSON file containing all your objects, and parse it like this:

with gzip.GzipFile(file_path, 'r') as f_in:
    for key, val in jsaone.load(f_in):
        ...
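For illustration, the incremental-parsing idea itself can be sketched with the standard library's json.JSONDecoder.raw_decode, which consumes one object at a time from a buffer (unlike jsaone, this operates on an in-memory string rather than streaming from a file):

```python
import json

# a buffer of concatenated JSON objects, no newline framing required
blob = '{"a": 1} {"b": 2} {"c": 3}'

decoder = json.JSONDecoder()
objects = []
idx = 0
while idx < len(blob):
    # raw_decode parses one value starting at idx and reports
    # the position where it stopped
    obj, end = decoder.raw_decode(blob, idx)
    objects.append(obj)
    # skip any whitespace between objects
    while end < len(blob) and blob[end].isspace():
        end += 1
    idx = end

print(objects)  # → [{'a': 1}, {'b': 2}, {'c': 3}]
```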

This is quite similar to Martin's answer, wasting slightly more space but perhaps slightly more convenient.

EDIT: oh, by the way, it's probably fair to clarify that I wrote jsaone.
