Python: trying to deserialize multiple JSON objects in a file, each spanning multiple but consistently spaced lines

Time: 2022-09-15 11:49:18

Ok, after nearly a week of research I'm going to give SO a shot. I have a text file that looks as follows (showing 3 separate JSON objects as an example, but the file has 50K of these):

{
"zipcode":"00544",
"current":{"canwc":null,"cig":7000,"class":"observation"},
"triggers":[178,30,176,103,179,112,21,20,48,7,50,40,57]
}
{
"zipcode":"00601",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[12,23,34,28,100]
}
{
"zipcode":"00602",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[13,85,43,101,38,31]
}

I know how to work with JSON objects using the Python json library, but I'm having a challenge with how to create 50 thousand different JSON objects from reading the file. (Perhaps I'm not even thinking about this correctly, but ultimately I need to deserialize the data and load it into a database.) I've tried itertools, thinking that I need a generator, so I was able to use:

with open(file) as f:
    for line in itertools.islice(f, 0, 7): #since every 7 lines is a json object
        jfile = json.load(line)

But the above obviously won't work since it is not reading the 7 lines as a single JSON object, and I'm also not sure how to then iterate over the entire file and load the individual JSON objects.

The following would give me a list I can slice:

list(open(file))[:7]

Any help would be really appreciated.

Extremely close to what I need and I think literally one step away, but still struggling a little with iteration. This finally gets me an iterative printout of all of the dataframes, but how do I make it so that I can capture one giant dataframe with all of the pieces essentially concatenated? I could then export that final dataframe to csv etc. (Also, is there a better way to upload this result into a database rather than creating a giant dataframe first? A sketch of both ideas follows the code below.)

import itertools
import json
from itertools import chain

import pandas as pd

def lines_per_n(f, n):
    # yield the file in chunks of n lines joined into one string
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

def flatten(jfile):
    # flatten one level: join list values into a string, hoist nested dict keys
    for k, v in list(jfile.items()):
        if isinstance(v, list):
            jfile[k] = ','.join(str(x) for x in v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                jfile[kk] = vv
            del jfile[k]
    return jfile

with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            jfile = json.loads(chunk)
            pd.DataFrame(list(flatten(jfile).items()))
        except ValueError:
            pass
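For what it's worth, one way to get from per-object DataFrames to a single result is to collect the flattened dicts in a plain list and build one DataFrame at the end. The sketch below assumes the lines_per_n and flatten helpers above; the file names deadzips.csv and zips.db are made up for illustration:

import json
import sqlite3

import pandas as pd

rows = []
with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):           # helper defined above
        try:
            rows.append(flatten(json.loads(chunk)))
        except ValueError:
            pass                               # skip malformed chunks

# one DataFrame with a row per JSON object
df = pd.DataFrame(rows)
df.to_csv('deadzips.csv', index=False)

# DataFrame.to_sql writes the frame to a database table; called with
# if_exists='append' inside the loop it would also avoid building the
# giant frame in memory at all (zips.db is a hypothetical SQLite file)
with sqlite3.connect('zips.db') as conn:
    df.to_sql('deadzips', conn, if_exists='replace', index=False)

Appending plain dicts and building the frame once is also much cheaper than concatenating thousands of one-row DataFrames.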

2 Answers

#1 (23 votes)

Load 6 extra lines instead, and pass the string to json.loads():

import itertools
import json

with open(file) as f:
    for line in f:
        # slice the next 6 lines from the iterable, as a list
        lines = [line] + list(itertools.islice(f, 6))
        jfile = json.loads(''.join(lines))

        # do something with jfile

json.load() would slurp up the whole file rather than just the next object, and islice(f, 0, 7) would only ever read the first 7 lines rather than stepping through the file in 7-line blocks.

You can wrap reading a file in blocks of size N in a generator:

from itertools import islice, chain

def lines_per_n(f, n):
    # join the current line with the next n - 1 lines into one chunk
    for line in f:
        yield ''.join(chain([line], islice(f, n - 1)))

then use that to chunk up your input file:

with open(file) as f:
    for chunk in lines_per_n(f, 7):
        jfile = json.loads(chunk)

        # do something with jfile

Alternatively, if your blocks turn out to be of variable length, read until you have something that parses:

with open(file) as f:
    for line in f:
        while True:
            try:
                jfile = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)

        # do something with jfile

#2 (5 votes)

As stated elsewhere, a general solution is to read the file in pieces, append each piece to the last, and try to parse that new chunk. If it doesn't parse, continue until you get something that does. Once you have something that parses, return it, and restart the process. Lather, rinse, repeat until you run out of data.

Here is a succinct generator that will do this:

import json

def load_json_multiple(segments):
    # accumulate lines until they parse as a complete JSON value
    chunk = ""
    for segment in segments:
        chunk += segment
        try:
            yield json.loads(chunk)
            chunk = ""
        except ValueError:
            pass

Use it like this:

with open('foo.json') as f:
    for parsed_json in load_json_multiple(f):
        print(parsed_json)

I hope this helps.
