Extracting multiple JSON objects from one file with Python

Time: 2022-09-15 11:40:18

I am very new to JSON files. Suppose I have a JSON file containing multiple JSON objects, such as the following:

{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
  "Code":[{"event1":"A","result":"1"},…]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
  "Code":[{"event1":"B","result":"1"},…]}
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
  "Code":[{"event1":"B","result":"0"},…]}
…

I want to extract all "Timestamp" and "Usefulness" values into a data frame:

    Timestamp    Usefulness
 0   20140101      Yes
 1   20140102      No
 2   20140103      No
 …

Does anyone know a general way to deal with such problems? Thanks!

4 solutions

#1


11  

Use a JSON array, in the format:

[
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
  "Code":[{"event1":"A","result":"1"},…]},
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
  "Code":[{"event1":"B","result":"1"},…]},
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
  "Code":[{"event1":"B","result":"0"},…]},
...
]

Then load it in your Python code:

import json

with open('file.json') as f:
    data = json.load(f)

Now data is a list of dictionaries, one for each element of the array.

You can access it easily, e.g.:

data[0]["ID"]
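
As a minimal sketch of the question's end goal, assuming the file has already been rewritten as an array as above: the inline `text` below is a stand-in for the file contents, with the elided "Code" lists shortened for illustration. The two wanted fields can then be pulled out of each dictionary:

```python
import json

# Stand-in for the contents of file.json (an assumption for this sketch).
text = """
[
{"ID":"12345","Timestamp":"20140101","Usefulness":"Yes","Code":[{"event1":"A","result":"1"}]},
{"ID":"1A35B","Timestamp":"20140102","Usefulness":"No","Code":[{"event1":"B","result":"1"}]},
{"ID":"AA356","Timestamp":"20140103","Usefulness":"No","Code":[{"event1":"B","result":"0"}]}
]
"""

data = json.loads(text)

# Keep only the two wanted fields from every object.
rows = [{"Timestamp": d["Timestamp"], "Usefulness": d["Usefulness"]} for d in data]

for i, row in enumerate(rows):
    print(i, row["Timestamp"], row["Usefulness"])
```

If pandas is installed, passing `rows` to `pandas.DataFrame(rows)` should give the two-column frame asked for in the question.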

#2


4  

You can use json.JSONDecoder.raw_decode to decode arbitrarily big strings of "stacked" JSON (so long as they fit in memory). raw_decode stops once it has a valid object and returns the position just past the end of the parsed object. It's not documented, but you can pass this position back to raw_decode and it will start parsing again from that position. Unfortunately, raw_decode doesn't accept strings that have leading whitespace, so we need to search for the first non-whitespace character in the document.

from json import JSONDecoder, JSONDecodeError
import re

NOT_WHITESPACE = re.compile(r'[^\s]')

def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()

        try:
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj

s = """

{"a": 1}  


   [
1
,   
2
]


"""

for obj in decode_stacked(s):
    print(obj)

prints:

{'a': 1}
[1, 2]
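
Applied to the question's data, the same idea might look like the sketch below. It repeats a compact version of decode_stacked so that it runs on its own; the sample `document` is an assumption, with the elided "Code" lists shortened. Note that each object spans two lines here, which this approach handles without any changes to the file.

```python
from json import JSONDecoder
import re

NOT_WHITESPACE = re.compile(r"\S")

def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    # Same approach as above: skip whitespace, decode one value, repeat.
    while True:
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        obj, pos = decoder.raw_decode(document, match.start())
        yield obj

# Two objects from the question, each spread over two lines ("Code" shortened).
document = """
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
  "Code":[{"event1":"A","result":"1"}]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
  "Code":[{"event1":"B","result":"1"}]}
"""

# Collect the two wanted fields from every decoded object.
pairs = [(obj["Timestamp"], obj["Usefulness"]) for obj in decode_stacked(document)]
print(pairs)
```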

#3


2  

So, as was mentioned in a couple of comments, keeping the data in an array is simpler, but that solution does not scale well as the data set grows. You should really only load everything into a list when you need random access to objects in the array; otherwise, generators are the way to go. Below I have prototyped a reader function which reads each JSON object individually and returns a generator.

The basic idea is to have the reader split on the newline character "\n" (or "\r\n" on Windows). Python can do this by iterating over the file object line by line, or with the file.readline() function.

import json

def json_readr(file):
    # Yield one parsed object per line; the file is closed when iteration ends.
    with open(file, mode="r") as f:
        for line in f:
            yield json.loads(line)

However, this method only really works when each object sits on a single line of its own, separated from the next by a newline character. Below is an example of a writer that takes an array of JSON objects and saves each one on a new line.

def json_writr(file, json_objects):
    # Write each object as one JSON document per line.
    with open(file, mode="w") as f:
        for jsonobj in json_objects:
            f.write(json.dumps(jsonobj) + "\n")

You could also do the same operation with file.writelines() and a list comprehension:

...
    jsobjs = [json.dumps(j)+"\n" for j in json_objects]
    f.writelines(jsobjs)
...

And if you want to append the data instead of writing a new file, just change mode="w" to mode="a".

In the end, I find this helps a great deal, not only with readability when I open JSON files in a text editor, but also with using memory more efficiently.

On that note, if you change your mind at some point and want a list out of the reader, Python lets you pass a generator to list() and populate the list automatically. In other words, just write:

lst = list(json_readr(file))
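
To check that the writer and reader agree, a round trip can be sketched as follows. The two functions are restated so the snippet runs on its own; the sample `records`, the scratch file name `records.jsonl`, and the blank-line check in the reader are assumptions added for this illustration.

```python
import json
import os
import tempfile

def json_writr(file, json_objects):
    # One JSON document per line.
    with open(file, mode="w") as f:
        for jsonobj in json_objects:
            f.write(json.dumps(jsonobj) + "\n")

def json_readr(file):
    with open(file, mode="r") as f:
        for line in f:
            if line.strip():  # skip blank lines (an addition in this sketch)
                yield json.loads(line)

records = [
    {"ID": "12345", "Timestamp": "20140101", "Usefulness": "Yes"},
    {"ID": "1A35B", "Timestamp": "20140102", "Usefulness": "No"},
]

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
json_writr(path, records)

# Materialize the generator into a list and compare with the input.
lst = list(json_readr(path))
print(lst == records)
```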

Hope this helps. Sorry if it was a bit verbose.

#4


0  

As you parse through the objects, you are dealing with dictionaries. You can extract the values you need by looking them up by key, e.g. value = jsonDictionary['Usefulness'].

You can loop through the JSON objects with a for loop, e.g.:

for obj in bunchOfObjs:
    value = obj['Usefulness']
    # now do something with your value, e.g. insert it into pandas
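
One way to flesh out that loop, as a sketch only, is to gather the values into one list per column. The sample `bunchOfObjs` below is hypothetical, and its second object omits "Usefulness" on purpose to show how dict.get handles a missing key.

```python
# Hypothetical sample; the second object lacks "Usefulness" on purpose.
bunchOfObjs = [
    {"ID": "12345", "Timestamp": "20140101", "Usefulness": "Yes"},
    {"ID": "1A35B", "Timestamp": "20140102"},
]

columns = {"Timestamp": [], "Usefulness": []}
for obj in bunchOfObjs:
    for key in columns:
        # dict.get returns None instead of raising KeyError for a missing key.
        columns[key].append(obj.get(key))

print(columns)
```

Using obj[key] instead would raise KeyError on the second object, which may be preferable when every record is expected to be complete.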
