How to read a large JSON file in pandas?

Time: 2022-12-18 20:22:47

My code is: data_review = pd.read_json('review.json'). Each review in the data looks as follows:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

But I got the following error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument

My JSON file does not contain any comments and is 3.8 GB! I just downloaded the file from here (link) to practice with.

When I use the following code, it throws the same error:

import json
with open('review.json') as json_file:
    data = json.load(json_file)

1 solution

#1


3  

Perhaps the file you are reading contains multiple JSON objects rather than the single JSON object or array that json.load(json_file) and pd.read_json('review.json') expect. These methods are meant to read files containing a single JSON document.

From what I have seen of the Yelp dataset, your file probably contains something like:

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on.

Hence, it is important to realize that this is not a single JSON document but multiple JSON objects in one file, one per line.
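
To make the difference concrete, here is a minimal sketch using two made-up records (the `sample` string is hypothetical and not taken from review.json). It only illustrates the format issue: parsing the whole text at once raises a JSONDecodeError, which is not the OSError shown in the question, while parsing line by line succeeds.

```python
import json

# Two hypothetical records in the one-object-per-line layout.
sample = (
    '{"review_id": "xxxxx", "stars": 5}\n'
    '{"review_id": "yyyy", "stars": 3}\n'
)

# Parsing the whole text as one document fails: after the first object
# the parser finds more data instead of the end of input.
try:
    json.loads(sample)
    whole_file_ok = True
except json.JSONDecodeError as exc:
    whole_file_ok = False
    print("whole-file parse failed:", exc.msg)

# Parsing each non-empty line separately works.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(records), records[0]["stars"])
```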

To read this data into a pandas DataFrame, the following solution should work:

import json
import pandas as pd

with open('review.json') as json_file:
    data = json_file.readlines()
    # The line below may take at least 8-10 minutes for 4-5 million rows.
    # It converts each string in the list into a parsed JSON object (a dict).
    data = list(map(json.loads, data))

df = pd.DataFrame(data)

Given how large the data is, I think your machine will take a considerable amount of time to load it into a DataFrame.
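
If holding every parsed row in memory at once is too heavy, one alternative (not part of the answer above) is pandas' own line-delimited reader: pd.read_json accepts lines=True, and with chunksize it returns an iterator of DataFrames instead of one large frame. The sketch below assumes the file really is one JSON object per line. It runs against a tiny temporary stand-in file with made-up rows, and the chunk size of 2 is purely illustrative.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for the real reviews.
rows = [{"review_id": f"r{i}", "stars": i % 5 + 1, "useful": 0} for i in range(5)]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small one-object-per-line file as a stand-in for review.json.
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")

    # lines=True treats each line as a separate JSON object;
    # chunksize makes read_json yield DataFrames of at most that many rows.
    chunks = []
    for chunk in pd.read_json(path, lines=True, chunksize=2):
        # Keep only the columns that are needed to limit memory use.
        chunks.append(chunk[["review_id", "stars"]])

df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```

For the real file, a much larger chunksize would be the natural choice, and filtering or aggregating each chunk before appending it keeps the final DataFrame small.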
