
Time: 2022-04-05 16:01:23

I have a movie dataset; the structure is this:


{u'detail_url': u'',
 u'douban_info': {u'aka': [u'\u5929\u5751\u575f\u5730',
                  u'alt': u'',
                  u'casts': [{u'alt': u'',
                              u'avatars': {u'large': u'',
                                           u'medium': u'',
                                           u'small': u''},
                              u'id': u'1314499',
                              u'name': u'\u7ea6\u745f\u592b\xb7\u6469\u6839'},
                             {u'alt': u'',
                              u'avatars': {u'large': u'',
                                           u'medium': u'',
                                           u'small': u''},
                              u'id': u'1036300',
                              u'name': u'\u6c99\u5c14\u6258\xb7\u79d1\u666e\u96f7'},
                             {u'alt': u'',
                              u'avatars': {u'large': u'',
                                           u'medium': u'',
                                           u'small': u''},
                              u'id': u'1049595',
                              u'name': u'\u6258\u9a6c\u65af\xb7\u514b\u83b1\u8212\u66fc'},
                             {u'alt': u'',
                              u'avatars': {u'large': u'',
                                           u'medium': u'',
                                           u'small': u''},
                              u'id': u'1318450',
                              u'name': u'\u827e\u7433\xb7\u7406\u67e5\u5179'}],
                  u'collect_count': 1507,
                  u'comments_count': 468,
                  u'countries': [u'\u7f8e\u56fd'],
                  u'current_season': None,
                  u'directors': [{u'alt': u'',
                                  u'avatars': {u'large': u'',
                                               u'medium': u'',
                                               u'small': u''},
                                  u'id': u'1302444',
                                  u'name': u'\u5188\u624e\u7f57\xb7\u6d1b\u4f69\u5179-\u52a0\u52d2\u679c'}],
                  u'do_count': None,
                  u'douban_site': u'',
                  u'episodes_count': None,
                  u'genres': [u'\u6050\u6016'],
                  u'id': u'13899371',
                  u'images': {u'large': u'',
                              u'medium': u'',
                              u'small': u''},
                  u'mobile_url': u'',
                  u'original_title': u'Open Grave',
                  u'rating': {u'average': 5.6,
                              u'max': 10,
                              u'min': 0,
                              u'stars': u'30'},
                  u'ratings_count': 1283,
                  u'reviews_count': 6,
                  u'schedule_url': u'',
                  u'seasons_count': None,
                  u'subtype': u'movie',
                  u'summary': u'\u672c\u8bb2\u8ff0\u4e86\u67d0\u4e2a\u504f\u50fb\u8352\u51c9\u7684\u68ee\u6797\u91cc\uff0c\u516d\u4e2a\u4eba\u5728\u4e00\u5904\u6709\u8150\u5c38\u7684\u9732\u5929\u575f\u573a\u65c1\u9192\u6765\uff0c\u5374\u597d\u50cf\u5f97\u4e86\u5065\u5fd8\u4e00\u822c\u7684\u6050\u6016\u95f9\u9b3c\u6545\u4e8b\u3002\u4ed6\u4eec\u65e0\u5904\u53ef\u53bb\uff0c\u88ab\u8feb\u628a\u795e\u79d8\u4e8b\u4ef6\u7684\u7ebf\u7d22\u62fc\u51d1\u8d77\u6765\uff0c\u6700\u7ec8\u5c06\u5e26\u9886\u4ed6\u4eec\u8d70\u8fdb\u60ca\u4eba\u7684\u7ed3\u5c40\uff0c\u800c\u4e0d\u81f3\u4e8e\u4f7f\u771f\u76f8\u6765\u7684\u592a\u665a\u3002',
                  u'title': u'\u5f00\u68fa',
                  u'wish_count': 640,
                  u'year': u'2013'},
 u'movie_tt_id': u'tt2071550',
 u'name': u'Open Grave ',
 u'omdb_info': {u'Actors': u'Sharlto Copley, Thomas Kretschmann, Josie Ho, Joseph Morgan',
                u'Awards': u'2 nominations.',
                u'Country': u'USA',
                u'Director': u'Gonzalo L\xf3pez-Gallego',
                u'Genre': u'Horror, Mystery, Thriller',
                u'Language': u'English',
                u'Metascore': u'33',
                u'Plot': u'A man wakes up in the wilderness, in a pit full of dead bodies, with no memory and must determine if the murderer is one of the strangers who rescued him, or if he himself is the killer.',
                u'Poster': u'',
                u'Rated': u'R',
                u'Released': u'3 Jan 2014',
                u'Response': u'True',
                u'Runtime': u'102 min',
                u'Title': u'Open Grave',
                u'Type': u'movie',
                u'Writer': u'Eddie Borey, Chris Borey',
                u'Year': u'2013',
                u'imdbID': u'tt2071550',
                u'imdbRating': u'6.3',
                u'imdbVotes': u'18896'}}

So it's a deeply nested dataset. To read it into pandas, I think there are two options:


  1. Just extract the necessary info from the inner nested data and make it a column in the dataframe.
  2. Turn the nested data into dataframes as well, and then merge them into the parent dataset.
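For illustration, here is a minimal sketch of what the two options could look like with plain dictionary access. The `movies` list is a made-up, heavily shortened stand-in for the real file (the second record and the cast names are invented), and the column names `douban_avg`, `imdb_rating`, `cast_id` and `cast_name` are only assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical, shortened records shaped like the sample above.
movies = [
    {
        "movie_tt_id": "tt2071550",
        "name": "Open Grave",
        "douban_info": {
            "rating": {"average": 5.6},
            "casts": [
                {"id": "1314499", "name": "Cast A"},
                {"id": "1036300", "name": "Cast B"},
            ],
        },
        "omdb_info": {"imdbRating": "6.3"},
    },
    {
        "movie_tt_id": "tt0000001",  # invented second record
        "name": "Example Movie",
        "douban_info": {
            "rating": {"average": 7.0},
            "casts": [{"id": "1", "name": "Cast C"}],
        },
        "omdb_info": {"imdbRating": "7.1"},
    },
]

# Option 1: pull only the needed nested values into flat columns.
flat = pd.DataFrame(
    [
        {
            "movie_tt_id": m["movie_tt_id"],
            "name": m["name"],
            "douban_avg": m["douban_info"]["rating"]["average"],
            "imdb_rating": float(m["omdb_info"]["imdbRating"]),
        }
        for m in movies
    ]
)

# Option 2: turn a nested list into its own dataframe (one row per cast
# member), keep the parent key, and merge it back onto the parent rows.
casts = pd.DataFrame(
    [
        {"movie_tt_id": m["movie_tt_id"], "cast_id": c["id"], "cast_name": c["name"]}
        for m in movies
        for c in m["douban_info"]["casts"]
    ]
)
merged = casts.merge(flat, on="movie_tt_id", how="left")
```

Option 1 keeps one row per movie, while option 2 produces one row per cast member, so which one fits depends on the level at which the data will be analysed.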


= = = = = = = = = = = = = =

I'm not sure what the current best practice is for this, and I ran into these problems when trying the above:


1. I don't know how to extract the inner JSON.


import json
import pandas as pd

# Raw string so the backslashes in the Windows path are kept literally
with open(r'..\lib\movie_list_2014_v2.json') as data_file:
    data = json.load(data_file)

# Build a DataFrame directly from the parsed JSON (this is the call that fails)
pd_data = pd.DataFrame(data)

This gives me an error; I believe it's because [omdb_info] is unparsed JSON.


2. I looked into books, and it looks like there is no automatic conversion for reading nested data into a dataframe, so I would need to manually turn each nested part into a dataframe. I think this is very painful, since there is a lot of nested info in douban_info.


1 solution



You can use the pandas json_normalize function to flatten the JSON, although it will result in some long column names for the more deeply nested data.


# pandas 1.0 and later; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize

result = json_normalize(movies)

From there, you can deal with only the columns that you need.
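As a rough sketch of how that might look end to end, assuming pandas 1.0 or later (where the function is available as `pd.json_normalize`): the `movies` list below is a made-up, shortened stand-in for the real data, and the cast names are invented. Nested keys end up joined with a dot in the column names, and a nested list such as `casts` can be expanded into its own dataframe with `record_path` and `meta`.

```python
import pandas as pd

# Hypothetical, shortened record shaped like the sample in the question.
movies = [
    {
        "movie_tt_id": "tt2071550",
        "name": "Open Grave",
        "douban_info": {
            "rating": {"average": 5.6, "max": 10},
            "casts": [
                {"id": "1314499", "name": "Cast A"},
                {"id": "1036300", "name": "Cast B"},
            ],
        },
        "omdb_info": {"imdbRating": "6.3", "Runtime": "102 min"},
    }
]

# Flatten nested dicts; column names become e.g. "douban_info.rating.average".
result = pd.json_normalize(movies)

# Keep only the columns that are actually needed.
wanted = result[
    ["movie_tt_id", "name", "douban_info.rating.average", "omdb_info.imdbRating"]
]

# Lists of dicts are not flattened by default; expand them separately,
# carrying the parent key along so the rows can be joined back later.
casts = pd.json_normalize(
    movies,
    record_path=["douban_info", "casts"],
    meta=["movie_tt_id"],
)
```

The `casts` frame has one row per cast member and can be merged back onto `wanted` using `movie_tt_id` if a combined table is needed.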




