Python segmentation fault with json.loads - alternative way to load JSON into a list?

Posted: 2022-04-29 07:18:47

I am trying to load some JSON Twitter data into a list, but instead I'm getting a segmentation fault (core dumped).


While I would love to upgrade my memory, that simply isn't an option right now. I would like to know if there is some way to iterate over this data instead of trying to load it all into memory? Or maybe there is a different kind of solution to this problem that will allow me to load this JSON data into a list?


In [1]: import json

In [2]: data = []

In [3]: for i in open('tweets.json'):
   ...:     try:
   ...:         data.append(json.loads(i))
   ...:     except:
   ...:         pass
   ...:     

Segmentation fault (core dumped)

The data was collected using the Twitter Streaming API over about 10 days and is 213M in size.


Machine Specs:

  • Oracle VM VirtualBox

  • Operating System: Ubuntu (64 bit)

  • Base Memory: 1024 MB

  • Video Memory: 128 MB

  • Storage (Virtual Size): 8.00 GB Dynamically allocated

I'm using iPython (version 2.7.6), and accessing it through a Linux terminal window.


1 solution

#1



On almost any modern machine, a 213MB file is very tiny and easily fits into memory. I've loaded larger tweet datasets into memory on average modern machines. But perhaps you (or someone else reading this later) aren't working on a modern machine, or it is a modern machine with an especially small memory capacity.


If it is indeed the size of the data causing the segmentation fault, then you may try the ijson module for iterating over chunks of the JSON document.


Here's an example adapted from that project's page (with the imports and output stream it needs added so it runs as-is):


import sys
import ijson
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

stream = sys.stdout  # any writable file-like object for the XML output

# Parse the JSON document incrementally instead of decoding it all at once.
parser = ijson.parse(urlopen('http://.../'))
stream.write('<geo>')
for prefix, event, value in parser:
    if (prefix, event) == ('earth', 'map_key'):
        stream.write('<%s>' % value)
        continent = value
    elif prefix.endswith('.name'):
        stream.write('<object name="%s"/>' % value)
    elif (prefix, event) == ('earth.%s' % continent, 'end_map'):
        stream.write('</%s>' % continent)
stream.write('</geo>')
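
Separately, the Twitter Streaming API normally writes one JSON object per line, which is exactly what the loop in the question assumes. If that is the case for tweets.json, the whole dataset never needs to sit in a list: a generator can parse and hand back one tweet at a time, so only what you actually keep stays in memory. Below is a minimal sketch along those lines, assuming newline-delimited JSON; the 'text' field and the final list comprehension are only placeholders for whatever processing you really need.

import json

def iter_tweets(path):
    """Yield one parsed tweet at a time from a newline-delimited JSON file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except ValueError:
                # Skip truncated or corrupt lines instead of aborting.
                continue

# Example use: keep only the tweet texts; peak memory stays close to
# one line of the file plus whatever you choose to retain.
texts = [t['text'] for t in iter_tweets('tweets.json') if 'text' in t]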
