Reading JSON objects from a large file

Date: 2022-09-15 13:02:30

I am looking for a JSON parser that lets me iterate over the JSON objects in a large JSON file (a few hundred MB in size). I tried JsonTextReader from Json.NET, as shown below:

JsonTextReader reader = new JsonTextReader(new StringReader(json));
while (reader.Read())
{
    if (reader.Value != null)
       Console.WriteLine("Token: {0}, Value: {1}", reader.TokenType, reader.Value);
    else
       Console.WriteLine("Token: {0}", reader.TokenType);
}

But it returns the content one token at a time.
Is there a simpler way to get whole objects instead of individual tokens?

3 answers

#1


2  

Let's assume you have a JSON array similar to this:

[{"text":"0"},{"text":"1"}......]

I'll declare a class for the object type:

public class TempClass
{
    public string text;
}

Now, the deserialization part:

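JsonSerializer ser = new JsonSerializer();
ser.Converters.Add(new DummyConverter<TempClass>(t =>
    {
        // Callback invoked once per deserialized object
        Console.WriteLine(t.text);
    }));

// Read from a stream so the file is never loaded into a single string
using (var reader = new JsonTextReader(new StreamReader(File.OpenRead(fName))))
{
    ser.Deserialize(reader, typeof(List<TempClass>));
}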

And a dummy JsonConverter class to intercept the deserialization:

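public class DummyConverter<T> : JsonConverter
{
    private readonly Action<T> _action;

    public DummyConverter(Action<T> action)
    {
        _action = action;
    }

    public override bool CanConvert(Type objectType)
    {
        // Match the generic type argument rather than a hard-coded class
        return objectType == typeof(T);
    }

    public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer)
    {
        // Remove this converter so the nested Deserialize call below uses
        // the default deserialization instead of calling ReadJson again
        serializer.Converters.Remove(this);
        T item = serializer.Deserialize<T>(reader);
        _action(item);
        // Return null so the deserialized object is not retained in the list
        return null;
    }

    public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer)
    {
        // This converter is read-only
        throw new NotImplementedException();
    }
}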
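
For comparison, the same one-object-at-a-time effect can also be sketched without a converter, by advancing a JsonTextReader to each StartObject token and letting the serializer deserialize from that position. This is only a minimal sketch, not part of the answer above: it assumes the root of the JSON is an array of objects, and the names JsonArrayStreamer and ReadObjects are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class TempClass
{
    public string text;
}

// Hypothetical helper (name chosen for this sketch only).
public static class JsonArrayStreamer
{
    // Lazily yields one T per element of a top-level JSON array.
    // Assumption: the root value is an array of objects.
    public static IEnumerable<T> ReadObjects<T>(TextReader input)
    {
        var serializer = new JsonSerializer();
        using (var reader = new JsonTextReader(input))
        {
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    // Deserialize starts at the current token and consumes
                    // everything up to the matching end of this object.
                    yield return serializer.Deserialize<T>(reader);
                }
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // A small in-memory sample; for a real file, pass File.OpenText(path) instead.
        var sample = new StringReader("[{\"text\":\"0\"},{\"text\":\"1\"}]");
        foreach (TempClass item in JsonArrayStreamer.ReadObjects<TempClass>(sample))
        {
            Console.WriteLine(item.text);
        }
    }
}
```

Because the method is an iterator, only the element currently being processed is materialized, so the caller can stop early by breaking out of the foreach loop.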

#2


1  

I would use the Json.NET library. The NuGet command to install it is: Install-Package Newtonsoft.Json

#3


1  

This is one of the use cases I had in mind for my own parser/deserializer.

I recently put together a simple example (feeding the parser with JSON text read through a StreamReader) that deserializes this JSON shape:

{ 
"fathers" : [ 
{ 
  "id" : 0,
  "married" : true,
  "name" : "John Lee",
  "sons" : [ 
    { 
      "age" : 15,
      "name" : "Ronald"
      }
    ],
  "daughters" : [ 
    { 
      "age" : 7,
      "name" : "Amy"
      },
    { 
      "age" : 29,
      "name" : "Carol"
      },
    { 
      "age" : 14,
      "name" : "Barbara"
      }
    ]
  },
{ 
  "id" : 1,
  "married" : false,
  "name" : "Kenneth Gonzalez",
  "sons" : [
    ],
  "daughters" : [
    ]
  },
{ 
  "id" : 2,
  "married" : false,
  "name" : "Larry Lee",
  "sons" : [ 
    { 
      "age" : 4,
      "name" : "Anthony"
      },
    { 
      "age" : 2,
      "name" : "Donald"
      }
    ],
  "daughters" : [ 
    { 
      "age" : 7,
      "name" : "Elizabeth"
      },
    { 
      "age" : 15,
      "name" : "Betty"
      }
    ]
  },

  //(... etc)
  ]
}

... into these POCOs:

https://github.com/ysharplanguage/FastJsonParser#POCOs

(specifically: "FathersData", "Father", "Son", "Daughter")

That sample also shows:

(1) a sample filter on the relative item index in the Father[] array (e.g., to fetch only the first 10), and

(2) how to dynamically populate a property of each father's daughters as the deserialization of their father returns (that is, through a delegate that the caller passes to the parser's Parse method for callback purposes).
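
To make the two ideas concrete (an index filter on the array, plus a caller-supplied delegate invoked per item), here is a rough sketch that uses Json.NET rather than the parser described in this answer. It does not show FastJsonParser's API; FathersFilter and ForEachFather are hypothetical names, and the sketch assumes the first property named "fathers" in the document is the array of interest.

```csharp
using System;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Hypothetical helper written with Json.NET; not FastJsonParser's API.
public static class FathersFilter
{
    // Invokes onFather for at most `limit` elements of the "fathers" array
    // and returns how many elements were delivered.
    public static int ForEachFather(TextReader input, int limit, Action<JObject> onFather)
    {
        using (var reader = new JsonTextReader(input))
        {
            // Skip ahead to the first property named "fathers".
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.PropertyName &&
                    (string)reader.Value == "fathers")
                    break;
            }

            // The property value must be an array; otherwise there is nothing to do.
            if (!reader.Read() || reader.TokenType != JsonToken.StartArray)
                return 0;

            int count = 0;
            while (count < limit && reader.Read() &&
                   reader.TokenType == JsonToken.StartObject)
            {
                // Load materializes only this one element (with its nested arrays).
                JObject father = JObject.Load(reader);
                onFather(father);
                count++;
            }
            return count;
        }
    }
}

public static class FilterDemo
{
    public static void Main()
    {
        string json = "{\"fathers\":[{\"id\":0,\"name\":\"John Lee\"}," +
                      "{\"id\":1,\"name\":\"Kenneth Gonzalez\"}," +
                      "{\"id\":2,\"name\":\"Larry Lee\"}]}";

        // Deliver only the first two fathers to the callback.
        int delivered = FathersFilter.ForEachFather(
            new StringReader(json), 2,
            father => Console.WriteLine((string)father["name"]));
        Console.WriteLine(delivered);
    }
}
```

Once the limit is reached the method returns without reading the rest of the input, which is the point of filtering on the item index while streaming.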

For the remaining details, see:

ParserTests.cs : static void FilteredFatherStreamTestDaughterMaidenNamesFixup()

(lines #829 to #904)

The performance I observe on my humble laptop (*) when parsing JSON files of roughly 12MB to 180MB and deserializing an arbitrary subset of their content into POCOs (or into loosely-typed dictionaries of (string, object) key/value pairs, which are also supported) is in the ballpark of ~20MB/sec to 40MB/sec (**). For example, the 12MB JSON file takes about 300 milliseconds to deserialize into POCOs.

More detailed info is available here:

https://github.com/ysharplanguage/FastJsonParser#Performance

HTH,

(*) (running Win7 64bit @ 2.5GHz)

(**) (the throughput depends heavily on the shape and complexity of the input JSON, e.g., the nesting depth of sub-objects, and other factors)
