Parsing a large XML file with Ruby and Nokogiri

Date: 2021-11-12 00:50:55

I have a large XML file (about 10K rows) that I need to parse regularly, in this format:

<summarysection>
    <totalcount>10000</totalcount>
</summarysection>
<items>
     <item>
         <cat>Category</cat>
         <name>Name 1</name>
         <value>Val 1</value>
     </item>
     ...... 10,000 more times
</items>

What I'd like to do is parse each of the individual nodes using Nokogiri to count the number of items in one category. Then, I'd like to subtract that number from the total count to get an output that reads "Count of Interest_Category: n, Count of All Else: z".

This is my code now:

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0 
xmlfeed = Nokogiri::XML(open("/path/to/file/all.xml"))
all_items = xmlfeed.xpath("//items")

  all_items.each do |adv|
            if (adv.children.filter("cat").first.child.inner_text.include? "partofcatname")
                icount = icount + 1
            end
  end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount 

puts icount
puts othercount

This seems to work, but it's very slow! I'm talking more than 10 minutes for 10,000 items. Is there a better way to do this? Am I doing something in a less-than-optimal fashion?

5 Answers

#1 (3 votes)

You can dramatically decrease the execution time by changing your code to the following. Just change "99" to whatever category you want to check:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0 
xmlfeed = Nokogiri::XML(open("test.xml"))
items = xmlfeed.xpath("//item")
items.each do |item|
  text = item.children.children.first.text # grandchild text node; assumes <cat> is the first child of each <item>
  if ( text =~ /99/ )
    icount += 1
  end
end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount 

puts icount
puts othercount

This took about three seconds on my machine. I think a key error you made was iterating over the "items" node instead of creating a collection of the "item" nodes. That made your iteration code awkward and slow.

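For comparison, a minimal sketch (not part of the original answer) that skips the explicit Ruby loop by selecting the <cat> nodes directly with XPath; "partofcatname" is the asker's placeholder substring:

require 'rubygems'
require 'nokogiri'

xmlfeed = File.open("test.xml") { |f| Nokogiri::XML(f) }

# Count <cat> nodes whose text contains the substring of interest
icount = xmlfeed.xpath("//item/cat").count { |cat| cat.text.include?("partofcatname") }
othercount = xmlfeed.xpath("//totalcount").text.to_i - icount

puts icount
puts othercount
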
#2 (23 votes)

Here's an example comparing a SAX parser count with a DOM-based count, counting 500,000 <item>s with one of seven categories. First, the output:

Create XML file: 1.7s
Count via SAX: 12.9s
Create DOM: 1.6s
Count via DOM: 2.5s

Both techniques produce the same hash counting the number of each category seen:

{"Cats"=>71423, "Llamas"=>71290, "Pigs"=>71730, "Sheep"=>71491, "Dogs"=>71331, "Cows"=>71536, "Hogs"=>71199}

The SAX version takes 12.9s to count and categorize, while the DOM version takes only 1.6s to create the DOM elements and 2.5s more to find and categorize all the <cat> values. The DOM version is around 3x as fast!

…but that's not the entire story. We have to look at RAM usage as well.

  • For 500,000 items SAX (12.9s) peaks at 238MB of RAM; DOM (4.1s) peaks at 1.0GB.
  • For 1,000,000 items SAX (25.5s) peaks at 243MB of RAM; DOM (8.1s) peaks at 2.0GB.
  • For 2,000,000 items SAX (55.1s) peaks at 250MB of RAM; DOM (???) peaks at 3.2GB.

I had enough memory on my machine to handle 1,000,000 items, but at 2,000,000 I ran out of RAM and had to start using virtual memory. Even with an SSD and a fast machine I let the DOM code run for almost ten minutes before finally killing it.

It is very likely that the long times you are reporting are because you are running out of RAM and hitting the disk continuously as part of virtual memory. If you can fit the DOM into memory, use it, as it is FAST. If you can't, however, you really have to use the SAX version.

Here's the test code:

require 'nokogiri'

CATEGORIES = %w[ Cats Dogs Hogs Cows Sheep Pigs Llamas ]
ITEM_COUNT = 500_000

def test!
  create_xml
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_sax
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_dom
end

def time(label)
  t1 = Time.now
  yield.tap{ puts "%s: %.1fs" % [ label, Time.now-t1 ] }
end

def test_sax
  item_counts = time("Count via SAX") do
    counter = CategoryCounter.new
    # Use parse_file so we can stream data from disk instead of flooding RAM
    Nokogiri::HTML::SAX::Parser.new(counter).parse_file('tmp.xml')
    counter.category_counts
  end
  # p item_counts
end

def test_dom
  doc = time("Create DOM"){ File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) } }
  counts = time("Count via DOM") do
    counts = Hash.new(0)
    doc.xpath('//cat').each do |cat|
      counts[cat.children[0].content] += 1
    end
    counts
  end
  # p counts
end

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

def create_xml
  time("Create XML file") do
    File.open('tmp.xml','w') do |f|
      f << "<root>
      <summarysection><totalcount>10000</totalcount></summarysection>
      <items>
      #{
        ITEM_COUNT.times.map{ |i|
          "<item>
            <cat>#{CATEGORIES.sample}</cat>
            <name>Name #{i}</name>
            <value>Value #{i}</value>
          </item>"
        }.join("\n")
      }
      </items>
      </root>"
    end
  end
end

test! if __FILE__ == $0
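
The answer doesn't show how the RAM peaks above were measured. As an aside, one simple way to sample the current process's resident set size from Ruby on Linux or macOS (an assumption on my part, not the author's method) is to shell out to ps:

def rss_mb
  # `ps -o rss=` prints the resident set size in kilobytes
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

puts "RSS now: #{rss_mb}MB"  # call before/after each phase to see the deltas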

How does the DOM counting work?

If we strip away some of the test structure, the DOM-based counter looks like this:

# Open the file on disk and pass it to Nokogiri so that it can stream read;
# Better than  doc = Nokogiri.XML(IO.read('tmp.xml'))
# which requires us to load a huge string into memory just to parse it
doc = File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) }

# Create a hash with default '0' values for any 'missing' keys
counts = Hash.new(0) 

# Find every `<cat>` element in the document (assumes one per <item>)
doc.xpath('//cat').each do |cat|
  # Get the child text node's content and use it as the key to the hash
  counts[cat.children[0].content] += 1
end
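
To produce the exact output the original question asked for, the counts hash can be combined with the <totalcount> value; a short follow-on sketch ("Interest_Category" is a stand-in for the real category name):

interest = counts["Interest_Category"]
total    = doc.xpath('//totalcount').text.to_i
puts "Count of Interest_Category: #{interest}, Count of All Else: #{total - interest}"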

How does the SAX counting work?

First, let's focus on this code:

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

When we create a new instance of this class we get an object that has a Hash that defaults to 0 for all values, and a couple of methods that can be called on it. The SAX Parser will call these methods as it runs through the document.

  • Each time the SAX parser sees a new element it will call the start_element method on this class. When that happens, we set a flag based on whether the element is named "cat" (so that we know whether the next run of text is a category name).

  • Each time the SAX parser slurps up a chunk of text it calls the characters method of our object. When that happens, we check to see if the last element we saw was a category (i.e. if @count was set to true); if so, we use the value of this text node as the category name and add one to our counter.

To use our custom object with Nokogiri's SAX parser we do this:

# Create a new instance, with its empty hash
counter = CategoryCounter.new

# Create a new parser that will call methods on our object, and then
# use `parse_file` so that it streams data from disk instead of flooding RAM
Nokogiri::HTML::SAX::Parser.new(counter).parse_file('tmp.xml')

# Once that's done, we can get the hash of category counts back from our object
counts = counter.category_counts
p counts["Pigs"]

#3 (3 votes)

I'd recommend using a SAX parser rather than a DOM parser for a file this large. Nokogiri has a nice SAX parser built in: http://nokogiri.org/Nokogiri/XML/SAX.html

The SAX way of doing things is nice for large files simply because it doesn't build a giant DOM tree, which in your case is overkill; you can build up your own structures when events fire (for counting nodes, for example).

#4 (0 votes)

Check out Greg Weber's version of Paul Dix's sax-machine gem: http://blog.gregweber.info/posts/2011-06-03-high-performance-rb-part1

Parsing a large file with SaxMachine seems to load the whole file into memory

sax-machine makes the code much, much simpler; Greg's variant makes it stream.
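
A rough sketch of the declarative style sax-machine enables, assuming its standard DSL (element/elements and .parse); note that this plain form reads the whole document into memory, which is exactly the issue Greg's streaming variant addresses:

require 'sax-machine'

class Entry
  include SAXMachine
  element :cat
end

class Feed
  include SAXMachine
  element :totalcount
  elements :item, as: :entries, class: Entry
end

feed = Feed.parse(File.read('all.xml'))          # loads the whole file
icount = feed.entries.count { |e| e.cat.to_s.include?('partofcatname') }
puts "#{icount} of #{feed.totalcount} items match"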

#5 (0 votes)

You may like to try this out: https://github.com/amolpujari/reading-huge-xml

HugeXML.read xml, elements_lookup do |element|
  # => element { :name, :value, :attributes }
end

I also tried using Ox.
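
For reference, a rough sketch of the same count with Ox's SAX API (hedged: this assumes Ox::Sax and Ox.sax_parse, with element names arriving as symbols):

require 'ox'

class CatCounter < ::Ox::Sax
  attr_reader :count
  def initialize
    @count = 0
    @in_cat = false
  end
  def start_element(name)
    @in_cat = (name == :cat)
  end
  def text(value)
    @count += 1 if @in_cat && value.include?('partofcatname')
    @in_cat = false
  end
end

handler = CatCounter.new
File.open('all.xml') { |f| Ox.sax_parse(handler, f) }
puts handler.count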
