Parsing an XML file with Nokogiri to determine a path (Ruby)

Time: 2021-07-24 01:16:53

My code is supposed to "guess" the path(s) leading to the relevant text nodes in my XML file. "Relevant" here means text nodes nested within the recurring product/person/something tag, but not text nodes that appear outside of it.

This code:

    require 'nokogiri'

    @doc, items = Nokogiri.XML(@file), []

    path = []
    # traverse yields children before their parent, so names arrive deepest-first
    @doc.traverse do |node|
      next unless node.element?
      # an element is part of the path only if it contains other elements
      is_path_element = node.element_children.any?
      path.push(node.name) if is_path_element && !path.include?(node.name)
    end
    final_path = "/" + path.reverse.join("/")

works for simple XML files, for example:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <description>Some XML file description</description>
    <item>
      <title>Some product title</title>
      <brand>Some product brand</brand>
    </item>
    <item>
      <title>Some product title</title>
      <brand>Some product brand</brand>
    </item>
  </channel>
</rss>

puts final_path # => "/rss/channel/item"

But how should I approach files that are more complicated? For example, this one:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <description>Some XML file description</description>
    <item>
      <titles>
        <title>Some product title</title>
      </titles>
      <brands>
        <brand>Some product brand</brand>
      </brands>
    </item>
    <item>
      <titles>
        <title>Some product title</title>
      </titles>
      <brands>
        <brand>Some product brand</brand>
      </brands>
    </item>
  </channel>
</rss>

1 Answer

#1


If you are looking for a list of deepest "parent" paths in the XML, there is more than one way to view that.

Although I think your own code could be adjusted to achieve the same output, I was convinced the same thing could be done with XPath. My motivation is also to shake the rust off my XML skills (I have not used Nokogiri yet, but I will need to professionally soon). So here is how to get all parent paths that have just one level of child elements beneath them, using XPath (here xml is assumed to be the parsed Nokogiri document):

xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }

The output of this for your second example file is:

/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands

If you take this list, strip the positional indexes with gsub, and then make the array unique, the result looks a lot like the output of your loop:

paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
# uniq rather than uniq!, because uniq! returns nil when there is nothing to remove
paths = paths.map { |path| path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]

Or in one line:

paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
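
If what you actually want is the single recurring element (/rss/channel/item) rather than its container children, one possible follow-up step is to take the longest common prefix of the index-free paths. This is an extension of the answer, not something tested against real feeds. The helper name common_parent_path is hypothetical; it is plain Ruby and only looks at the path strings, so it does not need Nokogiri.

```ruby
# Hypothetical helper (not part of the original answer): returns the longest
# leading path shared by every entry in a list of absolute, index-free paths.
def common_parent_path(paths)
  return nil if paths.empty?

  # "/rss/channel/item/titles" -> ["rss", "channel", "item", "titles"]
  segment_lists = paths.map { |path| path.split('/').reject(&:empty?) }
  shortest = segment_lists.map(&:length).min

  shared = []
  shortest.times do |i|
    segment = segment_lists.first[i]
    # Stop at the first position where the paths disagree.
    break unless segment_lists.all? { |segments| segments[i] == segment }
    shared << segment
  end

  '/' + shared.join('/')
end

paths = ["/rss/channel/item/titles", "/rss/channel/item/brands"]
puts common_parent_path(paths) # prints "/rss/channel/item"
```

For the first example file the list has a single entry, so the helper simply returns that entry. Note that if a file mixes several different record types under one parent, the common prefix collapses to that parent, so grouping the paths first would probably be needed in that case.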
