Parsing encoded tags in a Ruby XML document using Nokogiri and regex

I am trying to parse XML that has tags embedded within tags, like this one, using Nokogiri and Ruby:

<seg>Trennmesser <ph>&lt;I.FIGREF ITEM=&quot;3&quot; FORMAT=&quot;PARENTHESIS&quot;&gt;</ph><bpt i="1">&lt;I.FIGTARGET TARGET=&quot;CIADDAJA&quot;&gt;</bpt><ept i="1">&lt;/I.FIGREF&gt;</ept></seg>

In this case I only need the word "Trennmesser", not anything inside the embedded tags.

In this second example:

<seg>Hilfsmittel <ph>&lt;F34@Z7@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen    Beschleunigerwalze <ph>&lt;F34@Z7@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>

The words between a closing </ph> tag and the next opening <ph> tag are also of interest, so the regex would need to extract the string "Hilfsmittel 0,5mm zwischen Beschleunigerwalze und Trennmesser schieben." and discard everything else.

I have also uploaded a part of the document here:
http://pastebin.com/Q8CdnASz

2 Solutions

#1

Try this in irb

require 'nokogiri'
x = Nokogiri::XML.parse('<seg>Hilfsmittel <ph>&lt;F34@Z7@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen    Beschleunigerwalze <ph>&lt;F34@Z7@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>')
# Keep only the non-element (text) children of <seg> and join their content.
x.xpath('//seg').children.reject { |node| node.element? }.map(&:content).join

For me this outputs:

=> "Hilfsmittel X = 0,5mm zwischen    Beschleunigerwalze D und Trennmesser schieben."

The idea here is that we iterate over the children of the <seg> tag, rejecting the ones that are themselves elements (<ph>), which should leave only the text nodes. Take the resulting array and join the text of those nodes together into one string.

Note that the output is slightly different from what you described, because there is an additional X and D between two pairs of the <ph> tags.
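
The output also keeps the run of spaces from the source document. If single spaces are wanted, as in the string you described, a plain-Ruby sketch like the following could collapse them. It does not remove the X and D, since which stray tokens to drop depends on your data; `raw` and `normalized` are illustrative names.

```ruby
# The string produced by the snippet above.
raw = 'Hilfsmittel X = 0,5mm zwischen    Beschleunigerwalze D und Trennmesser schieben.'

# Collapse every run of whitespace into one space and trim the ends.
normalized = raw.gsub(/\s+/, ' ').strip
puts normalized
# => Hilfsmittel X = 0,5mm zwischen Beschleunigerwalze D und Trennmesser schieben.
```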

#2

The content inside the <ph> tags has been encoded to preserve the reserved characters < and >.
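
To see what that encoding hides, here is a small sketch that decodes one of the <ph> payloads by hand using `CGI.unescapeHTML` from Ruby's standard library. It is only meant to show the decoded markup; the names `encoded` and `decoded` are illustrative.

```ruby
require 'cgi'

# The raw, still-encoded payload of the first <ph> element.
encoded = '&lt;I.FIGREF ITEM=&quot;3&quot; FORMAT=&quot;PARENTHESIS&quot;&gt;'

# Turn &lt;, &gt; and &quot; back into <, > and ".
decoded = CGI.unescapeHTML(encoded)
puts decoded
# => <I.FIGREF ITEM="3" FORMAT="PARENTHESIS">
```

Nokogiri does the same decoding for you when you ask a node for its `content`.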

A clean way to deal with this is to let Nokogiri reparse those chunks back into XML:

require 'nokogiri'

doc = Nokogiri::XML('<seg>Trennmesser <ph>&lt;I.FIGREF ITEM=&quot;3&quot; FORMAT=&quot;PARENTHESIS&quot;&gt;</ph><bpt i="1">&lt;I.FIGTARGET TARGET=&quot;CIADDAJA&quot;&gt;</bpt><ept i="1">&lt;/I.FIGREF&gt;</ept></seg>')

ph = Nokogiri::XML::DocumentFragment.parse(doc.at('seg ph').content)
puts ph.to_xml

This outputs the following node, showing that Nokogiri recreated the fragment correctly:

<I.FIGREF ITEM="3" FORMAT="PARENTHESIS"/>

To extract the text inside the <seg> tag:

doc.at('//seg/text()').text
=> "Trennmesser "

When dealing with HTML or XML, it's never good to presuppose that regex will be the best path to extracting something. Both HTML and XML are too irregular and "flexible" (where flexible means it's often irritatingly malformed or defined in totally unique and unexpected ways).

To get the full content inside the <seg> tag in the second question:

require 'nokogiri'

doc = Nokogiri::XML('<seg>Hilfsmittel <ph>&lt;F34@Z7@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen    Beschleunigerwalze <ph>&lt;F34@Z7@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>')

seg = Nokogiri::XML::DocumentFragment.parse(doc.at('seg').content)
puts seg.content

Which outputs:

Hilfsmittel @Z7@Lge>X = 0,5mm zwischen    Beschleunigerwalze @Z7@Lge>D und Trennmesser schieben.
