Using R to scrape nested XML data

Hello, I'm quite new to R and I'm trying to scrape a website for some data. The problem is that the data is stored inconsistently.

Sometimes I see:

<div class = "text">   The text I want   </div>

And other times I see:

<div class = "text"><div class = "text">   The text I want   </div></div>

So far I'm using the XML package and the following R code:

doc = htmlTreeParse(url, useInternalNodes = T)
text = xpathSApply(doc, "//*/div[@class='text']", xmlValue)

The problem is that this code counts "The text I want" twice when it comes across the second example, because the XPath matches both the outer and the inner <div class = "text">. I only want to count it once because the text only appears once.

Any pointers are greatly appreciated!

2 Solutions

#1

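library(XML)
# Sample input: one plain div followed by one doubled (nested) div
xtext <- "<div class = \"text\">   The text I want   </div>
<div class = \"text\"><div class = \"text\">   The text I want   </div></div>"
doc <- htmlParse(xtext, asText = TRUE)
# text() matches only text nodes directly inside a div, so the outer wrapper adds nothing
xpathSApply(doc, "//*/div[@class='text']/text()")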

#[[1]]
#   The text I want    

#[[2]]
#   The text I want

#2

If you just want to count occurrences, then you should be able to find all nodes

all_text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)

and doubled nodes

doubled_text <- xpathSApply(doc, "//*/div[@class='text']/div[@class='text']", xmlValue)

then subtract the length of doubled_text from the length of all_text to get the true count.
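
To make the subtraction concrete, here is a minimal sketch that runs both queries on the two snippets from the question. The variable names `html`, `true_count` and `innermost` are my own, and the last query (a `not()` predicate that skips any div wrapping another div of the same class) is an alternative I am assuming fits the use case, not something taken from the answers above.

```r
library(XML)

# Sample markup from the question: one plain div and one doubled div.
html <- paste0(
  "<div class=\"text\">   The text I want   </div>",
  "<div class=\"text\"><div class=\"text\">   The text I want   </div></div>"
)
doc <- htmlParse(html, asText = TRUE)

# Every div with class "text", including the outer wrapper of the doubled case.
all_text <- xpathSApply(doc, "//div[@class='text']", xmlValue)

# Only the inner div of each doubled pair.
doubled_text <- xpathSApply(doc, "//div[@class='text']/div[@class='text']", xmlValue)

# Subtracting removes the extra match contributed by each wrapper.
true_count <- length(all_text) - length(doubled_text)

# Alternative (an assumption, not from the answers): keep only the divs that
# have no child div of the same class, then trim the surrounding whitespace.
innermost <- xpathSApply(doc, "//div[@class='text'][not(div[@class='text'])]", xmlValue)
innermost <- trimws(innermost)
```

For this sample, `true_count` should come out as 2, and `innermost` should hold the two trimmed strings. The subtraction only gives you a count, whereas the `not()` variant also returns the text itself, which may be more convenient if you need the values and not just how many there are.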
