Hello, I'm quite new to R and I'm trying to scrape a website for some data. The problem is that the data is stored inconsistently.
Sometimes I see:
<div class = "text"> The text I want </div>
And other times I see:
<div class = "text"><div class = "text"> The text I want </div></div>
So far I'm using the XML package and the following R code:
library(XML)
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)
The problem is that this code counts "The text I want" twice when it comes across the second example, because the XPath matches both the outer
and the inner <div class = "text"> element, and xmlValue returns the nested text for each. I only want to count it once because it only appears once.
Any pointers are greatly appreciated!
2 solutions
#1
2
xtext <- "<div class = \"text\"> The text I want </div>
<div class = \"text\"><div class = \"text\"> The text I want </div></div>"
doc <- htmlParse(xtext)
xpathSApply(doc,"//*/div[@class='text']/text()")
#[[1]]
# The text I want
#[[2]]
# The text I want
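To spell out why this works: text() selects only the text nodes that are direct children of a matching div, so the outer wrapper div, which has no text of its own, contributes nothing. Here is a minimal sketch that also returns a plain character vector instead of a list of nodes. The sample HTML string and the variable names are my own, and trimming the whitespace is an assumption about what you want.

```r
library(XML)

# Sample markup covering both cases from the question (my own test string)
html <- '<div class="text"> The text I want </div>
<div class="text"><div class="text"> The text I want </div></div>'
doc <- htmlParse(html, asText = TRUE)

# Select the direct text children of each matching div and extract their values
raw_text <- xpathSApply(doc, "//div[@class='text']/text()", xmlValue)

# Strip the surrounding spaces from each string
texts <- trimws(raw_text)
print(texts)
print(length(texts))
```

Note that this would still pick up text from both levels if the outer div ever carried text of its own, so check it against the real pages.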
#2
2
If you just want to count occurrences, then you should be able to find all matching nodes
all_text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)
and the doubled (nested) nodes
doubled_text <- xpathSApply(doc, "//*/div[@class='text']/div[@class='text']", xmlValue)
then subtract the number of doubled nodes from the number of all nodes to get the true count.
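Put together, the subtraction could look like the sketch below. The sample HTML and the names n_all, n_doubled and n_true are mine. I have also added an alternative that is not part of the answer above: an XPath predicate that keeps only the divs with no nested div of the same class, which should give you the texts directly. It assumes the nesting is always a direct parent/child pair as in your example.

```r
library(XML)

# Sample markup covering both cases from the question (my own test string)
html <- '<div class="text"> The text I want </div>
<div class="text"><div class="text"> The text I want </div></div>'
doc <- htmlParse(html, asText = TRUE)

# Every div with class "text": the single one, the outer one and the inner one
all_text <- xpathSApply(doc, "//div[@class='text']", xmlValue)

# Only the divs nested directly inside another div with class "text"
doubled_text <- xpathSApply(doc, "//div[@class='text']/div[@class='text']", xmlValue)

n_all <- length(all_text)
n_doubled <- length(doubled_text)
n_true <- n_all - n_doubled
print(n_true)

# Alternative: keep only the innermost divs, skipping any wrapper div
innermost <- xpathSApply(doc, "//div[@class='text'][not(div[@class='text'])]", xmlValue)
print(trimws(innermost))
```

The subtraction only gives you a count, while the predicate version returns the values themselves, which is probably more useful if you need the text later on.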