Parsing an RSS feed using the XML package in R

Time: 2021-09-10 00:25:29

I am trying to scrape and parse the following RSS feed: http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml. I have looked at other questions about R and XML but have been unable to make any progress on my problem. The XML for each entry looks like this:

  <item>
     <title><![CDATA[Five Rockets Intercepted By Iron Drone Systems Over Be'er Sheva]]></title>
     <link>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</link>
     <description><![CDATA[<a href="http://www.haaretz.com/news/diplomacy-defense/live-blog-rockets-strike-tel-aviv-area-three-israelis-killed-in-attack-on-south-1.477960" target="_hplink">Haaretz reports</a> that five more rockets intercepted by Iron Dome systems over Be'er Sheva. In total, there have been 274 rockets fired and 105 intercepted. The IDF has attacked 250 targets in Gaza.]]></description>
     <guid>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</guid>
     <pubDate>2012-11-15T12:56:09-05:00</pubDate>
     <source url="http://huffingtonpost.com/rss/liveblog/liveblog-1213.xml">Huffingtonpost.com</source>
  </item>

For each entry/post I want to record the date (pubDate), the title (title), and the description (full text, cleaned). I have tried to use the XML package in R, but I confess I am a bit of a newbie (little to no experience working with XML, but some R experience). The code I am working from, and getting nowhere with, is:

library(XML)

xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"

# Use the xmlTreeParse function to parse the XML file directly from the web
xmlfile <- xmlTreeParse(xml.url)

# Use the xmlRoot function to access the top node
xmltop <- xmlRoot(xmlfile)

xmlName(xmltop)

names(xmltop[[1]])

  title          link   description      language     copyright 
  "title"        "link" "description"    "language"   "copyright" 
 category     generator          docs          item          item 
  "category"   "generator"        "docs"        "item"        "item"

However, whenever I try to manipulate the "title" or "description" information, I continually get errors. Any help troubleshooting this code would be most appreciated.

Thanks, Thomas


1 Answer

#1



I am using the excellent RCurl library and xpathSApply from the XML package.

This script gives you three character vectors (titles, descriptions, and pubdates):

library(RCurl)
library(XML)

xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"
script <- getURL(xml.url)   # download the raw feed as a single string
doc    <- xmlParse(script)  # parse the string into an XML document

# Extract the text of each field from every <item> with XPath
titles       <- xpathSApply(doc, '//item/title', xmlValue)
descriptions <- xpathSApply(doc, '//item/description', xmlValue)
pubdates     <- xpathSApply(doc, '//item/pubDate', xmlValue)
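
The question also asks for the description as cleaned text and for one record per post. A minimal sketch of one way to get there is below; it is not part of the original answer. It assumes the same XPath extraction, but runs on a small inline sample (`sample_rss`, a hypothetical, abbreviated stand-in for the live feed) so it works offline. The regex tag stripping and the `feed` data frame are illustrative choices; a regex is only a rough cleaner and may not cope with unusual HTML.

```r
library(XML)

# Hypothetical, abbreviated sample modelled on the <item> in the question
sample_rss <- '<rss version="2.0"><channel>
  <title>Sample feed</title>
  <item>
    <title><![CDATA[Five Rockets Intercepted]]></title>
    <description><![CDATA[<a href="http://example.com/a" target="_hplink">Haaretz reports</a> that five more rockets were intercepted.]]></description>
    <pubDate>2012-11-15T12:56:09-05:00</pubDate>
  </item>
</channel></rss>'

doc <- xmlParse(sample_rss, asText = TRUE)

titles       <- xpathSApply(doc, "//item/title", xmlValue)
descriptions <- xpathSApply(doc, "//item/description", xmlValue)
pubdates     <- xpathSApply(doc, "//item/pubDate", xmlValue)

# Strip HTML tags from the description text (rough, regex-based cleaning)
clean_descriptions <- trimws(gsub("<[^>]+>", "", descriptions))

# Drop the colon in the UTC offset ("-05:00" becomes "-0500") so that %z
# can read it, then parse the timestamp as a date-time stored in UTC
offset_fixed <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", pubdates)
dates <- as.POSIXct(offset_fixed, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")

# One row per post
feed <- data.frame(date = dates,
                   title = titles,
                   description = clean_descriptions,
                   stringsAsFactors = FALSE)
```

To run this against the real feed, replace `sample_rss` with the string returned by `getURL(xml.url)`; the rest should carry over as long as every item has all three fields.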
