Using XPath 1.0, how can multiple anonymous functions operate on the extracted content?

Time: 2022-10-07 23:34:28

With R, httr and XML you can scrape this site; the relevant HTML code is below.

doc <- htmlTreeParse("http://www.mehaffyweber.com/Firm/Offices/", useInternal = TRUE)

<div id="content">
<img id="printLogo" style="padding-bottom:30px" src="/images/logo_print.jpg">
<div id="contentTitle">
<div style="height: 30px;">
<h1>Offices</h1>
<h3>Beaumont Location:</h3>
<p>
<p>
<br>
<h3>
<strong>Houston Location:</strong>
</h3>
<p>
<p>
<h3>
<strong>Austin Location:</strong>
</h3>

To extract only the cities where this company has offices, this XPath 1.0 code works:

(string <- xpathSApply(doc, "//h3", function(x) {
  gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE))})) 

I tried to paste the state to the city with a second anonymous function but failed:

> (string <- xpathSApply(doc, "//h3", function(x) {
+   gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE))} &&
+     function(x) {paste0(xmlValue(x), " , TX")}))
Error in { : invalid 'x' type in 'x && y'

A simpler attempt, without repeating function(x), failed with the same error:

> (string <- xpathSApply(doc, "//h3", function(x) {
+   gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE)) &&
+     paste0(xmlValue(x), " , TX")}))
Error in gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE)) && paste0(xmlValue(x),  : 
  invalid 'x' type in 'x && y'

DESIRED OUTPUT: How might I combine both anonymous functions and create this string?

[1] "Beaumont, TX" "Houston, TX" "Austin, TX"

3 Solutions

#1


A couple of things. htmlParse is shorthand for htmlTreeParse(..., useInternal = TRUE). The document also has encoding problems, so fetching it with RCurl's getURL and declaring UTF-8 explicitly helps get rid of the stray characters (such as Â) you are encountering.

library(XML)
library(RCurl)
appHTML <- getURL("http://www.mehaffyweber.com/Firm/Offices/"
                  , .encoding = "UTF-8")
doc <- htmlParse(appHTML, encoding = "UTF-8")

xpathSApply is shorthand for two operations. It first applies the XPath expression to the doc and collects the matching nodes. It then calls the user-supplied function on each of these nodes. The x passed to the function is essentially one element of the output from:

getNodeSet(doc, "//h3")

or in shorthand

doc["//h3"]

Each element of doc["//h3"] is an internal XML node:

> str(doc['//h3'])
List of 3
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 $ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 - attr(*, "class")= chr "XMLNodeSet"

So the x in your function is just like an element of doc["//h3"], which means you can experiment with doc["//h3"][[1]] directly:

x <- doc['//h3'][[1]]
temp <- gsub("\\WLocation:", "", xmlValue(x))
paste0(temp, ", TX")
[1] "Beaumont, TX"

Then you can apply this logic in your function:

xpathSApply(doc, "//h3", function(x){
  temp <- gsub("\\WLocation:", "", xmlValue(x))
  paste0(temp, ", TX")
})

[1] "Beaumont, TX"         "Houston, TX"          "Austin, TX"
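
As a rough sketch of how the two steps can stay separate functions and still run in a single xpathSApply call: && is R's logical AND operator, so it cannot chain functions, but calling one function on the result of the other can. The inline HTML and the helper names clean_city and add_state are assumptions made for illustration; the HTML stands in for the real page so the example runs offline.

```r
library(XML)

# Hypothetical stand-in for the real page, so the sketch runs offline.
sample_html <- '<html><body><div id="content">
<h3>Beaumont Location:</h3>
<h3><strong>Houston Location:</strong></h3>
<h3><strong>Austin Location:</strong></h3>
</div></body></html>'
doc <- htmlParse(sample_html, asText = TRUE)

# Step 1 (hypothetical helper): node -> city name.
clean_city <- function(node) gsub("\\WLocation:", "", xmlValue(node))

# Step 2 (hypothetical helper): city name -> "City, TX".
add_state <- function(city) paste0(city, ", TX")

# Compose the two steps inside one anonymous function.
one_call <- xpathSApply(doc, "//h3", function(x) add_state(clean_city(x)))

# The same work in two explicit steps: fetch the nodes, then apply.
nodes <- getNodeSet(doc, "//h3")
two_steps <- sapply(nodes, function(x) add_state(clean_city(x)))

print(one_call)
print(two_steps)
```

Keeping the helpers separate makes each step easy to test on a single node, such as doc["//h3"][[1]], before running it over the whole node set.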

#2


If you're willing to use rvest and stringr it's a pretty simple solution:

library(rvest)
library(stringr)

# read_html() is the replacement for html(), which is deprecated in newer rvest releases
pg <- read_html("http://www.mehaffyweber.com/Firm/Offices/")

found <- pg %>%
  html_nodes("#content") %>% 
  html_text() %>% 
  str_match_all("([[:alpha:]]+), Texas")  

sprintf("%s, TX", found[[1]][,2])

## [1] "Beaumont, TX" "Houston, TX"  "Austin, TX"  
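
The same pattern-matching idea can be tried without a network connection. This is a minimal sketch that swaps stringr's str_match_all for base R's gregexpr and regmatches, so it is not the code from the answer above. The variable page_text is made-up text standing in for the page content; it does not contain the site's real addresses.

```r
# Made-up text standing in for html_text() output (an assumption).
page_text <- paste(
  "Beaumont Location: 1 Example St, Beaumont, Texas 00000",
  "Houston Location: 2 Example Ave, Houston, Texas 00000",
  "Austin Location: 3 Example Rd, Austin, Texas 00000"
)

# Match each run of letters that is immediately followed by ", Texas".
# The lookahead (?=...) requires perl = TRUE.
hits <- gregexpr("[[:alpha:]]+(?=, Texas)", page_text, perl = TRUE)
cities <- regmatches(page_text, hits)[[1]]

result <- sprintf("%s, TX", cities)
print(result)
```

The lookahead keeps ", Texas" out of the match, so no capture group or matrix indexing is needed afterwards.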

#3


You can use the following to get your desired result.

string <- xpathSApply(doc, '//h3', function(x) {
        paste0(sub('^([A-Z][a-z]+).*', '\\1', xmlValue(x)), ', TX')
})
# [1] "Beaumont, TX" "Houston, TX"  "Austin, TX"  
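
To see what the regular expression in this answer keeps, it can be run on plain strings. This is a small sketch, assuming heading text like the strings below. The "San Antonio" entry is a made-up case added to show a limitation; it is not an office listed in the question.

```r
# Heading text as plain strings; the last entry is hypothetical.
headings <- c("Beaumont Location:", "Houston Location:",
              "Austin Location:", "San Antonio Location:")

# Keep only the first capitalized word of each heading.
cities <- sub("^([A-Z][a-z]+).*", "\\1", headings)
labels <- paste0(cities, ", TX")
print(labels)
```

Because the pattern keeps only the first capitalized word, a multi-word city name is truncated ("San Antonio" becomes "San"). Stripping the "Location:" suffix with gsub, as in the first answer, does not have that limitation.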
