Converting XML to a data frame

Time: 2023-01-15 09:36:24

I am looking to transfer data from a .php link into an R data frame, but am having trouble doing so.

Attempts thus far:

require(XML)
data <- xmlParse("http://www.mahdial-husseini.com/xmlthing.php ")
xml_data <- xmlToList(data)

The error I am getting: Error: 1: failed to load HTTP resource

Additionally (and more conceptually), I don't quite understand the nature of the link. Is this XML data in a php file, and if so, when using R to gather data, do I treat it as XML or PHP? Thank you

2 Solutions

#1


Or, possibly something readable:

library(xml2)
library(tidyverse)

This will help make better column names:

mcga <- function(tbl) {
  x <- colnames(tbl)
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  colnames(tbl) <- x
  tbl
} 
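
To see what the name clean-up does, here is a small sketch that applies the same steps to a plain character vector. The input names are made up for illustration and are not taken from the actual feed.

```r
# Hypothetical messy column names (not from the real data)
x <- c("Site Name", "PM2.5 (hourly)", "AOD.47", "AOD 47")

x <- tolower(x)                              # lower-case everything
x <- gsub("[[:punct:][:space:]]+", "_", x)   # runs of punctuation/space become "_"
x <- gsub("_+", "_", x)                      # collapse repeated underscores
x <- gsub("(^_|_$)", "", x)                  # trim leading/trailing underscore
x <- make.unique(x, sep = "_")               # de-duplicate clashing names

x
## [1] "site_name"    "pm2_5_hourly" "aod_47"       "aod_47_1"
```

The last two inputs collapse to the same name, which is why the make.unique() step is there.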

The column types get figured out automagically, but it's nice to define them explicitly once they are known, since that helps with data consistency:

cols(
  .default = col_integer(),
  site = col_character(),
  aod_47 = col_double(),
  omi_aot = col_double(),
  omi_no2 = col_double(),
  fit = col_double(),
  lng = col_double(),
  lat = col_double()
) -> xdf_cols

Now the work:

doc <- read_xml("http://www.mahdial-husseini.com/xmlthing.php")

xml_find_all(doc, ".//PPM1_0") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga() %>% 
  type_convert(col_types = xdf_cols) -> xdf

The type_convert() call isn't strictly necessary, but together with the column definitions it makes for consistent results.
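
If the URL is unreachable, the same idea can be tried offline. The following is a minimal sketch that assumes the feed is a root PPM1_0 element containing child PPM1_0 elements whose data live in attributes; the inline XML string is a hypothetical stand-in, and base R's type.convert() is swapped in for readr's type_convert().

```r
library(xml2)

# Hypothetical stand-in for the document served by the URL
xml_txt <- '<PPM1_0>
  <PPM1_0 Sample="11111" Site="Atlanta" AOD_47="0.379"/>
  <PPM1_0 Sample="11113" Site="Savannah" AOD_47="0.181"/>
</PPM1_0>'

doc   <- read_xml(xml_txt)
nodes <- xml_find_all(doc, ".//PPM1_0")   # child records only, not the root

# xml_attrs() on a nodeset returns one named character vector per node
attrs <- xml_attrs(nodes)

# One single-row data frame per record, then stack them
rows <- lapply(attrs, function(a) as.data.frame(as.list(a), stringsAsFactors = FALSE))
df   <- do.call(rbind, rows)

names(df) <- tolower(names(df))
df[] <- lapply(df, type.convert, as.is = TRUE)   # guess integer/double/character

str(df)
```

Everything arrives as character because XML attributes are strings, which is why a type-conversion step of some kind is needed at the end.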

And, the results:

xdf
## # A tibble: 8 x 21
##   sample        site month   day  year  hour      jd   doy pm25_hourly aod_47 omi_aot omi_no2      fit   res
##    <int>       <chr> <int> <int> <int> <int>   <int> <int>       <int>  <dbl>   <dbl>   <dbl>    <dbl> <int>
## 1      0      duluth     0     0     0     0       0     0           0  0.000   0.000   0.000  0.00000     0
## 2     19        <NA>    12     0  2004     5       0     0          30  0.000   0.000   0.000  0.00000     0
## 3   4545    Sarasota     4     0  2017     0       0     0           0  0.000   0.000   0.000  0.00000     0
## 4  11111     Atlanta    10     1  2004    13 2453280   275          23  0.379   0.148   0.274 16.01850    NA
## 5  11112  Birmingham    10     2  2008    14 2453281   276           0  0.000   0.000   0.000 19.19440     0
## 6  11113    Savannah    10     3  2004    13 2453282   277          15  0.181   0.133   0.127  9.00433    NA
## 7  11114   Fort Knox     6    20  2017    21       0   301          18  0.000   0.000   0.000  0.00000     0
## 8  63738 Fort Rucker     1     0  2015     0       0     0          40  0.000   0.000   0.000  0.00000     0
## # ... with 7 more variables: lng <dbl>, lat <dbl>, rel_humid <int>, altitude <int>, pressure <int>,
## #   signal_received <int>, temp_c <int>

Full structure:

glimpse(xdf)
## Observations: 8
## Variables: 21
## $ sample          <int> 0, 19, 4545, 11111, 11112, 11113, 11114, 63738
## $ site            <chr> "duluth", NA, "Sarasota", "Atlanta", "Birmingham", "Savan...
## $ month           <int> 0, 12, 4, 10, 10, 10, 6, 1
## $ day             <int> 0, 0, 0, 1, 2, 3, 20, 0
## $ year            <int> 0, 2004, 2017, 2004, 2008, 2004, 2017, 2015
## $ hour            <int> 0, 5, 0, 13, 14, 13, 21, 0
## $ jd              <int> 0, 0, 0, 2453280, 2453281, 2453282, 0, 0
## $ doy             <int> 0, 0, 0, 275, 276, 277, 301, 0
## $ pm25_hourly     <int> 0, 30, 0, 23, 0, 15, 18, 40
## $ aod_47          <dbl> 0.000, 0.000, 0.000, 0.379, 0.000, 0.181, 0.000, 0.000
## $ omi_aot         <dbl> 0.000, 0.000, 0.000, 0.148, 0.000, 0.133, 0.000, 0.000
## $ omi_no2         <dbl> 0.000, 0.000, 0.000, 0.274, 0.000, 0.127, 0.000, 0.000
## $ fit             <dbl> 0.00000, 0.00000, 0.00000, 16.01850, 19.19440, 9.00433, 0...
## $ res             <int> 0, 0, 0, NA, 0, NA, 0, 0
## $ lng             <dbl> 84.1000, 63.6167, -82.5300, -84.7000, -86.8000, -81.1000,...
## $ lat             <dbl> 34.0000, 38.4161, 27.3300, 33.7500, 33.5200, 32.0800, 37....
## $ rel_humid       <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ altitude        <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ pressure        <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ signal_received <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ temp_c          <int> 0, 0, 0, 0, 0, 0, 0, 0

#2


You can use the rvest package (and data.table for convenience):

library(data.table)
library(rvest)
a <- read_html("http://www.mahdial-husseini.com/xmlthing.php")
dt <- rbindlist(lapply(a %>% html_nodes(css = "body > ppm1_0 > ppm1_0") %>% 
                             xml_attrs(), 
                       function(x) as.data.table(t((x)))))
dt <- cbind(dt[,2, with = FALSE], 
            as.data.table(lapply(dt[,-2, with = FALSE], as.numeric)))
dt

          site sample month day year hour      jd doy pm25_hourly aod_47
1:      duluth      0     0   0    0    0       0   0           0  0.000
2:                 19    12   0 2004    5       0   0          30  0.000
3:    Sarasota   4545     4   0 2017    0       0   0           0  0.000
4:     Atlanta  11111    10   1 2004   13 2453280 275          23  0.379
5:  Birmingham  11112    10   2 2008   14 2453281 276           0  0.000
6:    Savannah  11113    10   3 2004   13 2453282 277          15  0.181
7:   Fort Knox  11114     6  20 2017   21       0 301          18  0.000
8: Fort Rucker  63738     1   0 2015    0       0   0          40  0.000
   omi_aot omi_no2      fit res      lng     lat rel_humid altitude pressure
1:   0.000   0.000  0.00000   0  84.1000 34.0000         0        0        0
2:   0.000   0.000  0.00000   0  63.6167 38.4161         0        0        0
3:   0.000   0.000  0.00000   0 -82.5300 27.3300         0        0        0
4:   0.148   0.274 16.01850  NA -84.7000 33.7500         0        0        0
5:   0.000   0.000 19.19440   0 -86.8000 33.5200         0        0        0
6:   0.133   0.127  9.00433  NA -81.1000 32.0800         0        0        0
7:   0.000   0.000  0.00000   0 -85.9500 37.9100         0        0        0
8:   0.000   0.000  0.00000   0 -85.7000 31.3400         0        0        0
   signal_received temp_c
1:               0      0
2:               0      0
3:               0      0
4:               0      0
5:               0      0
6:               0      0
7:               0      0
8:               0      0
