I have a very complicated XML file that I need to parse and present as a data frame in R. Its structure is similar to the following example. The nodes are not all at the same nesting level.
<Root>
<A>
<info1>a</info1>
<child>
<info2>b</info2>
<info3>c</info3>
<info4>d</info4>
</child>
<info5>e</info5>
</A>
<B>
<info6>f</info6>
<info7>g</info7>
</B>
</Root>
I came up with some code to parse the file:
library(XML)
doc <- xmlParse(file = "sample.xml", useInternalNodes = TRUE)
rootnode <- xmlRoot(doc)
df1 <- xmlToDataFrame(nodes = getNodeSet(rootnode, "//Root/A"))
df2 <- xmlToDataFrame(nodes = getNodeSet(rootnode, "//Root/B"))
# cbind has no `all` argument (that belongs to merge), so it is dropped here
Final <- cbind.data.frame(df1, df2)
The result is returned as follows (all the values under the child node are collapsed together):
info1 child info5 info6 info7
a bcd e f g
However, the result I want is:
info1 info2 info3 info4 info5 info6 info7
a b c d e f g
Because the XML file contains a large number of nodes like the ones above, manipulating the data frame by hand is not practical.
I also tried changing the path to "//Root/A/child", but then the other values under node A and node B are lost. Could anyone offer a solution to this problem? Thanks in advance.
3 solutions
#1
2
One can try xmlToList and unlist to reduce the XML data to a named vector. The names can then be changed using gsub to match the OP's expectations:
library(XML)
result <- unlist(xmlToList(xmlParse(xml)))
#Change the name to refer only child
names(result) <- gsub(".*\\.(\\w+)$","\\1", names(result))
result
# info1 info2 info3 info4 info5 info6 info7
# "a" "b" "c" "d" "e" "f" "g"
Data:
xml <- "<Root>
<A>
<info1>a</info1>
<child>
<info2>b</info2>
<info3>c</info3>
<info4>d</info4>
</child>
<info5>e</info5>
</A>
<B>
<info6>f</info6>
<info7>g</info7>
</B>
</Root>"
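Since the question asks for a data frame and not a named vector, the vector above can be wrapped into a one-row data frame. The following is a minimal sketch, assuming leaf element names are unique across the document (duplicates would be renamed by data.frame's name checking). The variable names flat and df are illustrative and not part of the original answer, and the XML is written on one line only to keep the example short.

```r
library(XML)

# Same sample document as above, written compactly
xml <- "<Root><A><info1>a</info1><child><info2>b</info2><info3>c</info3><info4>d</info4></child><info5>e</info5></A><B><info6>f</info6><info7>g</info7></B></Root>"

# Flatten the nested list into a named character vector;
# names look like "A.info1", "A.child.info2", "B.info6"
flat <- unlist(xmlToList(xmlParse(xml, asText = TRUE)))

# Keep only the last path component of each name
names(flat) <- sub(".*\\.", "", names(flat))

# One list element per column gives a one-row data frame
df <- as.data.frame(as.list(flat), stringsAsFactors = FALSE)
```

Converting through as.list makes each named value its own column, which is the layout the question asks for.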
#2
0
For less structured XML, it is better to do the following:
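library(XML)
# rootnode is the root node from the question, i.e. xmlRoot(doc)
# recursive and use.names are arguments of unlist(), not data.frame()
Final <- data.frame(as.list(unlist(xmlToList(rootnode), recursive = TRUE, use.names = TRUE)))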
If you don't like the automatically generated column names, you can set use.names = FALSE and assign your own names.
#3
0
Match the nodes using starts-with()
> doc = xmlParse(xml)
> xpathSApply(doc, "//*[starts-with(name(), 'info')]", xmlValue)
[1] "a" "b" "c" "d" "e" "f" "g"
> xpathSApply(doc, "//*[starts-with(name(), 'info')]", xmlName)
[1] "info1" "info2" "info3" "info4" "info5" "info6" "info7"
So:
query <- "//*[starts-with(name(), 'info')]"
setNames(
xpathSApply(doc, query, xmlValue),
xpathSApply(doc, query, xmlName)
)
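When the leaf elements do not share a common prefix such as 'info', a different XPath predicate can be used. The sketch below is a variation on this answer and not the answerer's own code: it selects every element that has no child elements with "//*[not(*)]" and then builds a one-row data frame. It assumes leaf names are unique. The names leaf_query, values and df are illustrative.

```r
library(XML)

# Same sample document as in the question, written compactly
xml <- "<Root><A><info1>a</info1><child><info2>b</info2><info3>c</info3><info4>d</info4></child><info5>e</info5></A><B><info6>f</info6><info7>g</info7></B></Root>"
doc <- xmlParse(xml, asText = TRUE)

# "//*[not(*)]" matches every element that has no child elements (the leaves)
leaf_query <- "//*[not(*)]"

# Leaf values named by their element names, in document order
values <- setNames(
  xpathSApply(doc, leaf_query, xmlValue),
  xpathSApply(doc, leaf_query, xmlName)
)

# One column per leaf element
df <- as.data.frame(as.list(values), stringsAsFactors = FALSE)
```

This avoids hard-coding the element name prefix, at the cost of also picking up any empty elements in the document.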