Converting thousands of XML files to CSV, or analyzing them, in R

Time: 2022-10-08 19:49:14

I'm an R data scientist who is used to flat CSV files, but I was sent a large dataset (12GB) that is just hundreds of thousands of XML files. I'd like to know how I can stitch these XML files together into a CSV or something else I can analyze with R.

The config.txt has some terms I'm unfamiliar with, which I'll paste below in the hope that it helps:

# parameters

THRESHOLD       0.7
REMOVE_STOPWORDS    true
WRITE_MERGED_FILES  true
OUTPUT_STATS        true
SIMILARITY      jaccard
OPERATOR        or
N_GRAMS         3
PAGE_LIMIT      1
BUILD_INDEX             false


# matching features to use

MATCH_TITLE     true
MATCH_AUTHORS       false
MATCH_PAGE_COUNT    false
MATCH_VENUE     false

# paths

# SOLR url
BASE_URL        http://davos2.ist.psu.edu:8983/solr/collection1

# DBLP xml file
DBLP_PATH       input/dblp.xml

# file containing the paths for the CiteSeer xml files
CITESEER_PATH       input/citeseerx-pub.rev.txt

There is also a hits.txt file, which is a 160MB file with lines like this:

doi                       hits            time                     

10.1.1.1.1484             12              2.207                    
10.1.1.1.1485             4               0.307   

I'm sure this is some kind of standard format, but I can't seem to find out how to get it into R. Here's the reference paper:

http://www.cse.unt.edu/~ccaragea/papers/ecir14.pdf

The XML files contain hierarchically structured data on CiteSeer articles.

Thank you. I'm happy to provide more information.

1 solution

#1



I'm not familiar with the R project, and I don't recognise that format. If you need to get the XML files into CSV format yourself, you will most likely need to write a script or some code for it, since XML files are typically hierarchical and CSV files have a flat structure. To represent the XML structure in CSV, you can do:

<tag>A</tag>
    <next>B</next>
        <value>C</value>
    <value>D</value>

to file.csv
A,B,C
A,B,D


That is, repeat the parent data on every row. If you know a scripting or programming language, I can suggest more. If you need a ready-made tool, I'd search online. Hopefully someone else recognises the data format.
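
To make the "repeat the data" idea concrete, here is a minimal sketch in Python using only the standard library. It assumes a properly nested version of the toy example above (the `SAMPLE` string and the function names `leaf_rows` and `xml_to_csv` are made up for illustration), not the real CiteSeer files, whose element names I don't know.

```python
import csv
import io
import xml.etree.ElementTree as ET


def leaf_rows(element, prefix=()):
    """Yield one tuple per leaf element, repeating the ancestors' text."""
    text = (element.text or "").strip()
    path = prefix + ((text,) if text else ())
    children = list(element)
    if not children:
        yield path
        return
    for child in children:
        yield from leaf_rows(child, path)


def xml_to_csv(xml_text):
    """Flatten one XML document into CSV text, one row per leaf."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in leaf_rows(root):
        writer.writerow(row)
    return out.getvalue()


# Hypothetical, well-formed version of the toy example.
SAMPLE = """<tag>A
    <next>B
        <value>C</value>
        <value>D</value>
    </next>
</tag>"""

print(xml_to_csv(SAMPLE), end="")
```

For many files, the same function could be called once per file and the rows appended to a single CSV, which R can then load with `read.csv`. Rows may come out with different lengths if the real documents are irregular, so a real converter would probably map specific element names to fixed columns instead.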

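Separately, the hits.txt sample in the question looks like plain whitespace-separated columns under a header row. A minimal sketch, assuming that layout holds for the whole file (the function name `hits_to_csv` and the embedded `SAMPLE` are illustrative, copied from the lines shown in the question):

```python
import csv
import io


def hits_to_csv(lines):
    """Convert whitespace-separated lines to CSV text, skipping blank lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        fields = line.split()
        if fields:
            writer.writerow(fields)
    return out.getvalue()


# Sample lines as they appear in the question.
SAMPLE = """doi                       hits            time

10.1.1.1.1484             12              2.207
10.1.1.1.1485             4               0.307
"""

print(hits_to_csv(SAMPLE.splitlines()), end="")
```

For the real file, an open file handle can be passed in place of `SAMPLE.splitlines()`. If the layout assumption holds, R's `read.table("hits.txt", header = TRUE)` may also read the file directly without any conversion.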
