How can I parse a Wikipedia XML dump into one document per line?

Date: 2023-01-15 00:07:15

For a project, I need to convert a Wikipedia XML dump into a plain-text corpus file with one document per line. I have found several tools that split the XML dump into many separate files, but that is not the format I need, and I fear that managing millions of small files would put unnecessary load on my already slow HDD.


Any suggestions for good programs to do this?


1 Solution

#1



You could use any streaming XML parser to read the dump page by page, strip line breaks from the page text and print it out. If you told us what language(s) you're using, we might be able to offer more specific suggestions.


(If you're using Perl, I've seen many people recommend the XML::Twig module, but even plain old XML::Parser can do it just fine.)

