Is there a solution to parse the Wikipedia XML dump file in Java?

Time: 2023-01-15 00:07:57

I am trying to parse this huge 25 GB+ Wikipedia XML dump file. Any solution that would help is appreciated, preferably one in Java.

8 solutions

#1


7  

A Java API to parse Wikipedia XML dumps: WikiXMLJ (last updated in Nov 2010).
There is also a live mirror that is Maven-compatible and includes some bug fixes.

#2


4  

Of course it's possible to parse huge XML files with Java, but you should use the right kind of XML parser - for example a SAX parser, which processes the data element by element, rather than a DOM parser, which tries to load the whole document into memory.

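For illustration only, here is a minimal SAX sketch using the standard javax.xml.parsers API. The element names (page, title) follow the MediaWiki export format, and the file name is just a placeholder for your dump.

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class WikiDumpSaxDemo {
    public static void main(String[] args) throws Exception {
        // The SAX parser pushes events to the handler; only the current element is kept in memory.
        SAXParserFactory.newInstance().newSAXParser().parse(
            new File("enwiki-latest-pages-articles.xml"),   // placeholder path
            new DefaultHandler() {
                private final StringBuilder title = new StringBuilder();
                private boolean inTitle = false;
                private long pages = 0;

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    if ("page".equals(qName)) pages++;
                    if ("title".equals(qName)) { inTitle = true; title.setLength(0); }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (inTitle) title.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ("title".equals(qName)) { inTitle = false; System.out.println(title); }
                }

                @Override
                public void endDocument() {
                    System.out.println("Total pages: " + pages);
                }
            });
    }
}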

It's impossible to give you a complete solution because your question is very general and superficial - what exactly do you want to do with the data?

#3


3  

Here is an active Java project that can be used to parse Wikipedia XML dump files:
http://code.google.com/p/gwtwiki/. There are many examples of Java programs that transform Wikipedia XML content into HTML, PDF, text, etc.: http://code.google.com/p/gwtwiki/wiki/MediaWikiDumpSupport

Massi

#4


2  

Yep, right. Do not use DOM. If you only want to read a small amount of data and store it in your own POJOs, you can also use an XSLT transformation.

Transform the data into XML, which is then converted to POJOs using Castor/JAXB (XML-to-object binding libraries).

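As a rough, hypothetical sketch of the XML-to-POJO route with JAXB (javax.xml.bind, bundled with older JDKs or added as a dependency on newer ones): the Page class and its fields below are made up for the example and are not the real dump schema.

import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbDemo {

    // Hypothetical POJO; a real dump <page> element has more fields (id, revision, ...).
    @XmlRootElement(name = "page")
    public static class Page {
        public String title;
        public String text;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<page><title>Java</title><text>Java is a programming language.</text></page>";
        Page page = (Page) JAXBContext.newInstance(Page.class)
                                      .createUnmarshaller()
                                      .unmarshal(new StringReader(xml));
        System.out.println(page.title);
    }
}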

Please share how you solved the problem so others can benefit from a better approach.

thanks.

--- EDIT ---

Check the links below for a better comparison of the different parsers. It seems that StAX is better because it gives you control over the parsing and pulls data from the parser only when needed.

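For example, here is a minimal StAX sketch with the standard javax.xml.stream pull API, printing every <title> it encounters; the file name is again just a placeholder.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class WikiDumpStaxDemo {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("enwiki-latest-pages-articles.xml")) {  // placeholder path
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            // Pull-based: we ask the reader for the next event instead of receiving callbacks.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }
}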

http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html

http://tutorials.jenkov.com/java-xml/sax-vs-stax.html

#5


1  

If you don't intend to write or change anything in that XML, consider using SAX. It keeps only one node in memory at a time (unlike DOM, which tries to build the whole tree in memory).

#6


1  

I would go with StAX, as it provides more flexibility than SAX (which is also a good option).

#7


0  

There is a standalone application that parses Wikipedia dumps into XML and plain text, called Wiki Parser.

In principle, you can parse the Wikipedia dump and then use Java to do anything you need with the XML or plain text.

The advantage of doing it that way is that Wiki Parser is very fast and takes only 2-3 hours to parse all of the current English Wikipedia articles.

#8


0  

I had this problem some days ago, and I found out that the wiki parser provided by https://github.com/Stratio/wikipedia-parser does the work. It streams the XML file and reads it in chunks, which you can then capture in callbacks.

This is a snippet of how I used it in Scala:

// Stream the bzip2-compressed dump and handle each revision via a callback
val parser = new XMLDumpParser(
  new BZip2CompressorInputStream(
    new BufferedInputStream(new FileInputStream(pathToWikipediaDump)), true))

parser.getContentHandler.setRevisionCallback(new RevisionCallback {
  override def callback(revision: Revision): Unit = {
    val page = revision.getPage
    val title = page.getTitle
    val articleText = revision.getText()
    println(articleText)
  }
})

It streams the Wikipedia dump, parses it, and every time it finds a revision (article) it gets its title and text and prints the article's text. :)

--- Edit ---

Currently I am working on https://github.com/idio/wiki2vec, which I think covers part of the pipeline you might need. Feel free to take a look at the code.
