How do I split a large wikipedia dump .xml.bz2 file in Python?

Time: 2022-05-06 11:14:06

I am trying to build an offline wiktionary from the wikimedia dump files (.xml.bz2) using Python. I started with this article as the guide. It involves a number of languages, and I wanted to combine all the steps into a single Python project. I have found almost all the libraries required for the process. The only hurdle now is to split the large .xml.bz2 file efficiently into a number of smaller files for quicker parsing during search operations.

I know that the bz2 library exists in Python, but it provides only compress and decompress operations. I need something that can do what bz2recover does from the command line, which splits large files into a number of smaller chunks.

One more important point: the splitting shouldn't break up the page contents, which start with <page> and end with </page> in the XML document that has been compressed.

Is there an existing library that could handle this situation, or does the code have to be written from scratch? (Any outline/pseudo-code would be greatly helpful.)

Note: I would like to make the resulting package cross-platform compatible, hence I can't use OS-specific commands.

3 Solutions

#1


12  

In the end I wrote a Python script myself:

import os
import bz2

def split_xml(filename):
    ''' Takes the filename of a wiktionary .xml.bz2 dump as input and creates
    smaller chunks of it in the directory "chunks".
    '''
    # Check for and create the chunk directory
    if not os.path.exists("chunks"):
        os.mkdir("chunks")
    # Counters
    pagecount = 0
    filecount = 1
    # Open the first chunk file in write mode
    chunkname = lambda filecount: os.path.join("chunks", "chunk-" + str(filecount) + ".xml.bz2")
    chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Read the dump line by line (BZ2File yields bytes)
    bzfile = bz2.BZ2File(filename)
    for line in bzfile:
        chunkfile.write(line)
        # </page> marks the end of a wiki page
        if b'</page>' in line:
            pagecount += 1
        if pagecount > 1999:
            #print(chunkname(filecount)) # For debugging
            chunkfile.close()
            pagecount = 0 # Reset pagecount
            filecount += 1 # Increment filename
            chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Close the last (possibly partially filled) chunk
    chunkfile.close()

if __name__ == '__main__':
    # When the script is run directly
    split_xml('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2')
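A quick way to sanity-check the split (just a sketch, not part of the script above) is to reopen one of the resulting chunks and count how many pages it contains:

import bz2

# Reopen a chunk and count its pages; the path assumes split_xml()
# was run with the default "chunks" output directory.
with bz2.BZ2File("chunks/chunk-1.xml.bz2") as chunk:
    pages = sum(1 for line in chunk if b'</page>' in line)
print(pages, 'pages in chunk-1')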

#2


1  

Well, if you have a command-line tool that offers the functionality you are after, you can always wrap it in a call using the subprocess module.

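For example, a minimal sketch (assuming the bz2recover binary is installed and on the PATH; the dump filename is just a placeholder):

import subprocess

# Call bz2recover on the compressed dump; it writes rec*.bz2 pieces
# next to the input file.
subprocess.run(["bz2recover", "tawiktionary-20110518-pages-articles.xml.bz2"], check=True)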

#3


0  

The method you are referencing is quite a dirty hack :)

I wrote an offline Wikipedia tool and just SAX-parsed the dump completely. The throughput is usable if you just pipe the uncompressed XML into stdin from a proper bzip2 decompressor, especially if it's only the wiktionary.

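A minimal sketch of that pipeline (illustrative only, not the actual tool; run it as e.g. bzcat dump.xml.bz2 | python count_pages.py):

import sys
import xml.sax

# Count <page> elements while streaming uncompressed XML from stdin.
class PageCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.pages = 0

    def startElement(self, name, attrs):
        if name == 'page':
            self.pages += 1

handler = PageCounter()
xml.sax.parse(sys.stdin.buffer, handler)
print(handler.pages, 'pages')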

As a simple way of testing, I just compressed every page, wrote it into one big file, and saved the offset and length in a cdb (a small key-value store). This may be a valid solution for you.

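A rough sketch of that idea, with a pickled dict standing in for cdb and a hypothetical pages iterable of (title, wikitext) pairs:

import bz2
import pickle

def build_index(pages, data_path='pages.dat', index_path='pages.idx'):
    # Compress each page individually, append it to one big data file,
    # and record (offset, length) per title in an index dict.
    index = {}
    with open(data_path, 'wb') as data:
        for title, text in pages:
            blob = bz2.compress(text.encode('utf-8'))
            index[title] = (data.tell(), len(blob))
            data.write(blob)
    with open(index_path, 'wb') as f:
        pickle.dump(index, f)

def lookup(title, data_path='pages.dat', index_path='pages.idx'):
    # Seek straight to the stored offset and decompress just that page.
    with open(index_path, 'rb') as f:
        index = pickle.load(f)
    offset, length = index[title]
    with open(data_path, 'rb') as data:
        data.seek(offset)
        return bz2.decompress(data.read(length)).decode('utf-8')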

Keep in mind that the MediaWiki markup is the most horrible piece of sh*t I've come across in a long time. But in the case of the wiktionary it might be feasible to handle.
