C programming: How to read a file with multiple threads in parallel using mmap(2)?

Date: 2022-11-27 13:49:13

I am trying to write multi-threaded code that reads a file in fixed-size chunks using mmap(2) and counts the words. Each thread works on a separate portion of the file, which should make processing faster. Reading the file with mmap(2) works with a single thread. With more than one thread, it fails with a segmentation fault.


for( unsigned long cur_pag_num = 0; cur_pag_num < total_blocks; cur_pag_num++ ) {
    mmdata = mmap(
        NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, (fileOffset + (cur_pag_num * PAGE_SIZE))
    );

    if (mmdata == MAP_FAILED) printf(" mmap error ");

    unsigned  long wc = getWordCount( mmdata );
    parserParam->wordCount +=wc;
    munmap( mmdata, PAGE_SIZE );
}

unsigned long getWordCount(char *page){
     unsigned long wordCount=0;
     for(long i = 0 ; page[i] ;i++ ){
        if(page[i]==' ' || page[i]=='\n')
            wordCount++;
     }
     return wordCount;
}

I have figured out that the code fails inside getWordCount(mmdata). What am I doing wrong here?


Note: the file is larger than main memory, so I read it in fixed-size chunks (PAGE_SIZE).


1 answer

#1


getWordCount reads past the end of the mapped page, because its loop only stops when it finds a null byte, and mmap() doesn't add a null byte after the mapped page. You need to pass the size of the mapped page to the function. The loop should stop when it reaches either that size or a null byte (if the file isn't long enough to fill the page, the rest of the page will be zeros).


for (unsigned long cur_pag_num = 0; cur_pag_num < total_blocks; cur_pag_num++) {
    mmdata = mmap(
        NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, (fileOffset + (cur_pag_num * PAGE_SIZE))
    );

    if (mmdata == MAP_FAILED) {
        perror("mmap");
        break;  /* never dereference the result of a failed mmap() */
    }

    /* Pass the mapping length so the scan cannot run past the page. */
    unsigned long wc = getWordCount(mmdata, PAGE_SIZE);
    parserParam->wordCount += wc;
    munmap(mmdata, PAGE_SIZE);
}

/* Counts separators in at most `size` bytes; stops early at a null byte. */
unsigned long getWordCount(const char *page, size_t size) {
    unsigned long wordCount = 0;
    for (size_t i = 0; i < size && page[i]; i++) {
        if (page[i] == ' ' || page[i] == '\n')
            wordCount++;
    }
    return wordCount;
}
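
For reference, here is a minimal self-contained sketch of the same idea that avoids relying on null bytes at all: it asks fstat() for the file size and passes the exact number of valid bytes for each chunk. The names count_separators and count_file_separators are hypothetical, not from the question, and the chunk size is taken from sysconf(_SC_PAGESIZE) on the assumption that each mmap() offset must be page-aligned. It is single-threaded and only illustrates the bounded scan.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Counts ' ' and '\n' in exactly `len` bytes; never reads buf[len] or beyond. */
static unsigned long count_separators(const char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    unsigned long n = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == ' ' || buf[i] == '\n')
            n++;
    }
    return n;
}

/* Hypothetical helper: maps `path` one page at a time and sums the counts.
 * Returns -1 on any error. */
static long count_file_separators(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    long total = 0;

    for (off_t off = 0; off < st.st_size; off += (off_t)page) {
        /* Valid bytes in this chunk: a full page, or fewer for the last one. */
        size_t remaining = (size_t)(st.st_size - off);
        size_t len = remaining < page ? remaining : page;

        char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return -1;
        }

        total += (long)count_separators(p, len);
        munmap(p, len);
    }

    close(fd);
    return total;
}
```

Because the length comes from the file size, a file that legitimately contains zero bytes is still scanned to the end, which the null-byte check would not do.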

BTW, there's another problem with your approach: a chunk boundary can fall in the middle of a word. If each chunk counts its words independently, a word that spans a page boundary will be counted twice.

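
One possible way around the boundary problem, sketched under assumptions: count a word only in the chunk where it starts, by checking whether the byte just before it is a separator. The sketch below works on a plain buffer and assumes the byte at begin - 1 is addressable (for example, one mapping that covers the whole range, or a chunk mapped together with the page before it). The names is_sep, count_word_starts and count_words_chunked are hypothetical. It runs the chunks sequentially; since each call only reads the buffer and writes nothing shared, the per-chunk calls could be handed to separate threads.

```c
#include <assert.h>
#include <stddef.h>

static int is_sep(char c) {
    return c == ' ' || c == '\n';
}

/* Counts words that START in [begin, end). A word starts at a non-separator
 * byte that is at offset 0 or directly follows a separator. */
static unsigned long count_word_starts(const char *data, size_t begin, size_t end) {
    unsigned long n = 0;
    for (size_t i = begin; i < end; i++) {
        int prev_is_sep = (i == 0) || is_sep(data[i - 1]);
        if (prev_is_sep && !is_sep(data[i]))
            n++;
    }
    return n;
}

/* Splits [0, len) into fixed-size chunks and sums the per-chunk counts. */
static unsigned long count_words_chunked(const char *data, size_t len, size_t chunk) {
    assert(chunk > 0);
    unsigned long total = 0;
    for (size_t begin = 0; begin < len; begin += chunk) {
        size_t end = (len - begin < chunk) ? len : begin + chunk;
        total += count_word_starts(data, begin, end);
    }
    return total;
}
```

With this rule the total does not depend on the chunk size, because a word cut by a boundary is counted only by the chunk that contains its first byte.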
