Why is my C++ text file parsing script so much slower than my Python script?

Time: 2023-01-16 00:24:29

I am currently trying to teach myself C++, and I am working on file I/O. I have read through the cplusplus.com tutorial and am using the basic file I/O techniques I learned there:

std::ifstream  // using this to open a read-only file
std::ofstream  // using this to create an output file
std::getline  // using this to read each line of the file
outputfile << linecontents // using this to write to the output file

I have an approximately 10MB text file containing the first million primes, which are separated by whitespace, 8 primes to a line. My goal is to write a program which will open the file, read through the contents, and write a new file with one prime number per line. I am using regular expressions to strip the whitespace on the ends of each line, and to replace the whitespace between each number with a single newline character.

The basic algorithm is simple: using regular expressions, I trim the whitespace on the ends of each line, replace the whitespace in the middle with a newline character, and write that string to the output file. I have written the 'same' algorithm in C++ and Python (except that in Python I use the built-in strip() function to remove leading and trailing whitespace), and the Python program is much quicker! I expected the opposite; I would think that a (well-written) C++ program should be lightning fast, and a Python program 10-20 times slower. Whatever optimization Python does behind the scenes, though, is making it way faster than my 'equivalent' C++ program.

My regex searches:

std::tr1::regex rxLeadingTrailingWS("^(\\s)+|(\\s)+$"); //whitespace at beginning or end of string
std::tr1::regex rxWS("(\\s)+"); //whitespace anywhere

My file-parsing code:

void ReWritePrimesFile()
{
    std::ifstream readFile("..//primes1.txt");
    std::ofstream reducedPrimeList("..//newprimelist.txt");
    std::string readout;
    std::string tempLine;

    std::tr1::regex rxLeadingTrailingWS("^(\\s)+|(\\s)+$"); //whitespace at beginning or end of string
    std::tr1::regex rxWS("(\\s)+"); //whitespace anywhere
    std::tr1::cmatch res; //the variable which a regex_search writes its results to

    while (std::getline(readFile, readout)){
        tempLine = std::tr1::regex_replace(readout.c_str(), rxLeadingTrailingWS, ""); //remove leading and trailing whitespace
        reducedPrimeList << std::tr1::regex_replace(tempLine.c_str(), rxWS, "\n") << "\n"; //replace all other whitespace with newlines
    }

    reducedPrimeList.close();
}

However, this code is taking minutes to parse through a 10 MB file. The following Python script takes approx 1-3 seconds (haven't timed it):

import re
rxWS = r'\s+'
with open('pythonprimeoutput.txt', 'w') as newfile:
    with open('primes1.txt', 'r') as f:
        for line in f.readlines():
            newfile.write(re.sub(rxWS, "\n", line.strip()) + "\n")

The only notable difference is that in Python I'm using the built-in strip() function to remove leading and trailing whitespace instead of using a regular expression. (Is this the source of my terribly slow execution time?)
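
For comparison, here is a minimal sketch of what a non-regex trim could look like in C++, roughly mirroring what Python's strip() does when called with no arguments. The name Trim is a hypothetical helper, not something from the code above, and the whitespace set is an assumption (the six ASCII whitespace characters).

```cpp
#include <string>

// Hypothetical helper (not part of the original post): returns a copy of `s`
// with leading and trailing ASCII whitespace removed, without using a regex.
std::string Trim(const std::string& s)
{
    const char* const whitespace = " \t\n\r\f\v";

    // Index of the first character that is not whitespace.
    const std::string::size_type first = s.find_first_not_of(whitespace);
    if (first == std::string::npos)
        return std::string(); // the string was empty or all whitespace

    // Index of the last character that is not whitespace.
    const std::string::size_type last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}
```

With a helper like this, the first regex_replace call in the loop could be replaced by Trim(readout), leaving only one regular expression per line.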

I'm not sure at all where the horrible inefficiency in my program is coming from. A 10MB file should not take this long to parse through!

*Edit: I originally gave the file size as 20 MB; it is only 10 MB.

Per Nathan Oliver's suggestion, I used the following code, which still took about 5 minutes to run. This is now pretty much the same algorithm I used in Python. Still not sure what's different.

void ReWritePrimesFile()
{
    std::ifstream readFile("..//primes.txt");
    std::ofstream reducedPrimeList("..//newprimelist.txt");
    std::string readout;
    std::string tempLine;

    //std::tr1::regex rxLeadingTrailingWS("^(\\s)+|(\\s)+$"); //whitespace at beginning or end of string
    std::tr1::regex rxWS("(\\s)+"); //whitespace anywhere
    std::tr1::cmatch res; //the variable which a regex_search writes its results to

    while (readFile >> readout){
        reducedPrimeList << std::tr1::regex_replace(readout.c_str(), rxWS, "\n") + "\n"; //replace all whitespace with newlines
    }

    reducedPrimeList.close();
}

Second edit: I had to append an additional newline character at the end of the regex_replace line. Apparently readFile >> readout stops at every whitespace character? I'm not sure how it works, but it runs one iteration of the while loop for each number in the file, not for each line in the file.
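
To illustrate the behaviour described in this edit, here is a small sketch of how operator>> extracts strings: it skips any leading whitespace (spaces, tabs, and newlines alike) and then reads characters until the next whitespace character. SplitOnWhitespace is a hypothetical helper made up for this illustration, and it reads from an in-memory string rather than a file.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration only: collects every token that
// operator>> extracts from `text`.
std::vector<std::string> SplitOnWhitespace(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;

    // Each extraction skips leading whitespace, then reads up to (but not
    // including) the next whitespace character. Line breaks are treated
    // like any other whitespace, so the loop runs once per number.
    while (in >> token)
        tokens.push_back(token);

    return tokens;
}
```

Because each extracted token contains no whitespace at all, a regex that matches whitespace has nothing to replace inside such a loop.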

1 solution

#1


The code you have is slower because you are making two regex calls per line in the C++ code. Also, note that when you use the >> operator to read from the file, it ignores leading whitespace and reads until it finds another whitespace character. You could easily write your function like this:

void ReWritePrimesFile()
{
    std::ifstream readFile("..//primes1.txt");
    std::ofstream reducedPrimeList("..//newprimelist.txt");
    std::string readout;

    while(readFile >> readout)
        reducedPrimeList << readout << '\n';
}
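
If you want to check the output format without touching the real files, the same loop can be written against generic streams. This is only a sketch: WriteOnePerLine and OnePerLine are hypothetical names introduced here, not part of the function above.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical variant of the loop above that takes its streams as
// parameters, so it works with files and in-memory streams alike.
void WriteOnePerLine(std::istream& in, std::ostream& out)
{
    std::string token;
    while (in >> token)       // one whitespace-delimited number per iteration
        out << token << '\n'; // '\n' rather than std::endl avoids a flush per line
}

// Convenience wrapper for trying the loop on an in-memory string.
std::string OnePerLine(const std::string& text)
{
    std::istringstream in(text);
    std::ostringstream out;
    WriteOnePerLine(in, out);
    return out.str();
}
```

Passing a std::ifstream and a std::ofstream to WriteOnePerLine should behave like the file-based function above, since both derive from the stream base classes used in the signature.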
