Searching for data in a large file in C/C++

Time: 2022-07-22 21:33:08

I have a log file which has a format of this kind:

DATE-TIME ### attribute1  ### attribute2 ###attribute3 

I have to search this log file for an input attribute (entered from the command line) and output the lines that match the entered attribute.
A naive approach might be something like this:

scan the entire file line by line
search for the attribute
print if found, else ignore.
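
For reference, a minimal sketch of that naive scan in C++ (assuming the `###`-separated format above and a plain substring match):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Naive O(n) scan: print every line containing the attribute.
int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " <logfile> <attribute>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    if (!in) {
        std::cerr << "cannot open " << argv[1] << "\n";
        return 1;
    }
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(argv[2]) != std::string::npos)
            std::cout << line << '\n';
    }
    return 0;
}
```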

This approach is slow, as it requires O(n) comparisons, where n is the number of lines, which may be very large.
Another approach may be to use a hash table, but keeping such an in-memory hash table for a big file may not be possible.
So, what is the best feasible solution? How can I possibly index the entire file on various attributes?

EDIT:
The log file may be about 100K lines, much like the system log files on Linux. In one invocation, a user may search for multiple attributes, which are not known until the search for the first attribute has completed, as in an interactive console.

Thanks,

10 solutions

#1


2  

You can reduce the size of the hash table by only storing hash values and file offsets in it. If the attributes only have a fixed, relatively small number of values, you are more likely to be able to fit the whole hash table in memory. You assign an id to each possible value of the attribute, and then for each id value store a big list of file offsets.
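
A minimal sketch of this idea (names such as `split_fields` are illustrative, not from the original post): one pass over the file records, for each attribute value seen, the byte offsets of the lines containing it; each later search is then a map lookup plus a few seeks.

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Map each attribute value to the byte offsets of the lines containing it.
using OffsetIndex = std::unordered_map<std::string, std::vector<std::streamoff>>;

// Hypothetical helper: extract the "###"-separated fields of one line
// (whitespace trimming omitted for brevity).
std::vector<std::string> split_fields(const std::string& line) {
    std::vector<std::string> fields;
    std::string::size_type pos = 0, sep;
    while ((sep = line.find("###", pos)) != std::string::npos) {
        fields.push_back(line.substr(pos, sep - pos));
        pos = sep + 3;
    }
    fields.push_back(line.substr(pos));
    return fields;
}

OffsetIndex build_index(std::ifstream& in) {
    OffsetIndex index;
    std::string line;
    std::streamoff offset = in.tellg();
    while (std::getline(in, line)) {
        for (const auto& f : split_fields(line))
            index[f].push_back(offset);  // remember where this value occurred
        offset = in.tellg();             // start of the next line
    }
    return index;
}
```

To answer a query, look up `index[attr]`, then `seekg` to each stored offset and `getline` to print the matching line.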

Of course the hash table is only going to be helpful if, within the same run of the program, you do several different searches.

The obvious solution would be to stuff the data in a database, but I assume that the OP is smart enough to have realized that already and has other reasons for specifically requesting a non-database solution to the problem.

#2


4  

If you only search once, you can't do better than O(n).

If a hash index is too big to fit in memory, use an on-disk hash like dbm or gdbm.
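
A rough sketch of the gdbm route (attribute value as key, a packed list of file offsets as value; file name and values are illustrative, and error handling is mostly omitted):

```cpp
#include <gdbm.h>
#include <string>
#include <vector>

// Store a list of file offsets under an attribute value in an on-disk hash.
void store_offsets(GDBM_FILE db, const std::string& attr,
                   const std::vector<long>& offsets) {
    datum key;
    key.dptr  = const_cast<char*>(attr.data());
    key.dsize = static_cast<int>(attr.size());

    datum val;
    val.dptr  = reinterpret_cast<char*>(const_cast<long*>(offsets.data()));
    val.dsize = static_cast<int>(offsets.size() * sizeof(long));

    gdbm_store(db, key, val, GDBM_REPLACE);
}

int main() {
    // GDBM_WRCREAT: open for writing, create if missing; link with -lgdbm.
    GDBM_FILE db = gdbm_open("log.idx", 0, GDBM_WRCREAT, 0644, nullptr);
    if (!db) return 1;
    store_offsets(db, "attribute1", {0L, 4096L, 81920L});
    gdbm_close(db);
    return 0;
}
```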

EDIT: I'd like to point out that the Berkeley DB tool that KeithB suggested falls into this category of on-disk hashes. Berkeley DB is not a SQL-based relational database.

#3


3  

You could use Berkeley DB to index this file. Basically, go through the entire file once and, for each attribute found, store the file position of the attribute. Berkeley DB uses an efficient B-tree implementation and does not require storing the entire index in memory, just the parts that are needed. I've done this before with good success.
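
A hedged sketch using the classic Berkeley DB C API (key = attribute value, data = file offset; duplicate keys are enabled so one attribute value can map to many lines; file name and values are illustrative):

```cpp
#include <db.h>
#include <cstring>

int main() {
    DB *dbp = nullptr;
    if (db_create(&dbp, nullptr, 0) != 0) return 1;
    dbp->set_flags(dbp, DB_DUP);  // one attribute value -> many offsets

    // B-tree on disk; link with -ldb.
    if (dbp->open(dbp, nullptr, "log.idx", nullptr,
                  DB_BTREE, DB_CREATE, 0664) != 0) return 1;

    const char *attr = "attribute1";
    long offset = 4096;           // byte position of a matching line

    DBT key, data;
    std::memset(&key, 0, sizeof key);
    std::memset(&data, 0, sizeof data);
    key.data  = const_cast<char*>(attr);
    key.size  = std::strlen(attr);
    data.data = &offset;
    data.size = sizeof offset;

    dbp->put(dbp, nullptr, &key, &data, 0);  // index one occurrence
    dbp->close(dbp, 0);
    return 0;
}
```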

#4


2  

Sounds like a job for a database system.

How can I possibly index the entire file on various attributes?

Otherwise, you are really left implementing a DB solution under the hood. Your best bets are probably some offline search algorithms and maintaining an index file.

You may also find Map-Reduce of interest.

#5


2  

There is no efficient method for directly searching a text file of variable length lines. Most efficient methods require an overhead to mutate the data into a form better suited to a more efficient search method. This overhead may not be worthwhile for infrequent searches.

Infrequent Searching

If the file is only searched once or not very frequently, I suggest a brute-force line-by-line search. Other methods may waste time setting up the data for a faster search; for example, when you only need your program to find the first line containing one or more attributes.

Frequent Searching

The program is required to search the file many times. In this case, you may want to set up some data structures for better searching. One technique is to create index files. These files contain the file positions of attributes, sorted by attribute: something like [attribute][first occurrence][second occurrence], etc.
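
One hedged sketch of such an index, kept deliberately simple: sorted (attribute, offset) pairs built once in memory, then searched with std::equal_range to find every occurrence of an attribute.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// One index entry: attribute value and the byte offset of a line containing it.
using Entry = std::pair<std::string, long>;

// Return the offsets of every line containing `attr`.
// Precondition: `index` is sorted by attribute value.
std::vector<long> lookup(const std::vector<Entry>& index, const std::string& attr) {
    auto range = std::equal_range(
        index.begin(), index.end(), Entry{attr, 0L},
        [](const Entry& a, const Entry& b) { return a.first < b.first; });
    std::vector<long> offsets;
    for (auto it = range.first; it != range.second; ++it)
        offsets.push_back(it->second);
    return offsets;
}
```

Such a vector can be serialized to an index file and reloaded on the next run, which is the point of the technique.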

Another alternative is to have your program convert the file into a format that a better tool can use, such as a spreadsheet or a database. One example is Comma-Separated Values (CSV); some tools prefer the values separated by a '|'.
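
Converting the `###`-separated lines to CSV is mostly a separator swap; a rough sketch (no quoting or escaping, which real CSV may need, and surrounding spaces in fields are left as-is):

```cpp
#include <fstream>
#include <string>

// Rewrite "a ### b ### c" lines as "a,b,c" so spreadsheet tools can import them.
void to_csv(const std::string& in_path, const std::string& out_path) {
    std::ifstream in(in_path);
    std::ofstream out(out_path);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type pos;
        while ((pos = line.find("###")) != std::string::npos)
            line.replace(pos, 3, ",");
        out << line << '\n';
    }
}
```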

Changing the Generator

Better yet, the program that generates the log file could be changed to generate spreadsheet- or database-friendly log files. I did this with an embedded system: I fed the data into a spreadsheet and used spreadsheet functions to analyze it. Much easier than writing a program to search (and analyze) the log file.

#6


0  

Do the search in a separate thread. It may take a while, but I think the trade-off of building and then searching a hash table is not worth it in this case.

If the user were repeatedly searching for different attributes, it might be worthwhile, but I doubt it.

For the searching, try boost::regex or QRegExp.
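
For instance, a line scan with boost::regex might look like this (QRegExp would be the analogous Qt route; the pattern here is whatever the user supplies):

```cpp
#include <boost/regex.hpp>
#include <fstream>
#include <iostream>
#include <string>

// Print every line matching the user-supplied pattern; link with -lboost_regex.
int main(int argc, char* argv[]) {
    if (argc != 3) return 1;
    std::ifstream in(argv[1]);
    const boost::regex pattern(argv[2]);
    std::string line;
    while (std::getline(in, line)) {
        if (boost::regex_search(line, pattern))
            std::cout << line << '\n';
    }
    return 0;
}
```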

For just searching a log file, I think you might be over-engineering this with a database or hash table.

#7


0  

I would build an index using a B-tree or hash table and store the index on disk. If you are in control of updates to this file, you can also update the index when the file is updated. If you are not in control of updates, then use a hash of the file to determine whether the index needs to be regenerated.
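
A small sketch of the staleness check, using the file's modification time rather than a full content hash (cheaper, though a content hash is more robust against in-place edits):

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Rebuild the index if the log was modified after the index was written.
bool index_is_stale(const fs::path& log, const fs::path& index) {
    return !fs::exists(index) ||
           fs::last_write_time(log) > fs::last_write_time(index);
}
```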

After thinking about this, I realized that I commonly search log files with 600k+ lines (~100 MB) using grep on the command line, and it is quite fast. So I don't think 100K lines is a problem unless your files contain a lot more data.

You might also want to look into using logrotate if the log files are getting large.

#8


0  

You could do a straightforward linear search of the file, but launch worker threads to do the searching. That way, you can spawn multiple worker threads and start them at different points in the file. On a machine with multiple processors/cores, this parallelism should produce a performance boost over a plain, single-threaded linear search.
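
A hedged sketch of that chunked parallel scan (each thread opens its own stream, aligns itself to a line boundary so no line is processed twice or skipped, and scans its byte range; printing is serialized with a mutex; error handling omitted):

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex print_mutex;

// Scan lines that *start* in [begin, end) for a substring match on `attr`.
void scan_chunk(const std::string& path, const std::string& attr,
                std::streamoff begin, std::streamoff end) {
    std::ifstream in(path);
    if (begin > 0) {
        in.seekg(begin - 1);
        char c;
        in.get(c);
        std::string skipped;
        if (c != '\n')                 // begin fell mid-line: that line belongs
            std::getline(in, skipped); // to the previous chunk, so discard it
    }
    std::string line;
    while (static_cast<std::streamoff>(in.tellg()) < end &&
           std::getline(in, line)) {
        if (line.find(attr) != std::string::npos) {
            std::lock_guard<std::mutex> lock(print_mutex);
            std::cout << line << '\n';
        }
    }
}

int main(int argc, char* argv[]) {
    if (argc != 3) return 1;
    std::ifstream probe(argv[1], std::ios::ate);  // open at end to get size
    const std::streamoff size = probe.tellg();
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(scan_chunk, argv[1], argv[2],
                             size * i / n, size * (i + 1) / n);
    for (auto& t : workers) t.join();
    return 0;
}
```

Whether this actually beats a single-threaded scan depends on the storage: on spinning disks the seeks can hurt more than the parallelism helps, while on SSDs or a warm page cache it tends to pay off.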

After you do a search, you may want to store the input parameters and the results of that search in a data file. That way, if the search is repeated in the future, you can use this cached data and will only have to search through the part of the file modified since the last search.

#9


0  

Don't worry. Just scan it.

Seriously. You say this file is 100K lines, which is indeed almost exactly the same size as the /var/log/messages on the computer I'm typing this on. Well, this is a netbook, i.e. very slow by modern standards -- and yet a straightforward grep of my /var/log/messages is so quick that it is dwarfed by the amount of time taken to print the results to the screen.

So I really don't think you have anything to worry about -- particularly not if this is an interactive process that only runs searches when a human demands it, not some daemon that might be constantly searching for things in the background. You've quite possibly already wasted more time worrying about this problem than you could ever hope to save by implementing an index!

#10


0  

Come on, seriously guys. A database for log files?

Log files change all the time, or at the least every day. So what you probably really want is some kind of log file rotation, for which there are many premade tools; failing that, it could be done in a few hours if you know even a little Perl. You probably don't really need C++ for this, either; it will just make development slower and the end result buggier.
