sed optimization (large-file modification based on a smaller data set)

Date: 2021-08-16 21:32:33

I have to deal with very large plain text files (over 10 gigabytes; yes, I know it depends on what we call large), with very long lines.

My most recent task involves some line editing based on data from another file.

The data file (which should be modified) contains 1,500,000 lines, each of them e.g. 800 chars long. Each line is unique and contains only one identity number, and each identity number is unique.

The modifier file is e.g. 1,800 lines long; each line contains an identity number, plus an amount and a date which should be modified in the data file.

I just transformed the modifier file into a sed script (with Vim regexes), but it's very inefficient.

Let's say I have a line like this in the data file:

(some 500 characters)id_number(some 300 characters)

And I need to modify data in the 300 char part.

Based on the modifier file, I came up with sed lines like this:

/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/

So I have 1800 lines like this.

But I know that even on a very fast server, if I do a

sed -i.bak -f modifier.sed data.file

it's very slow, because it has to check every pattern against every line.

Isn't there a better way?

Note: I'm not a programmer and have never learned (in school) about algorithms. I can use awk, sed, and an outdated version of perl on the server.

6 Answers

#1


My suggested approaches (in order of desirability) would be to process this data as:

  1. A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)

  2. A flat file containing fixed record lengths

  3. A flat file containing variable record lengths

Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look at DBD::SQLite in the case of Perl.
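
A minimal sketch of the SQLite route with DBI and DBD::SQLite, just to show the shape of it -- the file names, table layout, character offsets, and the tab-separated modifier format below are illustrative assumptions, not something taken from the question:

use strict;
use warnings;
use DBI;

# Illustrative offsets only -- adjust to the real record layout.
my ($ID_OFFSET, $ID_LENGTH) = (500, 20);

my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, line TEXT)');

# one-time import: one row per data-file line, keyed by the identity number
my $ins = $dbh->prepare('INSERT OR REPLACE INTO records (id, line) VALUES (?, ?)');
open my $df, '<', 'data.file' or die "data.file: $!";
$dbh->begin_work;
while (my $line = <$df>) {
    chomp $line;
    $ins->execute(substr($line, $ID_OFFSET, $ID_LENGTH), $line);
}
$dbh->commit;

# apply the modifications with indexed lookups instead of scanning 10GB per pattern;
# here: keep the first 650 chars, replace the next 20 with the new amount/date
my $upd = $dbh->prepare(
    'UPDATE records SET line = substr(line, 1, 650) || ? || substr(line, 671) WHERE id = ?');
open my $mf, '<', 'modifier.file' or die "modifier.file: $!";
$dbh->begin_work;
while (<$mf>) {
    chomp;
    my ($id, $replacement) = split /\t/;    # assumed: id<TAB>new_amount_and_date
    $upd->execute($replacement, $id);
}
$dbh->commit;
# (dumping the table back out to a flat file afterwards is left as an exercise)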

If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?

If you have variable record lengths, I'd suggest converting to fixed record lengths (since it appears only your ID is variable length). If you can't do that, perhaps any existing data will never move around in the file? Then you can maintain that previously mentioned index and add new entries as necessary, the difference being that instead of the index pointing to a record number, you now point to the absolute position in the file.
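
For the fixed-record-length variant, a rough sketch of the index-and-seek idea, assuming every record has exactly the same length, the replacement is exactly as long as the field it overwrites, and the offsets, file names, and modifier format below are placeholders:

use strict;
use warnings;

# Illustrative numbers: 800 data chars plus a newline per record, id at offset 500,
# field to change at offset 650, 20 chars long. Adjust to the real layout.
my ($REC_LEN, $ID_OFF, $ID_LEN, $FLD_OFF, $FLD_LEN) = (801, 500, 20, 650, 20);

# pass 1: build the index (id -> record number); it could also be saved to a side file
my %recno;
open my $in, '<', 'data.file' or die "data.file: $!";
while (my $line = <$in>) {
    $recno{ substr($line, $ID_OFF, $ID_LEN) } = $. - 1;
}
close $in;

# pass 2: seek straight to each record that needs changing and overwrite it in place
open my $fh, '+<', 'data.file' or die "data.file: $!";
open my $mf, '<', 'modifier.file' or die "modifier.file: $!";
while (<$mf>) {
    chomp;
    my ($id, $new_text) = split /\t/;                  # assumed: id<TAB>new_text
    next unless exists $recno{$id};
    die "replacement must stay $FLD_LEN chars long\n" unless length($new_text) == $FLD_LEN;
    seek $fh, $recno{$id} * $REC_LEN + $FLD_OFF, 0 or die "seek: $!";
    print {$fh} $new_text;
}
close $fh;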

#2


I suggest a program written in Perl (as I am not a sed/awk guru and I don't know exactly what they are capable of).

You "algorithm" is simple: you need to construct, first of all, an hashmap which could give you the new data string to apply for each ID. This is achieved reading the modifier file of course.

你的“算法”很简单:首先,你需要构建一个hashmap,它可以为你提供新的数据字符串来应用每个ID。这当然是通过读取修饰符文件来实现的。

Once this hashmap is populated, you may browse each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.

I am not a Perl guru either, but I think that the program is quite simple. If you need help writing it, ask for it :-)
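
For what it's worth, a bare-bones sketch of that approach, assuming the ID always starts at the same column, the replacement from the modifier file is exactly as long as the text it overwrites, and the offsets and tab-separated format are placeholders:

use strict;
use warnings;

# read the modifier file into a hash: id -> replacement text (assumed tab-separated)
my %new_data;
open my $mf, '<', 'modifier.file' or die "modifier.file: $!";
while (<$mf>) {
    chomp;
    my ($id, $replacement) = split /\t/;
    $new_data{$id} = $replacement;
}
close $mf;

# single pass over the data file: look up the ID, patch the line, print it
my ($ID_OFF, $ID_LEN, $FLD_OFF) = (500, 20, 650);      # illustrative offsets
while (my $line = <>) {
    my $id = substr($line, $ID_OFF, $ID_LEN);
    substr($line, $FLD_OFF, length $new_data{$id}) = $new_data{$id}
        if exists $new_data{$id};
    print $line;
}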

#3


With perl you should use substr to get id_number, especially if id_number has constant width.

my $id_number = substr($str, 500, $id_number_length);

After that, if $id_number is in range, you should use substr to replace the remaining text.

substr($str, -300, 300, $new_text);

Perl's regular expressions are very fast, but not in this case.

#4


My suggestion is: don't use a database. A well-written perl script will outperform a database by an order of magnitude on this sort of task. Trust me, I have a lot of practical experience with it. You won't even have finished importing the data into the database by the time the perl script is done.

1,500,000 lines of 800 chars each comes to about 1.2GB. On a very slow disk (30MB/s) you will read it in about 40 seconds; with a better one, 50MB/s -> 24s, 100MB/s -> 12s, and so on. But perl hash lookup (like a db join) speed on a 2GHz CPU is above 5M lookups/s. That means your CPU-bound work will take seconds and your IO-bound work tens of seconds. If it is really 10GB the numbers change, but the proportion stays the same.

You have not specified whether the modification changes the record size (i.e. whether it can be done in place), so we will not assume it can and will work as a filter. You also have not specified the format of your "modifier file" or what sort of modification you need. Let's assume it is tab-separated, something like:

<id><tab><position_after_id><tab><amount><tab><data>

We will read data from stdin and write to stdout, and the script can be something like this:

use strict;
use warnings;

my $modifier_filename = 'modifier_file.txt';

# read the modifier file into a hash keyed by id
open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
my %modifications;
while (<$mf>) {
    chomp;
    my ($id, $position, $amount, $data) = split /\t/;
    $modifications{$id} = [$position, $amount, $data];
}
close $mf;

# make matching regexp (use quotemeta to prevent regexp meaningful characters)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/;     # compile regexp

while (<>) {
    next unless m/$id_regexp/;
    next unless $modifications{$1};
    my ($position, $amount, $data) = @{$modifications{$1}};
    # replace $amount characters, $position chars after the end of the matched id
    # ($+[1] is the offset just past capture group 1), with $data
    substr $_, $+[1] + $position, $amount, $data;
}
continue { print }     # every line, modified or not, goes to stdout
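
Assuming the script above is saved as, say, modify.pl (a placeholder name), you would run it as a filter along these lines:

perl modify.pl < data.file > data.file.new && mv data.file.new data.file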

On my laptop it takes about half a minute for 1.5 million rows, 1,800 lookup ids, 1.2GB of data. For 10GB it should not take over 5 minutes. Is that reasonably quick for you?

If you start to think you are not IO bound (for example if you use some NAS) but CPU bound, you can sacrifice some readability and change to this:

# same loop, but hoisting the lookup and skipping the array copy on every hit
my $mod;
while (<>) {
    next unless m/$id_regexp/;
    $mod = $modifications{$1};
    next unless $mod;
    substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }

#5


You should almost certainly use a database, as MikeyB suggested.

If you don't want to use a database for some reason, then if the list of modifications will fit in memory (as it currently will, at 1,800 lines), the most efficient method is a hashtable populated with the modifications, as yves Baumes suggested.

If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge (a rough sketch follows the list) -- basically:

  1. Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file

  2. Adjust the record accordingly if they match

  3. Write it out

  4. Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file

  5. Goto 1.
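
A rough Perl sketch of that merge, assuming both files have already been sorted by ID (e.g. with sort), the modifications file is tab-separated as id<TAB>replacement, the IDs compare correctly as plain strings (e.g. fixed width), and the offsets and file names are the same kind of placeholders as before:

use strict;
use warnings;

my ($ID_OFF, $ID_LEN, $FLD_OFF) = (500, 20, 650);      # illustrative offsets

open my $data, '<', 'data.sorted'     or die "data.sorted: $!";
open my $mods, '<', 'modifier.sorted' or die "modifier.sorted: $!";

# read the next (id, replacement) pair from the modifications file, or nothing at EOF
sub next_mod {
    my $line = <$mods>;
    return unless defined $line;
    chomp $line;
    return split /\t/, $line;
}

my ($mod_id, $replacement) = next_mod();

while (my $line = <$data>) {
    my $id = substr($line, $ID_OFF, $ID_LEN);
    # discard the "top" of the modifications file while it is behind the data ID
    ($mod_id, $replacement) = next_mod() while defined $mod_id && $mod_id lt $id;
    # if the two "tops" match, adjust the record; either way, write it out
    if (defined $mod_id && $mod_id eq $id) {
        substr($line, $FLD_OFF, length $replacement) = $replacement;
    }
    print $line;
}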

Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.

#6


Good deal on the sqlloader or datadump decision. That's the way to go.
