awk / sed / grep: delete lines that match fields in another file

Time: 2021-09-10 16:52:05

I have a file1 with a few tens of lines, and a much longer file2 (~500,000 lines). The lines in the two files are not identical, but a subset of their fields is. I want to take fields 3-5 from each line in file1 and search file2 for the same three fields in the same order (in file2 they are fields 2-4). If any match is found, I want to delete the corresponding line from file1.

Eg, file1:

2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T07:53:50 2016-01-06T07:52:14 2016006 090E A TM Current
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current

file2:

2016-01-06T07:35:06.87 2016003 100E C NN Current 0
2016-01-06T07:35:09.97 2016003 100E B TM Current 6303
2016-01-06T07:36:23.12 2016004 030N C TM Current 0
2016-01-06T07:37:57.36 2016006 090E A TM Current 399
2016-01-06T07:40:29.61 2016006 010N C TM Current 0

... (and on for 500,000 lines)

So in this case, I want to delete the fourth line of file1 (in place).

The following finds the lines I want to delete:

grep "$(awk '{print $3,$4,$5}' file1)" file2

So one solution may be to pipe this to sed, but I'm unclear how to build a sed match pattern from piped input. Searching online suggests awk can probably do all of this (or perhaps sed, or something else), so I'm wondering what a clean solution would look like.

Also, speed is somewhat important because other processes may attempt to modify the files while this is going on (I know this may present more complications...). Matches will generally be found at the end of file2, not the beginning (in case there is some way to search file2 from the bottom up).

2 solutions

#1


4  

$ awk 'NR==FNR{file2[$2,$3,$4]; next} !(($3,$4,$5) in file2)' file2 file1
2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current

The fact that file2 contains 500,000 lines should be no problem for awk with respect to memory or execution speed; it should complete in about 1 second or less even in the worst case.
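
For readers unfamiliar with the NR==FNR idiom, here is a commented expansion of the one-liner above, wrapped in a shell function. This is only a sketch: the function name filter_unmatched, the array name seen and the throwaway demo directory are illustrative choices, not part of the original answer.

```shell
# filter_unmatched FILE1 FILE2   (hypothetical helper name)
# Print the lines of FILE1 whose fields 3-5 do not occur as fields 2-4
# of any line in FILE2.
filter_unmatched() {
    awk '
        # FILE2 is read first; NR == FNR is true only while reading it.
        NR == FNR {
            seen[$2, $3, $4]   # referencing the element creates the key
            next               # do not apply the filter below to FILE2
        }
        # FILE1: print the line only if its key was never seen in FILE2.
        !(($3, $4, $5) in seen)
    ' "$2" "$1"
}

# Demo on the sample rows from the question.
demo_dir=$(mktemp -d)
cat > "$demo_dir/file1" <<'EOF'
2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T07:53:50 2016-01-06T07:52:14 2016006 090E A TM Current
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current
EOF
cat > "$demo_dir/file2" <<'EOF'
2016-01-06T07:35:06.87 2016003 100E C NN Current 0
2016-01-06T07:35:09.97 2016003 100E B TM Current 6303
2016-01-06T07:36:23.12 2016004 030N C TM Current 0
2016-01-06T07:37:57.36 2016006 090E A TM Current 399
2016-01-06T07:40:29.61 2016006 010N C TM Current 0
EOF
# Prints four lines; the "2016006 090E A" row of file1 is dropped.
filter_unmatched "$demo_dir/file1" "$demo_dir/file2"
rm -rf "$demo_dir"
```

One caveat worth knowing: if file2 is empty, NR==FNR also holds while file1 is read, so nothing is printed. Check that file2 is non-empty before overwriting file1 with the result.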

With any UNIX command, to overwrite the original file you just do:

cmd file > tmp && mv tmp file

so in this case:

awk '...' file2 file1 > tmp && mv tmp file1
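
Since the question mentions other processes touching the files, a slightly more defensive form of the same tmp-and-mv pattern may be useful. The sketch below assumes a mktemp that accepts a template argument (GNU and BSD versions do); the helper name rewrite_in_place is made up for illustration.

```shell
# rewrite_in_place FILE CMD [ARGS...]   (hypothetical helper name)
# Run CMD ARGS..., capture its stdout in a temp file created next to FILE,
# and replace FILE only if CMD exited successfully.
rewrite_in_place() {
    target=$1
    shift
    # Same directory as the target, so the final mv is a rename on the same
    # filesystem rather than a copy.
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        mv "$tmp" "$target"
    else
        rm -f "$tmp"    # leave the original untouched on failure
        return 1
    fi
}
```

Usage would look like rewrite_in_place file1 awk '...' file2 file1, with the awk program from above in place of '...'. Two things to keep in mind: mktemp creates the file with restrictive permissions, so the replaced file1 ends up with those permissions rather than its original ones, and nothing here locks file1 against another process writing to it between the read and the mv.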

#2


1  

You can find non-matching lines in file1:

$ grep -v -F -f <(awk '{ print $2,$3,$4 }' file2) file1
2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current

Just redirect this somewhere and overwrite file1 afterwards.
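
One thing to be aware of with the grep approach: -F treats each key as a fixed string and matches it anywhere in the line, so a key ending in "090E A" would also match a file1 line containing "090E AB". A possible mitigation, sketched below, is to pad every key with a space on each side. This assumes the fields in file1 are separated by single spaces and that the key fields are never first or last on the line (true for the sample data). The helper name grep_unmatched is hypothetical, and the keys go through a temp file rather than process substitution so the function also works in a plain POSIX shell.

```shell
# grep_unmatched FILE1 FILE2   (hypothetical helper name)
# Print the lines of FILE1 that contain none of FILE2's space-padded keys.
grep_unmatched() {
    keys=$(mktemp) || return 1
    # Pad each FILE2 key (fields 2-4) with a space on both sides, so that
    # " 2016006 090E A " cannot match inside " 2016006 090E AB ".
    awk '{ print " " $2, $3, $4 " " }' "$2" > "$keys"
    grep -v -F -f "$keys" "$1"
    status=$?           # grep exits 1 when it selects no lines
    rm -f "$keys"
    return "$status"
}
```

Even with the padding, this still matches the key at any position in the line, not specifically in fields 3-5. The awk solution compares the fields exactly and is the safer choice if that distinction matters.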
