Linux: merge rows by columns that have the same values

Date: 2021-02-02 13:17:22

Is there any way to merge rows like:

7072;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7079;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7091;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7113;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7128;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7159;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7168;understand;-;F;18;IT;MN;2009-03-18 00:00:00

into just one:

7072;understand;-;F;18;IT;MN;2009-03-18 00:00:00

Basically, I need to:

1. Get the IDs from the first column of rows that have the same values in columns 2 through n (7072, 7079, 7091, ...).
2. Remove the duplicates, keeping just the first one (7072).

There are also other entries like:

7072;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7079;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7091;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7113;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7128;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7159;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7168;mistify;-;F;18;IT;MN;2009-03-18 00:00:00

Again I need to keep only 7072. Finally, it seems I would have to collect those IDs and do a substitution like:

sed 's/^id;.*//g' 
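
For reference, a minimal sketch (not from the answers below) that does the whole deduplication in one step, keying on everything after the first field so that rows matching on all of columns 2 through n collapse to their first occurrence; "file" here is a placeholder for the input file name:

# Build the key from columns 2..n by stripping the first field,
# then keep only the first line seen for each key.
awk -F';' '{key = $0; sub(/^[^;]*;/, "", key)} !seen[key]++' file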

2 solutions

#1

To remove duplicates based on the second column (that is, understand and mistify), you can use the following awk one-liner to keep the first copy of each line and filter out everything else:

awk -F';' '!seen[$2]++' file

For a file like the following:

$ cat file
7072;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7079;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7091;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7113;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7128;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7159;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7168;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7072;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7079;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7091;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7113;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7128;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7159;mistify;-;F;18;IT;MN;2009-03-18 00:00:00
7168;mistify;-;F;18;IT;MN;2009-03-18 00:00:00

It will produce the following output (keeping just the first occurrence of each key and filtering out everything else):

$ awk -F';' '!seen[$2]++' file
7072;understand;-;F;18;IT;MN;2009-03-18 00:00:00
7072;mistify;-;F;18;IT;MN;2009-03-18 00:00:00

We create an array named seen and use the second column as the key. The first time a line with a given key is seen, its count in the array is zero; negating that gives 1 (true), so awk prints the line by default. On every subsequent occurrence the count is greater than zero, so negation yields 0 (false) and the line is filtered out.
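
Written out long-hand with comments, the one-liner above is equivalent to this sketch:

awk -F';' '{
    # seen[$2] is 0 (false) the first time this column-2 value appears;
    # the post-increment then records it so later lines with the same
    # value are skipped.
    if (!seen[$2]++) {
        print   # first occurrence: print the whole line
    }
}' file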

If this is not what you want, please update your question to show the desired output for some sample data.

#2

Unless I misunderstand your question, the following will give you the output you ask for. Note that uniq -s 4 skips the first four characters when comparing lines, so this relies on every ID being exactly four characters wide; uniq also only collapses adjacent duplicate lines, which the sample input happens to satisfy:

$ uniq -s 4 input.txt | cut -d ";" -f 1

7072
7072
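
For comparison, a sketch of the same ID extraction with awk, which keys on the second field directly and therefore does not depend on the IDs all being the same width:

$ awk -F';' '!seen[$2]++ {print $1}' file
7072
7072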
