Find duplicate files by name in a recursive directory - Linux

Time: 2021-08-09 07:38:06

I have a folder which contains sub folders and some more files in them.

The files are named in the following way

abc.DEF.xxxxxx.dat

I'm trying to find the duplicate files only matching 'xxxxxx' in the above pattern ignoring the rest. The extension .dat doesn't change. But the length of abc and DEF might change. The order of separation by periods also doesn't change.

I'm guessing I need to use find in the following way:

find -regextype posix-extended -regex '\w+\.\w+\.\w+\.dat'

I need help coming up with the regular expression. Thanks.

Example: For a file named 'epg.ktt.crwqdd.dat', I need to find duplicate files containing 'crwqdd'.

1 Solution

#1


You can use awk for that:

find /path -type f -name '*.dat' | awk -F. 'a[$4]++'

Explanation:

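Assume the search starts in the current directory (find . rather than find /path), so every path begins with ./ and find gives the following output: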

./abd.DdF.TTDFDF.dat
./cdd.DxdsdF.xxxxxx.dat
./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat

In other words, you want to count how often the part between .dat and the preceding dot occurs, and print every line on which that part shows up for the second time or later.

To achieve this we split the file names on the dot, which gives us 5(!) fields:

 echo ./abd.DEF.xxxxxx.dat | awk -F. '{print $1 " " $2 " " $3 " " $4  " " $5}'
  /abd DEF xxxxxx dat

Note the first, empty field. The pattern of interest is $4.
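
To see how the field numbers line up, the split can be printed one field per line. This is only a sketch: the function name show_fields is made up for illustration, and NF is awk's built-in count of fields on the current line.

```shell
# show_fields is a hypothetical helper: it splits its argument on "."
# and prints every field together with its awk field number.
show_fields() {
    printf '%s\n' "$1" |
        awk -F. '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

show_fields ./abd.DEF.xxxxxx.dat
```

The first line of the output is $1=[], the empty field in front of the leading dot of ./, and xxxxxx shows up as $4.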

To count the occurrences of the value in $4 we use an associative array a and increment its entry on each occurrence. Unoptimized, the awk command looks like:

... | awk -F. '{if(a[$4]++ > 0){print}}'

However, you can write an awk program in the form:

CONDITION { ACTION }

This gives us:

... | awk -F. 'a[$4]++ > 0 {print}'

print is the default action in awk. It prints the whole current line, and because it is the default it can be omitted. The explicit comparison can be omitted as well: a[$4]++ evaluates to the count before the increment, which is zero only on the first occurrence, and awk treats any non-zero number as true. This gives us the final command:

... | awk -F. 'a[$4]++' 
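
The behaviour of the post-increment can be checked on a few plain lines instead of file names. A minimal sketch, where dups is a hypothetical helper name and the whole line ($0) is used as the key rather than $4:

```shell
# dups is a hypothetical helper name; it wraps the condition-only program.
dups() { awk 'a[$0]++'; }

# Show what a[$0]++ evaluates to on each line: the count before the increment.
printf '%s\n' x y x x y | awk '{ print $0, a[$0]++ }'

# Used as a bare condition, the same expression prints only repeated lines.
printf '%s\n' x y x x y | dups
```

The first command prints x 0, y 0, x 1, x 2, y 1 (one pair per line); the second prints x, x, y, that is, every line from its second occurrence onward.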

To generalize the command, note that the pattern of interest isn't really the 4th column, it is the next-to-last column. This can be expressed with awk's built-in variable NF, the number of fields. Counting from the end also keeps working when the path does not start with ./ or when a directory name contains a dot:

... | awk -F. 'a[$(NF-1)]++'

Output:

./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat
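
To try this end to end, the sample files can be recreated in a scratch directory. This is a sketch under a few assumptions: the variable names tmpdir and dupes are made up here, and the LC_ALL=C sort step is an addition that is not part of the command above. find does not promise any particular order, so which of the xxxxxx files counts as the "first" (and is therefore not printed) depends on the order in which the lines arrive.

```shell
#!/bin/sh
# Recreate the five sample files from the explanation in a scratch directory.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch abd.DdF.TTDFDF.dat cdd.DxdsdF.xxxxxx.dat abc.DEF.xxxxxx.dat \
      abd.DdF.xxxxxx.dat abd.DEF.xxxxxx.dat

# Run the generalized pipeline. sort is added only to make the order
# (and so the choice of "first" file per group) reproducible.
dupes=$(find . -type f -name '*.dat' | LC_ALL=C sort | awk -F. 'a[$(NF-1)]++')
printf '%s\n' "$dupes"

# Leave the scratch directory and clean it up.
cd / && rm -rf "$tmpdir"
```

Three of the four xxxxxx files are printed, and the single TTDFDF file never is.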
