Finding files with the same name in different directories and counting the duplicates

Time: 2021-02-04 16:37:15

I hope you can help me with the following problem. I have 24 directories, each containing many (thousands of) files. I would like to find out which combination of directories contains the largest number of duplicate (by name only) files. For example, if we only consider 4 directories

dir1 dir2 dir3 dir4

with the following directory contents

dir1

1.fa 2.fa 3.fa 4.fa 5.fa

dir2

1.fa 10.fa 15.fa

dir3

1.fa 2.fa 3.fa

dir4

1.fa 2.fa 3.fa 5.fa 8.fa 10.fa

Therefore, the combination of directories dir1 and dir4 contains the most duplicate files (4).

The problem becomes quite large with 24 directories, so I was thinking that I might use a brute-force approach, something along the lines of the following (a rough code sketch follows the list):

  1. count all duplicate files that occur in all 24 directories
  2. drop a directory and count the number of duplicate files
  3. replace the directory and drop another one, then count again
  4. repeat for all directories
  5. get the subset of 23 directories with the max number of duplicate files
  6. repeat steps 2-5 and keep the 22 directories with the most duplicate files
  7. repeat until only 2 directories are left
  8. choose the combination of directories with the max number of duplicate files
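
A rough Python sketch of that elimination loop, assuming the 24 directories sit under a common parent (the path "testdirs" below is just an example) and that "duplicate" means a file name present in at least two of the remaining directories. Note that this greedy elimination is a heuristic and is not guaranteed to find the best combination.

#!/usr/bin/env python3
# Greedy elimination: start from all directories, repeatedly drop the directory
# whose removal leaves the highest duplicate count, and remember the best subset
# seen at every size. "testdirs" is a placeholder path.

import os
from collections import Counter

def dup_count(dir_sets):
    # Number of file names appearing in two or more of the given directories.
    counts = Counter()
    for names in dir_sets.values():
        counts.update(names)
    return sum(1 for c in counts.values() if c > 1)

def best_subset(parent):
    dir_sets = {d: set(os.listdir(os.path.join(parent, d)))
                for d in os.listdir(parent)
                if os.path.isdir(os.path.join(parent, d))}
    best = (dup_count(dir_sets), sorted(dir_sets))
    while len(dir_sets) > 2:
        # Score every possible single-directory drop and keep the least harmful one.
        scores = {d: dup_count({k: v for k, v in dir_sets.items() if k != d})
                  for d in dir_sets}
        drop = max(scores, key=scores.get)
        del dir_sets[drop]
        best = max(best, (scores[drop], sorted(dir_sets)))
    return best

if __name__ == "__main__":
    count, dirs = best_subset("testdirs")
    print(f"{count} duplicated names across: {', '.join(dirs)}")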

If anyone has a way of doing this, I would be very grateful for some advice. I thought of using fdupes or diff, but I can't figure out how to parse the output and summarise it.

5 Answers

#1


I tagged your question with algorithm as I am unaware of any existing bash / linux tools that can help you directly solve this problem. The easiest way would be to construct an algorithm for this in a programming language such as Python, C++, or Java instead of using shell scripts.

That being said, here's a high-level analysis of your problem: at first glance it looks like a minimum set cover problem, but it actually breaks down into two parts:


Part 1 - What is the set of files to cover?

You want to find the combination of directories that covers the largest number of duplicate files. But first you need to know what the maximum set of duplicate files is within your 24 directories.

Since the intersection of the files in two directories can only shrink or stay the same when you intersect with a third directory, you go through all pairs of directories and find the maximum intersection set:

(24 choose 2) = 276 comparisons

You take the largest intersection set found and use that as the set you are actually trying to cover.
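
For concreteness, a short sketch of this pairwise search, assuming again that the directories live under a single parent (the path "testdirs" is only illustrative); answer #5 below does essentially the same thing with explicit loops:

#!/usr/bin/env python3
# Find the pair of directories whose file-name listings overlap the most.

import os
from itertools import combinations

parent = "testdirs"  # placeholder path
listings = {d: set(os.listdir(os.path.join(parent, d)))
            for d in os.listdir(parent)
            if os.path.isdir(os.path.join(parent, d))}

# With 24 directories this examines comb(24, 2) = 276 candidate pairs.
a, b = max(combinations(listings, 2),
           key=lambda pair: len(listings[pair[0]] & listings[pair[1]]))
print(f"{a} and {b} share {len(listings[a] & listings[b])} file names")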


Part 2 - The minimum set cover problem

This is a well-studied problem in computer science, so you are better served reading from the writings of people much smarter than I.

The only thing I have to note is that it's an NP-complete problem, so it's not trivial.
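
For reference, the classic greedy approximation for set cover (repeatedly take the set that covers the most still-uncovered elements, which gives a logarithmic approximation factor) is short to write down. The universe and subsets below reuse the question's toy example and are purely illustrative:

#!/usr/bin/env python3
# Greedy set-cover approximation: pick, at each step, the subset covering the
# largest number of not-yet-covered elements.

def greedy_set_cover(universe, subsets):
    uncovered = set(universe)
    cover = []
    while uncovered:
        name, elems = max(subsets.items(), key=lambda kv: len(kv[1] & uncovered))
        if not elems & uncovered:
            break  # nothing left can be covered
        cover.append(name)
        uncovered -= elems
    return cover, uncovered

if __name__ == "__main__":
    # Toy data from the question: cover the duplicated names with directories.
    universe = {"1.fa", "2.fa", "3.fa", "5.fa"}
    subsets = {"dir1": {"1.fa", "2.fa", "3.fa", "4.fa", "5.fa"},
               "dir2": {"1.fa", "10.fa", "15.fa"},
               "dir3": {"1.fa", "2.fa", "3.fa"},
               "dir4": {"1.fa", "2.fa", "3.fa", "5.fa", "8.fa", "10.fa"}}
    print(greedy_set_cover(universe, subsets))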


This is the best I can do to address the original formulation of your question, but I have a feeling that it's overkill for what you actually need to accomplish. You should consider updating your question with the actual problem that you need to solve.

#2


Count duplicate file names in shell:

#! /bin/sh

# directories to test for
dirs='dir1 dir2 dir3 dir4'

# directory pairs already seen
seen=''

for d1 in $dirs; do
    for d2 in $dirs; do
        if echo $seen | grep -q -e " $d1:$d2;" -e " $d2:$d1;"; then
            : # don't count twice
        elif test $d1 != $d2; then
            # remember pair of directories
            seen="$seen $d1:$d2;"
            # count duplicates
            ndups=`ls $d1 $d2 | sort | uniq -c | awk '$1 > 1' | wc -l`
            echo "$d1:$d2 $ndups"
        fi
    done
# sort decreasing and take the first
done | sort -k 2rn | head -1
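
Assuming the script is saved as, say, pair_dups.sh (the name is arbitrary) and run from the directory containing dir1 ... dir4, it prints the best pair and its count, e.g. "dir1:dir4 4" for the example in the question.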

#3


./count_dups.sh:

1 files are duplicated Comparing dir1 to dir2.
3 files are duplicated Comparing dir1 to dir3.
4 files are duplicated Comparing dir1 to dir4.
1 files are duplicated Comparing dir2 to dir3.
2 files are duplicated Comparing dir2 to dir4.
3 files are duplicated Comparing dir3 to dir4.

./count_dups.sh | sort -n | tail -1

4 files are duplicated Comparing dir1 to dir4.

Using the script count_dups.sh:

#!/bin/bash

# This assumes (among other things) that the dirs don't have spaces in the names

cd testdirs
declare -a DIRS=(`ls`);

function count_dups {
    DUPS=`ls $1 $2 | sort | uniq -d | wc -l`
    echo "$DUPS files are duplicated comparing $1 to $2."
}

LEFT=0
while [ $LEFT -lt ${#DIRS[@]} ] ; do
    RIGHT=$(( $LEFT + 1 ))
    while [ $RIGHT -lt ${#DIRS[@]} ] ; do
        count_dups ${DIRS[$LEFT]} ${DIRS[$RIGHT]}
        RIGHT=$(( $RIGHT + 1 ))
    done
    LEFT=$(( $LEFT + 1 ))
done

#4


Can we create a hash table for all of these 24 directories? If the filename is just a number, the hash function will be very easy to design.

If we can use a hash table, it will be faster to search for and count the duplicates.
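
A minimal sketch of that idea in Python, using a dictionary (collections.Counter) as the hash table so each file name is hashed once per directory instead of being re-compared for every pair of directories; the parent path "testdirs" is again just an example:

#!/usr/bin/env python3
# One pass over all directories, counting in how many directories each file
# name occurs. "testdirs" is a placeholder path.

import os
from collections import Counter

parent = "testdirs"
name_counts = Counter()
for d in os.listdir(parent):
    full = os.path.join(parent, d)
    if os.path.isdir(full):
        name_counts.update(os.listdir(full))  # names within one dir are already unique

duplicated = [name for name, c in name_counts.items() if c > 1]
print(f"{len(duplicated)} file names occur in more than one directory")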

#5


Just out of curiosity, I've done some simple tests: 24 directories with approximately 3900 files in each (file names are random numbers between 0 and 9999). Both bash scripts take around 10 seconds each. Here is a basic Python script doing the same in ~0.2s:

#!/usr/bin/env python3

import sys, os

def get_max_duplicates(path):
    items = [(d,set(os.listdir(os.path.join(path,d)))) \
        for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
    if len(items) < 2: 
        # need at least two directories
        return ("","",0)
    values = [(items[i][0],items[j][0],len(items[i][1].intersection(items[j][1]))) \
        for i in range(len(items)) for j in range(i+1, len(items))]
    return max(values, key=lambda a: a[2])


def main():
    path = sys.argv[1] if len(sys.argv)==2 else os.getcwd()
    r = get_max_duplicates(path)
    print "%s and %s share %d files" % r

if __name__ == '__main__':
    main()
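
Invoked as, for example, python3 max_dups.py testdirs (the script and directory names are arbitrary), with the question's example directories it should print something like "dir1 and dir4 share 4 files".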

As mentioned by Richard, by using a hash table (or set in Python), we can speed things up. The intersection of two sets is O(min(len(set_a), len(set_b))), and we have to do N(N-1)/2 = 276 comparisons for N = 24.
