Comparing two files in the Linux terminal

Date: 2021-10-06 16:02:22

There are two files, "a.txt" and "b.txt", and each contains a list of words. I want to find out which words are in "a.txt" but not in "b.txt".

I need an efficient algorithm, as I need to compare two dictionaries.

8 solutions

#1


258  

If you have vim installed, try this:

vimdiff file1 file2

or

vim -d file1 file2

You will find it fantastic.

#2


50  

Sort them and use comm:

comm -23 <(sort a.txt) <(sort b.txt)

comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding column. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly; if they are already sorted, you don't need it.
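A minimal sketch of the column suppression described above. The word lists are hypothetical sample data, and the sorted copies go to temporary files instead of <(...) so that the snippet also runs in a plain POSIX sh, where process substitution is not available.

```shell
# Hypothetical sample data; the file names match the question.
tmpdir=$(mktemp -d)
printf '%s\n' pear apple fig cherry > "$tmpdir/a.txt"
printf '%s\n' cherry apple banana > "$tmpdir/b.txt"

# comm requires sorted input, so sort both files into temporary copies first.
# LC_ALL=C makes sort and comm agree on a byte-wise ordering.
LC_ALL=C sort "$tmpdir/a.txt" > "$tmpdir/a.sorted"
LC_ALL=C sort "$tmpdir/b.txt" > "$tmpdir/b.sorted"

# -2 drops the lines unique to b, -3 drops the lines common to both,
# which leaves only the lines unique to a.
only_in_a=$(LC_ALL=C comm -23 "$tmpdir/a.sorted" "$tmpdir/b.sorted")
printf '%s\n' "$only_in_a"

rm -rf "$tmpdir"
```

With this sample data the remaining lines are fig and pear, since apple and cherry appear in both files and banana is only in b.txt.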

#3


22  

You can use the diff tool on Linux to compare two files. The --changed-group-format and --unchanged-group-format options let you filter the output down to the lines you need.

The following three values can be given to each option to select which lines of a group are printed:

  • '%<' prints the lines from FILE1

  • '%>' prints the lines from FILE2

  • '' (empty string) prints nothing for that group, removing its lines from the output.

E.g.: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt

    [root@vmoracle11 tmp]# cat file1.txt
    test one
    test two
    test three
    test four
    test eight
    [root@vmoracle11 tmp]# cat file2.txt
    test one
    test three
    test nine
    [root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt
    test two
    test four
    test eight
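As a sketch of the same options in the other direction: swapping '%<' for '%>' should print the lines that are only in the second file. This assumes GNU diff (the group-format options are GNU extensions) and recreates the two sample files from the transcript in a temporary directory.

```shell
# Recreate the sample files from the transcript in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'test one' 'test two' 'test three' 'test four' 'test eight' > "$tmpdir/file1.txt"
printf '%s\n' 'test one' 'test three' 'test nine' > "$tmpdir/file2.txt"

# '%>' prints the FILE2 side of every changed group; unchanged groups print nothing.
# diff exits with status 1 when the files differ, so only a status above 1 is an error.
only_in_file2=$(diff --changed-group-format='%>' --unchanged-group-format='' \
    "$tmpdir/file1.txt" "$tmpdir/file2.txt") || [ "$?" -eq 1 ]
printf '%s\n' "$only_in_file2"

rm -rf "$tmpdir"
```

Here the only line printed is "test nine". Note that diff compares the files position by position, so unlike comm or grep it treats a word that merely moved as a change.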

#4


19  

Try sdiff (see man sdiff):

sdiff -s file1 file2

#5


9  

You can also use colordiff, which displays the output of diff with colors.

About vimdiff: it allows you to compare files over SSH, for example:

vimdiff /var/log/secure scp://192.168.1.25/var/log/secure

Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

#6


9  

If you prefer the diff output style of git diff, you can use it with the --no-index flag to compare files that are not in a git repository:

git diff --no-index a.txt b.txt

Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time command) this approach against some of the other answers here:

git diff --no-index a.txt b.txt
# ~1.2s

comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s

diff a.txt b.txt
# ~2.6s

sdiff a.txt b.txt
# ~2.7s

vimdiff a.txt b.txt
# ~3.2s

comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.

Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:

This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.

#7


4  

Use comm -13 (requires sorted files):

$ cat file1
one
two
three

$ cat file2
one
two
three
four

$ comm -13 <(sort file1) <(sort file2)
four

#8


0  

Here is my solution for this:

# Create the working directories in the home directory, where the paths below expect them
mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# Count the words in each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# Words present in both dictionaries (-F fixed strings, -x whole-line match, -f patterns from file)
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# Words unique to each dictionary (-v inverts the match)
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
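For the original question (words in a.txt that are not in b.txt), the intermediate common-words file is probably not needed: the same grep flags can be pointed at b.txt directly. A minimal sketch with hypothetical sample words, not a benchmarked recommendation:

```shell
# Hypothetical sample data; the file names match the question.
tmpdir=$(mktemp -d)
printf '%s\n' color colour grey gray > "$tmpdir/a.txt"
printf '%s\n' colour grey > "$tmpdir/b.txt"

# -f reads the patterns from b.txt, -F treats them as fixed strings,
# -x requires a whole-line match, and -v keeps only the non-matching lines of a.txt.
# grep exits with status 1 when nothing is selected, so only a status above 1 is an error.
only_in_a=$(grep -Fxvf "$tmpdir/b.txt" "$tmpdir/a.txt") || [ "$?" -eq 1 ]
printf '%s\n' "$only_in_a"

rm -rf "$tmpdir"
```

This prints color and gray, in the order they appear in a.txt, and neither input needs to be sorted.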
