如何确定几个字符串中最长的相似部分?

时间:2021-09-04 22:20:53

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.

根据标题,我试图找到一种方法来以编程方式确定几个字符串之间最长的相似性部分。

Example:

  • file:///home/gms8994/Music/t.A.T.u./
  • file:///home/gms8994/Music/nina%20sky/
  • file:///home/gms8994/Music/A%20Perfect%20Circle/

Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.

理想情况下,我会回复文件:/// home / gms8994 / Music /,因为这是所有3个字符串中最常见的部分。

Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.

具体来说,我正在寻找一个Perl解决方案,但任何语言(甚至伪语言)的解决方案都足够了。

From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.

评论:是的,仅在开头;但是有可能在列表中有一些其他条目,这个问题将被忽略。

7 个解决方案

#1


Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and @str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.

编辑:对不起,我很抱歉。我很遗憾,我监督在countit(x,q {})中使用我的变量是一个很大的错误。此字符串在Benchmark模块中进行评估,@ str在那里为空。这个解决方案没有我提出的那么快。见下面的更正。我很抱歉。

Perl can be fast:

Perl可以很快:

use strict;
use warnings;

package LCP;

sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i          = 0;
    my $first      = shift;
    my $min_length = length($first);
    foreach (@_) {
        $min_length = length($_) if length($_) < $min_length;
    }
INDEX: foreach my $ch ( split //, $first ) {
        last INDEX unless $i < $min_length;
        foreach my $string (@_) {
            last INDEX if substr($string, $i, 1) ne $ch;
        }
    }
    continue { $i++ }
    return substr $first, 0, $i;
}

# Roy's implementation
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

1;

Test suite:

#!/usr/bin/env perl

use strict;
use warnings;

Test::LCP->runtests;

package Test::LCP;

use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);

sub test_use : Test(startup => 1) {
    use_ok('LCP');
}

sub test_lcp : Test(6) {
    is( LCP::LCP(),      '',    'Without parameters' );
    is( LCP::LCP('abc'), 'abc', 'One parameter' );
    is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
    is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
        'abcd', 'Some common prefix' );
    my @str = map { chomp; $_ } <DATA>;
    is( LCP::LCP(@str),
        'file:///home/gms8994/Music/', 'Test data prefix' );
    is( LCP::LCP2(@str),
        'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
    my $t = countit( 1, sub{LCP::LCP(@str)} );
    diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
    $t = countit( 1, sub{LCP::LCP2(@str)} );
    diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}

__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

Test suite result:

测试套件结果:

1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 16746.73/s (n=17919)

That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.

这意味着使用substr的纯Perl解决方案比在测试用例中的Roy解决方案快20%,并且一个前缀发现大约需要50us。除非您的数据或性能预期更高,否则没有必要使用XS。

#2


The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.

Brett Daniel针对“最长公共子串问题”的*条目已经给出的参考是非常好的一般参考(使用伪代码),如上所述。但是,算法可以是指数的。看起来你可能真的想要一个最长公共前缀的算法,这是一个更简单的算法。

Here's the one I use for longest common prefix (and a ref to original URL):

这是我用于最长公共前缀(以及对原始URL的引用)的那个:

use strict; use warnings;
sub longest_common_prefix {
    # longest_common_prefix( $|@ ): returns $
    # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
    # find longest common prefix of scalar list
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

my @str = map {chomp; $_} <DATA>;
print longest_common_prefix(@ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).

如果您真的想要LCSS实现,请参阅PerlMonks.org上的这些讨论(最长公共子串和最长公共子序列)。 Tree :: Suffix可能是最适合您的通用解决方案,据我所知,它可以实现最佳算法。不幸的是,最近的版本被破但是,Limbic~Region在本文中PerlMonks上引用的讨论中确实存在一个工作子例程(此处随您的数据一起复制)。

#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';

use strict; use warnings;

sub LCS{
    my @str = @_;
    my @pos;
    for my $i (0 .. $#str) {
        my $line = $str[$i];
        for (0 .. length($line) - 1) {
            my $char= substr($line, $_, 1);
            push @{$pos[$i]{$char}}, $_;
        }
    }
    my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;
    my %map;
    CHAR:
    for my $char (split //, $sh_str) {
        my @loop;
        for (0 .. $#pos) {
            next CHAR if ! $pos[$_]{$char};
            push @loop, $pos[$_]{$char};
        }
        my $next = NestedLoops([@loop]);
        while (my @char_map = $next->()) {
            my $key = join '-', @char_map;
            $map{$key} = $char;
        }
    }
    my @pile;
    for my $seq (keys %map) {
        push @pile, $map{$seq};
        for (1 .. 2) {
            my $dir = $_ % 2 ? 1 : -1;
            my @offset = split /-/, $seq;
            $_ += $dir for @offset;
            my $next = join '-', @offset;
            while (exists $map{$next}) {
                $pile[-1] = $dir > 0 ?
                    $pile[-1] . $map{$next} : $map{$next} . $pile[-1];
                $_ += $dir for @offset;
                $next = join '-', @offset;
            }
        }
    }
    return reduce {length($a) > length($b) ? $a : $b} @pile;
}

my @str = map {chomp; $_} <DATA>;
print LCS(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

#3


It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.

听起来你想要k-common子串算法。它编程非常简单,是动态编程的一个很好的例子。

#4


My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.

我的第一直觉是运行一个循环,从每个字符串中取出下一个字符,直到字符不相等。保持你所在字符串中的位置的计数,然后从0到字符不相等之前的位置取一个子串(从三个字符串中的任何一个)。

In Perl, you'll have to split up the string first into characters using something like

在Perl中,你必须首先使用类似的东西将字符串拆分为字符

@array = split(//, $string);

@array = split(//,$ string);

(splitting on an empty character sets each character into its own element of the array)

(拆分空字符会将每个字符设置为其自己的数组元素)

Then do a loop, perhaps overall:

然后做一个循环,也许整体:

$n =0;
@array1 = split(//, $string1);
@array2 = split(//, $string2);
@array3 = split(//, $string3);

while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
 $n++; 
}

$sameString = substr($string1, 0, $n); #n might have to be n-1

Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.

或者至少在这些方面有所作为。请原谅我,如果这不起作用,我的Perl有点生疏。

#5


If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings. Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.

如果你谷歌搜索“最长公共子串”,你会得到一些很好的指针,一般情况下序列不必从字符串的开头开始。例如,http://en.wikipedia.org/wiki/Longest_common_substring_problem。

Mathematica happens to have a function for this built in: http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)

Mathematica碰巧有这个内置函数:http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html(注意它们意味着连续的子序列,即substring,这就是你想要的。)

If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).

如果你只关心最长的公共前缀,那么从0循环for i直到第i个字符不匹配并返回substr(s,0,i-1)应该快得多。

#6


From http://forums.macosxhints.com/showthread.php?t=33780

my @strings =
    (
      'file:///home/gms8994/Music/t.A.T.u./',
      'file:///home/gms8994/Music/nina%20sky/',
      'file:///home/gms8994/Music/A%20Perfect%20Circle/',
    );

my $common_part = undef;
my $sep = chr(0);  # assuming it's not used legitimately
foreach my $str ( @strings ) {

    # First time through loop -- set common
    # to whole
    if ( !defined $common_part ) {
        $common_part = $str;
        next;
    }

    if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
    {
        $common_part = $1;
    }
}

print "Common part = $common_part\n";

#7


Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):

比上面更快,使用perl的原生二进制xor函数,改编自perlmongers解决方案($ + [0]对我不起作用):

sub common_suffix {
    my $comm = shift @_;
    while ($_ = shift @_) {
        $_ = substr($_,-length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,-length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /(\0*)$/) {
            $comm = substr($comm, -length($1));
        } else {
            return undef;
        }
    }
    return $comm;
}


sub common_prefix {
    my $comm = shift @_;
    while ($_ = shift @_) {
        $_ = substr($_,0,length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,0,length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /^(\0*)/) {
            $comm = substr($comm,0,length($1));
        } else {
            return undef;
        }
    }
    return $comm;
}

#1


Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and @str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.

编辑:对不起,我很抱歉。我很遗憾,我监督在countit(x,q {})中使用我的变量是一个很大的错误。此字符串在Benchmark模块中进行评估,@ str在那里为空。这个解决方案没有我提出的那么快。见下面的更正。我很抱歉。

Perl can be fast:

Perl可以很快:

use strict;
use warnings;

package LCP;

sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i          = 0;
    my $first      = shift;
    my $min_length = length($first);
    foreach (@_) {
        $min_length = length($_) if length($_) < $min_length;
    }
INDEX: foreach my $ch ( split //, $first ) {
        last INDEX unless $i < $min_length;
        foreach my $string (@_) {
            last INDEX if substr($string, $i, 1) ne $ch;
        }
    }
    continue { $i++ }
    return substr $first, 0, $i;
}

# Roy's implementation
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

1;

Test suite:

#!/usr/bin/env perl

use strict;
use warnings;

Test::LCP->runtests;

package Test::LCP;

use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);

sub test_use : Test(startup => 1) {
    use_ok('LCP');
}

sub test_lcp : Test(6) {
    is( LCP::LCP(),      '',    'Without parameters' );
    is( LCP::LCP('abc'), 'abc', 'One parameter' );
    is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
    is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
        'abcd', 'Some common prefix' );
    my @str = map { chomp; $_ } <DATA>;
    is( LCP::LCP(@str),
        'file:///home/gms8994/Music/', 'Test data prefix' );
    is( LCP::LCP2(@str),
        'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
    my $t = countit( 1, sub{LCP::LCP(@str)} );
    diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
    $t = countit( 1, sub{LCP::LCP2(@str)} );
    diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}

__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

Test suite result:

测试套件结果:

1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 16746.73/s (n=17919)

That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.

这意味着使用substr的纯Perl解决方案比在测试用例中的Roy解决方案快20%,并且一个前缀发现大约需要50us。除非您的数据或性能预期更高,否则没有必要使用XS。

#2


The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.

Brett Daniel针对“最长公共子串问题”的*条目已经给出的参考是非常好的一般参考(使用伪代码),如上所述。但是,算法可以是指数的。看起来你可能真的想要一个最长公共前缀的算法,这是一个更简单的算法。

Here's the one I use for longest common prefix (and a ref to original URL):

这是我用于最长公共前缀(以及对原始URL的引用)的那个:

use strict; use warnings;
sub longest_common_prefix {
    # longest_common_prefix( $|@ ): returns $
    # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
    # find longest common prefix of scalar list
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

my @str = map {chomp; $_} <DATA>;
print longest_common_prefix(@ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).

如果您真的想要LCSS实现,请参阅PerlMonks.org上的这些讨论(最长公共子串和最长公共子序列)。 Tree :: Suffix可能是最适合您的通用解决方案,据我所知,它可以实现最佳算法。不幸的是,最近的版本被破但是,Limbic~Region在本文中PerlMonks上引用的讨论中确实存在一个工作子例程(此处随您的数据一起复制)。

#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';

use strict; use warnings;

sub LCS{
    my @str = @_;
    my @pos;
    for my $i (0 .. $#str) {
        my $line = $str[$i];
        for (0 .. length($line) - 1) {
            my $char= substr($line, $_, 1);
            push @{$pos[$i]{$char}}, $_;
        }
    }
    my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;
    my %map;
    CHAR:
    for my $char (split //, $sh_str) {
        my @loop;
        for (0 .. $#pos) {
            next CHAR if ! $pos[$_]{$char};
            push @loop, $pos[$_]{$char};
        }
        my $next = NestedLoops([@loop]);
        while (my @char_map = $next->()) {
            my $key = join '-', @char_map;
            $map{$key} = $char;
        }
    }
    my @pile;
    for my $seq (keys %map) {
        push @pile, $map{$seq};
        for (1 .. 2) {
            my $dir = $_ % 2 ? 1 : -1;
            my @offset = split /-/, $seq;
            $_ += $dir for @offset;
            my $next = join '-', @offset;
            while (exists $map{$next}) {
                $pile[-1] = $dir > 0 ?
                    $pile[-1] . $map{$next} : $map{$next} . $pile[-1];
                $_ += $dir for @offset;
                $next = join '-', @offset;
            }
        }
    }
    return reduce {length($a) > length($b) ? $a : $b} @pile;
}

my @str = map {chomp; $_} <DATA>;
print LCS(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

#3


It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.

听起来你想要k-common子串算法。它编程非常简单,是动态编程的一个很好的例子。

#4


My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.

我的第一直觉是运行一个循环,从每个字符串中取出下一个字符,直到字符不相等。保持你所在字符串中的位置的计数,然后从0到字符不相等之前的位置取一个子串(从三个字符串中的任何一个)。

In Perl, you'll have to split up the string first into characters using something like

在Perl中,你必须首先使用类似的东西将字符串拆分为字符

@array = split(//, $string);

@array = split(//,$ string);

(splitting on an empty character sets each character into its own element of the array)

(拆分空字符会将每个字符设置为其自己的数组元素)

Then do a loop, perhaps overall:

然后做一个循环,也许整体:

$n =0;
@array1 = split(//, $string1);
@array2 = split(//, $string2);
@array3 = split(//, $string3);

while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
 $n++; 
}

$sameString = substr($string1, 0, $n); #n might have to be n-1

Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.

或者至少在这些方面有所作为。请原谅我,如果这不起作用,我的Perl有点生疏。

#5


If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings. Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.

如果你谷歌搜索“最长公共子串”,你会得到一些很好的指针,一般情况下序列不必从字符串的开头开始。例如,http://en.wikipedia.org/wiki/Longest_common_substring_problem。

Mathematica happens to have a function for this built in: http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)

Mathematica碰巧有这个内置函数:http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html(注意它们意味着连续的子序列,即substring,这就是你想要的。)

If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).

如果你只关心最长的公共前缀,那么从0循环for i直到第i个字符不匹配并返回substr(s,0,i-1)应该快得多。

#6


From http://forums.macosxhints.com/showthread.php?t=33780

my @strings =
    (
      'file:///home/gms8994/Music/t.A.T.u./',
      'file:///home/gms8994/Music/nina%20sky/',
      'file:///home/gms8994/Music/A%20Perfect%20Circle/',
    );

my $common_part = undef;
my $sep = chr(0);  # assuming it's not used legitimately
foreach my $str ( @strings ) {

    # First time through loop -- set common
    # to whole
    if ( !defined $common_part ) {
        $common_part = $str;
        next;
    }

    if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
    {
        $common_part = $1;
    }
}

print "Common part = $common_part\n";

#7


Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):

比上面更快,使用perl的原生二进制xor函数,改编自perlmongers解决方案($ + [0]对我不起作用):

sub common_suffix {
    my $comm = shift @_;
    while ($_ = shift @_) {
        $_ = substr($_,-length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,-length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /(\0*)$/) {
            $comm = substr($comm, -length($1));
        } else {
            return undef;
        }
    }
    return $comm;
}


sub common_prefix {
    my $comm = shift @_;
    while ($_ = shift @_) {
        $_ = substr($_,0,length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,0,length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /^(\0*)/) {
            $comm = substr($comm,0,length($1));
        } else {
            return undef;
        }
    }
    return $comm;
}