As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.



  • file:///home/gms8994/Music/t.A.T.u./
  • file:///home/gms8994/Music/nina%20sky/
  • file:///home/gms8994/Music/A%20Perfect%20Circle/

Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.

Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.


From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.


7 solutions


Edit: I apologize for a mistake. I overlooked that using my variable inside countit(x, q{}) was a big mistake. That string is evaluated inside the Benchmark module, where @str was empty. This solution is not as fast as I originally presented. See the correction below. Sorry again.

Perl can be fast:


use strict;
use warnings;

package LCP;

sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i          = 0;
    my $first      = shift;
    my $min_length = length($first);
    foreach (@_) {
        $min_length = length($_) if length($_) < $min_length;
    }
INDEX: foreach my $ch ( split //, $first ) {
        last INDEX unless $i < $min_length;
        foreach my $string (@_) {
            last INDEX if substr($string, $i, 1) ne $ch;
        }
    }
    continue { $i++ }
    return substr $first, 0, $i;
}

# Roy's implementation
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
    }
    return $prefix;
}

1;


Test suite:

#!/usr/bin/env perl

use strict;
use warnings;


package Test::LCP;

use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);

sub test_use : Test(startup => 1) {
    use_ok('LCP');
}

sub test_lcp : Test(6) {
    is( LCP::LCP(),      '',    'Without parameters' );
    is( LCP::LCP('abc'), 'abc', 'One parameter' );
    is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
    is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
        'abcd', 'Some common prefix' );
    my @str = map { chomp; $_ } <DATA>;
    is( LCP::LCP(@str),
        'file:///home/gms8994/Music/', 'Test data prefix' );
    is( LCP::LCP2(@str),
        'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
    my $t = countit( 1, sub{LCP::LCP(@str)} );
    diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
    $t = countit( 1, sub{LCP::LCP2(@str)} );
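    diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}

Test::Class->runtests;

__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/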


Test suite result:


ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 16746.73/s (n=17919)

This means that the pure Perl solution using substr is about 20% faster than Roy's solution on your test case, and finding one prefix takes about 50 µs. There is no need to use XS unless your data or your performance expectations are larger.



The reference already given by Brett Daniel to the Wikipedia entry on the "Longest common substring problem" is a very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. It looks like you might actually want an algorithm for the longest common prefix, which is much simpler.

Here's the one I use for longest common prefix (and a ref to original URL):


use strict; use warnings;
sub longest_common_prefix {
    # longest_common_prefix( $|@ ): returns $
    # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
    # find longest common prefix of scalar list
    my $prefix = shift;
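    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
    }
    return $prefix;
}

my @str = map {chomp; $_} <DATA>;
print longest_common_prefix(@str), "\n";   # pass the strings read from DATA, not @ARGV
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/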

If you truly want an LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately, recent builds are broken. However, a working subroutine does exist within the discussions referenced on PerlMonks, in this post by Limbic~Region (reproduced here with your data).

#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';

use strict; use warnings;

sub LCS{
    my @str = @_;
    my @pos;
    # For every string, record the positions at which each character occurs
    for my $i (0 .. $#str) {
        my $line = $str[$i];
        for (0 .. length($line) - 1) {
            my $char= substr($line, $_, 1);
            push @{$pos[$i]{$char}}, $_;
        }
    }
    my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;
    my %map;
    CHAR:
    for my $char (split //, $sh_str) {
        my @loop;
        for (0 .. $#pos) {
            next CHAR if ! $pos[$_]{$char};
            push @loop, $pos[$_]{$char};
        }
        my $next = NestedLoops([@loop]);
        while (my @char_map = $next->()) {
            my $key = join '-', @char_map;
            $map{$key} = $char;
        }
    }
    my @pile;
    for my $seq (keys %map) {
        push @pile, $map{$seq};
        # Extend each seed character forwards (dir 1), then backwards (dir -1)
        for (1 .. 2) {
            my $dir = $_ % 2 ? 1 : -1;
            my @offset = split /-/, $seq;
            $_ += $dir for @offset;
            my $next = join '-', @offset;
            while (exists $map{$next}) {
                $pile[-1] = $dir > 0 ?
                    $pile[-1] . $map{$next} : $map{$next} . $pile[-1];
                $_ += $dir for @offset;
                $next = join '-', @offset;
            }
        }
    }
    return reduce {length($a) > length($b) ? $a : $b} @pile;
}

my @str = map {chomp; $_} <DATA>;
print LCS(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/


It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
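To make the dynamic programming idea concrete, here is a minimal sketch for the simpler two-string case (not the full k-string version). The function name longest_common_substring is my own, chosen for illustration; it tracks, for each pair of positions, the length of the longest common run ending there.

```perl
use strict;
use warnings;

# Longest common substring of TWO strings by dynamic programming.
# $prev[$j] is the length of the longest common run that ends at
# character $i - 1 of $s and character $j of $t (1-based positions).
sub longest_common_substring {
    my ($s, $t) = @_;
    my ($best_len, $best_end) = (0, 0);
    my @prev = (0) x (length($t) + 1);
    for my $i (1 .. length $s) {
        my @curr = (0) x (length($t) + 1);
        for my $j (1 .. length $t) {
            next unless substr($s, $i - 1, 1) eq substr($t, $j - 1, 1);
            # Matching characters extend the run that ended one step back
            $curr[$j] = $prev[$j - 1] + 1;
            if ($curr[$j] > $best_len) {
                ($best_len, $best_end) = ($curr[$j], $i);
            }
        }
        @prev = @curr;
    }
    return substr($s, $best_end - $best_len, $best_len);
}

print longest_common_substring('xabcdy', 'zzabcdq'), "\n";   # prints abcd
```

This takes time proportional to the product of the two lengths. One simple (though not always correct) way to handle more than two strings is to fold the result pairwise, so treat that as an approximation rather than the k-common substring algorithm itself.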



My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.


In Perl, you'll have to split up the string first into characters using something like


@array = split(//, $string);

(splitting on an empty pattern puts each character into its own element of the array)


Then do a loop, perhaps overall:


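$n = 0;
@array1 = split(//, $string1);
@array2 = split(//, $string2);
@array3 = split(//, $string3);

# Compare with eq (string comparison, not ==) and stop at the end of the shortest string
while ($n < @array1 && $n < @array2 && $n < @array3
       && $array1[$n] eq $array2[$n] && $array2[$n] eq $array3[$n]) {
    $n++;
}

$sameString = substr($string1, 0, $n); # $n is the number of matching characters, so no -1 is needed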

Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.



If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings. Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.


Mathematica happens to have a function for this built in: http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)


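If you only care about the longest common prefix then it should be much faster to just loop i from 0 until the ith characters don't all match and return the first i characters (positions 0 through i-1), which in Perl is substr(s, 0, i) because the third argument is a length.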
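A minimal sketch of that index loop for any number of strings might look like this. The name common_prefix_by_index is made up for this example, and the off-by-one is handled by letting $i count matched characters:

```perl
use strict;
use warnings;

# Advance $i while every string has the same character at position $i.
sub common_prefix_by_index {
    my @strings = @_;
    return '' unless @strings;
    my $first = $strings[0];
    my $i = 0;
    SCAN: while ($i < length $first) {
        my $ch = substr($first, $i, 1);
        for my $s (@strings) {
            # Stop at the end of a shorter string or at the first mismatch
            last SCAN if $i >= length($s) || substr($s, $i, 1) ne $ch;
        }
        $i++;
    }
    # $i is now the number of leading characters shared by all strings
    return substr($first, 0, $i);
}

print common_prefix_by_index(
    'file:///home/gms8994/Music/t.A.T.u./',
    'file:///home/gms8994/Music/nina%20sky/',
    'file:///home/gms8994/Music/A%20Perfect%20Circle/',
), "\n";   # prints file:///home/gms8994/Music/
```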


From http://forums.macosxhints.com/showthread.php?t=33780

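my @strings = (   # the three strings from the question
    'file:///home/gms8994/Music/t.A.T.u./',
    'file:///home/gms8994/Music/nina%20sky/',
    'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);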

my $common_part = undef;
my $sep = chr(0);  # assuming it's not used legitimately
foreach my $str ( @strings ) {

    # First time through loop -- set common
    # to whole
    if ( !defined $common_part ) {
        $common_part = $str;
        next;
    }

    # The backreference \1 forces the captured prefix to start both strings
    if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/) {
        $common_part = $1;
    }
}

print "Common part = $common_part\n";


Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):

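sub common_suffix {
    my $comm = shift @_;
    while (defined($_ = shift @_)) {   # defined() so that '' or '0' does not end the loop early
        $_ = substr($_,-length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,-length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /(\0*)$/) {
            # substr($comm, -0) would return the whole string, so handle "nothing in common" explicitly
            $comm = length($1) ? substr($comm, -length($1)) : '';
        } else {
            return undef;
        }
    }
    return $comm;
}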

sub common_prefix {
    my $comm = shift @_;
    while (defined($_ = shift @_)) {   # defined() so that '' or '0' does not end the loop early
        $_ = substr($_,0,length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,0,length($_)) if (length($_) < length($comm));
        # Equal bytes xor to NUL, so the run of leading NULs is the common prefix
        if (( $_ ^ $comm ) =~ /^(\0*)/) {
            $comm = substr($comm,0,length($1));
        } else {
            return undef;
        }
    }
    return $comm;
}
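To see why the xor trick works, here is a small sketch for just two strings. The helper name shared_prefix_length is hypothetical, and I'm assuming plain byte strings (characters above 0xFF would need encoding first). Xor-ing two equal bytes gives a NUL byte, so the number of leading NULs in the result is the number of leading bytes the strings share.

```perl
use strict;
use warnings;

# Count the leading bytes two strings have in common using string xor.
sub shared_prefix_length {
    my ($x, $y) = @_;
    # Trim both to the shorter length so padding cannot affect the result
    my $n = length($x) < length($y) ? length($x) : length($y);
    my $diff = substr($x, 0, $n) ^ substr($y, 0, $n);
    # Equal bytes became "\0"; capture the leading run of them
    my ($nuls) = $diff =~ /^(\0*)/;
    return length $nuls;
}

print shared_prefix_length('Music/nina', 'Music/t.A.T.u.'), "\n";   # prints 6
```

Both operands must be strings for ^ to act bytewise; if either one has been used as a number, Perl does a numeric xor instead.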


