2模式字符串匹配算法

时间:2022-09-13 07:43:41

I need to code an algorithm for longest two pattern prefix/suffix match whose time complexity is O(n+m1+m2), where n is the length of the String and m1, m2 are the lengths of pattern1 and pattern2 respectively.

我需要为最长的两个模式前缀/后缀匹配编码算法,其时间复杂度为O(n + m1 + m2),其中n是String的长度,m1,m2分别是pattern1和pattern2的长度。

Example: if the String is "OBANTAO" and Pattern1 is "BANANA" and Patten2 is "SIESTA" then the answer is the substring "BANTA" of String that consists of the prefix BAN of BANANA and the suffix TA of SIESTA.

示例:如果字符串为“OBANTAO”且Pattern1为“BANANA”且Patten2为“SIESTA”,则答案为字符串的子字符串“BANTA”,其由BANANA的前缀BAN和SIESTA的后缀TA组成。

Results from Google are: "Rabin-karp string search algorithm", "Knuth-morris-pratt algorithm" and "Boyer-moore string search algorithm".

谷歌的结果是:“Rabin-karp字符串搜索算法”,“Knuth-morris-pratt算法”和“Boyer-moore字符串搜索算法”。

I'm able to understand all the above 3 algorithms, but the problem is that, they are all based on "single pattern prefix/suffix matching". I'm unable to extend them for two pattern prefix/suffix match.

我能够理解以上3种算法,但问题是,它们都是基于“单一模式前缀/后缀匹配”。我无法为两个模式前缀/后缀匹配扩展它们。

A sample algorithm or a link towards searching it would be greatly helpful for me in developing the program.

一个示例算法或搜索它的链接对我开发程序非常有帮助。

2 个解决方案

#1


2  

Knuth--Morris--Pratt can be modified straightforwardly to yield, for each position in the haystack string, the length of the longest prefix of the needle string with a match ending at that position. Use KMP to compute this information for Pat1 in String and reverse(Pat2) in reverse(String), then iterate over each position in String looking for the maximum prefix/suffix length.

Knuth的 - 莫里斯 - 普拉特可以修改直截了当屈服,用于在草堆串中的每个位置,针串在该位置结束的匹配的最长前缀的长度。使用KMP计算该信息用于在PAT1字符串和反向(PAT2)反向(字符串),然后迭代在串中的每个位置寻找最大前缀/后缀长度。

Example with String = "OBANTAO" and Pat1 = "BANANA" and Pat2 = "SIESTA":

String =“OBANTAO”和Pat1 =“BANANA”以及Pat2 =“SIESTA”的示例:

"BANANA" = Pat1 in String
 O B A N T A O
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| | | | | | | 0 ("")
| | | | | | 0 ("")
| | | | | 0 ("")
| | | | 3 ("BAN")
| | | 2 ("BA")
| | 1 ("B")
| 0 ("")
0 ("")

"ATSEIS" = reverse(Pat2) in reverse(String)
 O A T N A B O
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| | | | | | | 0 ("")
| | | | | | 0 ("")
| | | | | 1 ("A")
| | | | 0 ("")
| | | 2 ("AT")
| | 1 ("A")
| 0 ("")
0 ("")

Reverse the second array and sum componentwise.

反转第二个数组并按行分组。

  0 0 1 2 3 0 0 0
+ 0 0 1 0 2 1 0 0
-----------------
  0 0 2 2 5 1 0 0
          ^
          |
         max (argument = "BAN" ++ reverse("AT"))

#2


0  

I tried to Implement @David Eisenstat solution in Java. It is in O(2n) time and O(2(n+1)) Auxiliary Spaces

我尝试用Java实现@David Eisenstat解决方案。它在O(2n)时间和O(2(n + 1))辅助空间

String prefix = "BANANA";
    String suffix = "SIESTA";
    String input = "SABANANQS";

    // Prepare Inputs
    char[] prefixArray = prefix.toCharArray();
    char[] suffixArray = suffix.toCharArray();
    char[] inputArray = input.toCharArray();
    int inputLength = inputArray.length;
    int suffixLength = suffixArray.length;
    int prefixLength = prefixArray.length;
    // Auxiliary Spaces O(2(n+1))
    int[] prefixIndexs = new int[inputLength+1];
    int[] suffixIndexs = new int[inputLength+1];

    int m = 0;
    int n = 0;
    // O(1)
    for (int i = 0; i < inputLength; i++) {
        if (inputArray[i] == prefixArray[m]){
            m = m+1;
            prefixIndexs[i+1] = m;
            if (m == prefixLength) {
                m = 0;
            }
        }else{
            m = 0;
        }
        if (inputArray[inputLength-1-i] == suffixArray[suffixLength-1-n]){   // Reverse order or input and reverse oder of suffix
            n = n +1;
            suffixIndexs[i+1] = n;
            if (n == suffixLength) {
                n = 0;
            }
        }else{
            n = 0;
        }
    }

    int currmax =0;
    int mIndex = 0; // prefix Index from start
    int nIndex = 0; // suffix Index from End
    //O(1)  - Do Sum and find the max
    for (int i = 0; i < (inputLength+1); i++) {
        m = prefixIndexs[i];
        n = suffixIndexs[inputLength-i];
        if ( m != 0 && n != 0){  // if both prefix and suffix exists
            if (m+n > currmax){
                currmax = (m+n);
                mIndex = m;
                nIndex = n;
            }
        }
    }

    System.out.println("Input :"+input);
    System.out.println("prefix :"+prefix);
    System.out.println("suffix :"+suffix);
    System.out.println("max :"+currmax);
    System.out.println("mIndex :"+mIndex);
    System.out.println("nIndex :"+nIndex);
    System.out.println(prefix.substring(0,mIndex)+suffix.substring(suffix.length() - nIndex,suffix.length()));

Instead of reset m and n to 0, we can keep an another array for each to implement KMP algorithm , since the input prefix and sufix doesnt has repetitive char sequence i left it

而不是将m和n重置为0,我们可以为每个实现KMP算法保留另一个数组,因为输入前缀和sufix没有重复的char序列我离开了它

#1


2  

Knuth--Morris--Pratt can be modified straightforwardly to yield, for each position in the haystack string, the length of the longest prefix of the needle string with a match ending at that position. Use KMP to compute this information for Pat1 in String and reverse(Pat2) in reverse(String), then iterate over each position in String looking for the maximum prefix/suffix length.

Knuth的 - 莫里斯 - 普拉特可以修改直截了当屈服,用于在草堆串中的每个位置,针串在该位置结束的匹配的最长前缀的长度。使用KMP计算该信息用于在PAT1字符串和反向(PAT2)反向(字符串),然后迭代在串中的每个位置寻找最大前缀/后缀长度。

Example with String = "OBANTAO" and Pat1 = "BANANA" and Pat2 = "SIESTA":

String =“OBANTAO”和Pat1 =“BANANA”以及Pat2 =“SIESTA”的示例:

"BANANA" = Pat1 in String
 O B A N T A O
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| | | | | | | 0 ("")
| | | | | | 0 ("")
| | | | | 0 ("")
| | | | 3 ("BAN")
| | | 2 ("BA")
| | 1 ("B")
| 0 ("")
0 ("")

"ATSEIS" = reverse(Pat2) in reverse(String)
 O A T N A B O
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| | | | | | | 0 ("")
| | | | | | 0 ("")
| | | | | 1 ("A")
| | | | 0 ("")
| | | 2 ("AT")
| | 1 ("A")
| 0 ("")
0 ("")

Reverse the second array and sum componentwise.

反转第二个数组并按行分组。

  0 0 1 2 3 0 0 0
+ 0 0 1 0 2 1 0 0
-----------------
  0 0 2 2 5 1 0 0
          ^
          |
         max (argument = "BAN" ++ reverse("AT"))

#2


0  

I tried to Implement @David Eisenstat solution in Java. It is in O(2n) time and O(2(n+1)) Auxiliary Spaces

我尝试用Java实现@David Eisenstat解决方案。它在O(2n)时间和O(2(n + 1))辅助空间

String prefix = "BANANA";
    String suffix = "SIESTA";
    String input = "SABANANQS";

    // Prepare Inputs
    char[] prefixArray = prefix.toCharArray();
    char[] suffixArray = suffix.toCharArray();
    char[] inputArray = input.toCharArray();
    int inputLength = inputArray.length;
    int suffixLength = suffixArray.length;
    int prefixLength = prefixArray.length;
    // Auxiliary Spaces O(2(n+1))
    int[] prefixIndexs = new int[inputLength+1];
    int[] suffixIndexs = new int[inputLength+1];

    int m = 0;
    int n = 0;
    // O(1)
    for (int i = 0; i < inputLength; i++) {
        if (inputArray[i] == prefixArray[m]){
            m = m+1;
            prefixIndexs[i+1] = m;
            if (m == prefixLength) {
                m = 0;
            }
        }else{
            m = 0;
        }
        if (inputArray[inputLength-1-i] == suffixArray[suffixLength-1-n]){   // Reverse order or input and reverse oder of suffix
            n = n +1;
            suffixIndexs[i+1] = n;
            if (n == suffixLength) {
                n = 0;
            }
        }else{
            n = 0;
        }
    }

    int currmax =0;
    int mIndex = 0; // prefix Index from start
    int nIndex = 0; // suffix Index from End
    //O(1)  - Do Sum and find the max
    for (int i = 0; i < (inputLength+1); i++) {
        m = prefixIndexs[i];
        n = suffixIndexs[inputLength-i];
        if ( m != 0 && n != 0){  // if both prefix and suffix exists
            if (m+n > currmax){
                currmax = (m+n);
                mIndex = m;
                nIndex = n;
            }
        }
    }

    System.out.println("Input :"+input);
    System.out.println("prefix :"+prefix);
    System.out.println("suffix :"+suffix);
    System.out.println("max :"+currmax);
    System.out.println("mIndex :"+mIndex);
    System.out.println("nIndex :"+nIndex);
    System.out.println(prefix.substring(0,mIndex)+suffix.substring(suffix.length() - nIndex,suffix.length()));

Instead of reset m and n to 0, we can keep an another array for each to implement KMP algorithm , since the input prefix and sufix doesnt has repetitive char sequence i left it

而不是将m和n重置为0,我们可以为每个实现KMP算法保留另一个数组,因为输入前缀和sufix没有重复的char序列我离开了它