字符串匹配——BMH算法

给定主串T和模式串P，返回P在T中首次出现的位置，如果P不存在于T中，返回-1。

这样的问题就是字符串匹配问题，这里给出BMH算法的思想。

设主串T的长度为n,模式串P的长度为m。

BMH(Boyer-Moore-Horspool)算法是BM(Boyer-Moore)算法的一种优化，根据《一种基于BMH算法的模式匹配算法》的分析，BMH算法要优于BM算法，BM算法的思想可以参考字符串匹配的Boyer-Moore算法。BM算法的核心在于两个启发式算法，一个叫做坏字符(bad character)，一个叫做好后缀(good suffix)。BMH它不再像BM算法一样关注失配的字符(好后缀)，它的关注的焦点在于这个坏字符上，根据这个字符在模式串P中出现的最后的位置算出偏移长度，否则偏移模式串的长度。1

偏移表

在预处理中，计算大小为 ∣∣∑∣∣ 的偏移表。

$s h i f t [w] = {m - 1 - m a x {i < m - 1 | P [i] = w} m if w is in P [0.. m - 2] otherwise$

例如： P = “pappar”
m = 6
shift[p] = 6 - 1 - max(p的位置) = 6 - 1 - 3 = 2
shift[a] = 6 - 1 - max(a的位置) = 6 - 1 - 4 = 1

伪代码

BMH(T, P)
01 n <- the length of T
02 m <- the length of P
03 // computer the shift table for P
04 for c <- 0 to the length of OffsetTable - 1
05  shift[c] <- m   // default values
06 for k <- 0 to m - 2
07  shift[P[k]] <- m - 1 - k
08 // search
09 s <- 0
10 while s ≤ n - m do
11  j <- s - 1 // start from the end
12  // check if T[s..s+m-1] = P[0..m-1]
13  while T[s+j] = P[j] do
14      j <- j - 1
15      if j < 0 then return s
16  s <- s + shift[T[s + m - 1]]    // shift by last letter
17 return -1

算法实现

const int maxNum = 1005;
int shift[maxNum];
int BMH(const string& T, const string& P) {
int n = T.length();
int m = P.length();

// 默认值，主串左移m位
for(int i = 0; i < maxNum; i++) {
        shift[i] = m;
    }

// 模式串P中每个字母出现的最后的下标，最后一个字母除外
// 主串从不匹配最后一个字符，所需要左移的位数
for(int i = 0; i < m - 1; i++) {
        shift[P[i]] = m - i - 1;
    }

// 模式串开始位置在主串的哪里
int s = 0;
// 从后往前匹配的变量
int j;
while(s <= n - m) {
        j = m - 1;
// 从模式串尾部开始匹配
while(T[s + j] == P[j]) {
            j--;
// 匹配成功
if(j < 0) {
return s;
            }
        }
// 找到坏字符(当前跟模式串匹配的最后一个字符)
// 在模式串中出现最后的位置(最后一位除外)
// 所需要从模式串末尾移动到该位置的步数
        s = s + shift[T[s + m - 1]];
    }
return -1;
}

测试主程序

#include <iostream>
#include <string>

using namespace std;

const int maxNum = 1005;

int shift[maxNum];
int BMH(const string& T, const string& P) {
int n = T.length();
int m = P.length();

// 默认值，主串左移m位
for(int i = 0; i < maxNum; i++) {
        shift[i] = m;
    }

// 模式串P中每个字母出现的最后的下标，最后一个字母除外
// 主串从不匹配最后一个字符，所需要左移的位数
for(int i = 0; i < m - 1; i++) {
        shift[P[i]] = m - i - 1;
    }

// 模式串开始位置在主串的哪里
int s = 0;
// 从后往前匹配的变量
int j;
while(s <= n - m) {
        j = m - 1;
// 从模式串尾部开始匹配
while(T[s + j] == P[j]) {
            j--;
// 匹配成功
if(j < 0) {
return s;
            }
        }
// 找到坏字符(当前跟模式串匹配的最后一个字符)
// 在模式串中出现最后的位置(最后一位除外)
// 所需要从模式串末尾移动到该位置的步数
        s = s + shift[T[s + m - 1]];
    }
return -1;
}


/**
IN
at the thought of
though

OUT
7
**/
int main() {
// 主串和模式串
string T, P;
while(true) {
// 获取一行
        getline(cin, T);
        getline(cin, P);
int res = BMH(T, P);
if(res == -1) {
cout << "主串和模式串不匹配。" << endl;
        } else {
cout << "模式串在主串的位置为：" << res << endl;
        }
    }
return 0;
}

输出数据

at the thought of
though
模式串在主串的位置为：7
dsadasdasa
sa
模式串在主串的位置为：1
ffsafa
fa
模式串在主串的位置为：4
asdhgad
D
主串和模式串不匹配。
FFADSFAFffdsf
SF
模式串在主串的位置为：4
aaaaaaab
aaa
模式串在主串的位置为：0
aaaaab
ab
模式串在主串的位置为：4

算法分析

最坏情况运行时间

预处理：O( ∣∣∑∣∣ +m)
搜索：O(nm)
总计：O(nm)

空间： ∣∣∑∣∣ ，和m独立

该算法在真实数据集合上非常快

对于BMH算法的理解——文本匹配算法 ↩

秒客网

字符串匹配——BMH算法

字符串匹配——BMH算法

偏移表

伪代码

算法实现

测试主程序

输出数据

算法分析

相关文章