What algorithm can you use to find duplicate phrases in a string?

Time: 2022-10-29 22:07:45

Given an arbitrary string, what is an efficient method of finding duplicate phrases? We can say that phrases must be longer than a certain length to be included.


Ideally, you would end up with the number of occurrences for each phrase.


5 Answers

#1


4  

As the earlier folks mention, a suffix tree is the best tool for the job. My favorite site for suffix trees is http://www.allisons.org/ll/AlgDS/Tree/Suffix/. It enumerates all the nifty uses of suffix trees on one page and has a test JS application embedded so you can test strings and work through examples.

#2


7  

In theory

  • A suffix array is the 'best' answer, since it can be implemented to use linear space and time to detect any duplicate substrings. However, the naive implementation actually takes O(n^2 log n) time to sort the suffixes, and it's not completely obvious how to reduce this to O(n log n), let alone O(n), although you can read the related papers if you want to.

  • A suffix tree can take slightly more memory (still linear, though) than a suffix array, but it is easier to implement to build quickly, since you can use something like a radix-sort idea as you add things to the tree (see the Wikipedia link from the name for details).

  • The KMP algorithm is also good to be aware of; it is specialized for searching for a particular substring within a longer string very quickly. If you only need this special case, just use KMP and don't bother building an index of suffixes first.
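As a rough illustration of the suffix-array idea, here is a sketch that uses exactly the naive sort the first bullet warns about (fine for short strings, not for large inputs). The function name is mine, not from any library; it reports only the longest repeated substring, not counts:

```python
def longest_repeated_substring(s):
    """Naive suffix-array approach: sort all suffixes, then the longest
    repeat is the longest common prefix of some adjacent sorted pair."""
    n = len(s)
    # Naive O(n^2 log n) suffix sort: each comparison can scan O(n) chars.
    suffixes = sorted(range(n), key=lambda i: s[i:])
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Longest common prefix of two adjacent suffixes.
        k = 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        if k > len(best):
            best = s[a:a + k]
    return best

longest_repeated_substring("banana")  # -> "ana"
```

A production version would replace the naive sort with an O(n log n) suffix-array construction and a linear-time LCP array.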

In practice

I'm guessing you're analyzing a document of actual natural language (e.g. English) words, and you actually want to do something with the data you collect.


In this case, you might just want to do a quick n-gram analysis for some small n, such as n = 2 or 3. For example, you could tokenize your document into a list of words by stripping out punctuation and capitalization, and by stemming words (running, runs -> 'run') to increase semantic matches. Then just build a hash map (such as hash_map in C++, a dictionary in Python, etc.) from each adjacent pair of words to its number of occurrences so far. In the end you get some very useful data that was very fast to code, and not crazy slow to run.
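A minimal sketch of that bigram idea in Python (the tokenization here is my own crude simplification; stemming is omitted):

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs after a crude normalization pass
    (lowercase, keep only letters/apostrophes). No stemming shown."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

counts = bigram_counts("The cat sat. The cat ran.")
# counts[("the", "cat")] == 2
```

Generalizing to n = 3 just means zipping three shifted copies of the word list instead of two.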

#3


1  

Suffix trees are a good way to implement this. The bottom of that article has links to implementations in different languages.


#4


0  

Like jmah said, you can use suffix trees/suffix arrays for this.


There is a description of an algorithm you could use here (see Section 3.1).


You can find a more in-depth description in the book they cite (Gusfield, 1997), which is on Google Books.


#5


0  

Suppose you are given a sorted array A with n entries (i = 1, 2, 3, ..., n).

def mark_duplicates(A):
    # A is sorted, so equal values sit next to each other:
    # a single left-to-right scan finds every duplicate.
    duplicates = []
    for i in range(len(A) - 1):
        if A[i] == A[i + 1]:
            duplicates.append(A[i])   # A[i] and A[i+1] are duplicates
    return duplicates

This algorithm runs in O(n) time (not counting the cost of sorting the array first).
