C# code/algorithm to search for terms in text

Time: 2022-09-13 08:27:16

We have 5mb of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.

What's the most efficient way to do this in .NET (ideally C#)?

Our ideas include regexes (a single one, or lots of them) plus even the String.Contains stuff.

The input is a 2mb to 5mb text string - all text. Multiple hits are good: for each term (of the 1000) that matches, we do want to know about it. Performance means total time to execute; we don't care about memory footprint. The current algorithm takes about 60 seconds+ using a naive String.Contains. We don't want 'cat' to match 'category' or even 'cats' (i.e. the entire term word must hit, no stemming).

We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (we don't need position or frequency just yet). We get a new 2-5mb string every 10 seconds, so we can't assume we can index the input. The 1000 terms are dynamic, although they change at a rate of about one change an hour.
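
For reference, a rough sketch of what the naive String.Contains loop described above might look like (the names `bigText` and `terms` are placeholders, not from our actual code):

using System.Collections.Generic;

static List<string> NaiveSearch(string bigText, List<string> terms)
{
    var hits = new List<string>();
    foreach (var term in terms)
    {
        // Plain substring check - this is the 60+ second approach, and it also
        // wrongly counts 'cat' as a hit inside 'category'.
        if (bigText.Contains(term))
            hits.Add(term);
    }
    return hits;
}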

9 solutions

#1


A naive string.Contains search of War and Peace (3.13MB) for 762 words (the book's final page) runs in about 10s for me. Switching to 1000 GUIDs as the search terms runs in about 5.5 secs.

Regex.IsMatch found the 762 words (many of which were probably in earlier pages as well) in about .5 seconds, and ruled out the GUIDs in 2.5 seconds.
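
A rough sketch of the kind of timing comparison described, assuming `text` holds the loaded book and `terms` holds the search words (these names are illustrative, not from the original test):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

static void CompareApproaches(string text, List<string> terms)
{
    var sw = Stopwatch.StartNew();
    int containsHits = terms.Count(t => text.Contains(t));
    Console.WriteLine($"string.Contains: {containsHits} hits in {sw.Elapsed}");

    sw.Restart();
    int regexHits = terms.Count(t => Regex.IsMatch(text, Regex.Escape(t)));
    Console.WriteLine($"Regex.IsMatch:   {regexHits} hits in {sw.Elapsed}");
}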

I'd suggest your problem lies elsewhere...Or you just need some decent hardware.

#2


Why reinvent the wheel? Why not just leverage something like Lucene.NET?

#3


Have you considered the following:

  1. Do you care about substrings? Let's say I am looking for the word "cat", nothing more and nothing less. Now consider the Knuth-Morris-Pratt algorithm, or string.Contains, applied to "concatenate": both of these will return true (or an index). Is this OK?

  2. Also, you will have to look into the idea of the stemmed or "finite" state of a word. Let's look for "diary" again; the test sentence is "there are many kinds of diaries". To you and me the word "diaries" is clearly there, but does it count? If so, we would need to preprocess the sentence, converting the words to their finite state (diaries -> diary), so the sentence becomes "there are many kinds of diary". Now we can say that "diary" is in the sentence (please look at the Porter Stemmer algorithm).

  3. Also, when it comes to processing text (aka Natural Language Processing) you can remove some words as noise; take for example "a, have, you, I, me, some, to" <- these could be considered useless words and can be removed before any processing takes place. For example:

"I have written some C# today", if i have 10,000 key works to look for I would have to scan the entire sentance 10,000 x the number of words in the sentance. removing noise before hand will shorting the processing time

"written C# today" <- removed noise, now there are lots less to look throught.

A great article on NLP can be found here: Sentence comparing.

HTH

Bones

#4


A modified suffix tree would be very fast, though it would take up a lot of memory and I don't know how fast it would be to build. After that, however, every search would take O(1).

#5


Here's another idea: Make a class something like this:

class Word
{
    public string Text;          // the word itself
    public List<int> Positions;  // word offsets (not character offsets) where it occurs
}

For every unique word in your text you create an instance of this class. The Positions list stores the positions (counted in words, not characters) from the start of the text where this word was found.

Then make another two lists which will serve as indexes. One stores all these objects sorted by their text, the other by their positions in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.

Then, to search for a phrase, you split that phrase into words. Look up the first word in the dictionary (that's O(log(n))). From there you know which words can follow it in the text (you have them from the Positions list). Look at those words (use the position index to find them in O(1)) and go on, until you've found one or more full matches.
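
A minimal sketch of this structure, using a plain Dictionary for the word lookup instead of the SortedDictionary suggested above, and a string array as the position index (the class and member names are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class WordIndex
{
    private readonly string[] wordsByPosition;                 // position index
    private readonly Dictionary<string, List<int>> positions;  // word -> word positions

    public WordIndex(string text)
    {
        wordsByPosition = text.Split(new[] { ' ', '\t', '\r', '\n' },
                                     StringSplitOptions.RemoveEmptyEntries);
        positions = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < wordsByPosition.Length; i++)
        {
            if (!positions.TryGetValue(wordsByPosition[i], out var list))
                positions[wordsByPosition[i]] = list = new List<int>();
            list.Add(i);
        }
    }

    // True if every word of the phrase appears consecutively in the text.
    public bool ContainsPhrase(string phrase)
    {
        var words = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0 || !positions.TryGetValue(words[0], out var starts))
            return false;

        return starts.Any(start =>
            start + words.Length <= wordsByPosition.Length &&
            Enumerable.Range(0, words.Length).All(k =>
                string.Equals(words[k], wordsByPosition[start + k],
                              StringComparison.OrdinalIgnoreCase)));
    }
}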

#6


Are you trying to get a list of matched words, or are you trying to highlight them in the text by getting the start and length of each match? If all you're trying to do is find out whether the words exist, then you could use subset theory to perform this fairly efficiently.

However, I expect you're trying to find each match's start position in the text... in which case this approach wouldn't work.

The most efficient approach I can think of is to dynamically build a match pattern from the list and then use regex. It's far easier to maintain a list of 1000 items than it is to maintain a regex pattern based on those same 1000 items.
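
A minimal sketch of building such a pattern dynamically from the term list (the Regex.Escape call and \b word-boundary anchors are my additions, to match the whole-word requirement in the question):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static Regex BuildTermRegex(IEnumerable<string> terms)
{
    // Escape each term, wrap it in word boundaries, then join with alternation.
    string pattern = string.Join("|",
        terms.Select(t => @"\b" + Regex.Escape(t) + @"\b"));

    return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
}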

It is my understanding that Regex uses the KMP algorithm suggested above to process large amounts of data efficiently - so unless you really need to dig through and understand the minutiae of how it works (which might be beneficial for personal growth), regex would probably be OK.

There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf

#7


Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search in. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternations. This basically amortizes the overall cost of the search to O(n + m), where n is the length of your text and m is the combined length of all patterns. Notice that this is very good performance.

An alternative that is well suited to many patterns is the Wu-Manber algorithm. I've already posted a very simplistic C++ implementation of the algorithm.

#8


OK, the current rework shows this as fastest (pseudocode):

foreach (var term in allTerms)
{
    // \b word boundaries so only whole words match; Regex.Escape guards any special characters in the term
    string pattern = @"\b" + Regex.Escape(term) + @"\b";
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(bigTextToSearchForTerms))
    {
        result.Add(term);
    }
}

What was surprising (to me at least!) is that running the regex 1000 times was faster than a single regex with 1000 alternatives, i.e. "\bterm1\b|\bterm2\b|...|\btermN\b", and then trying to use regex.Matches.Count.
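
For comparison, a sketch of the single combined regex it was measured against, extended here to recover which terms matched rather than just a count (the Distinct step is my addition):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static List<string> MatchWithSingleRegex(string bigTextToSearchForTerms, List<string> allTerms)
{
    string pattern = string.Join("|",
        allTerms.Select(t => @"\b" + Regex.Escape(t) + @"\b"));
    var regex = new Regex(pattern, RegexOptions.IgnoreCase);

    // Each Match.Value is the term as it appeared in the text.
    return regex.Matches(bigTextToSearchForTerms)
                .Cast<Match>()
                .Select(m => m.Value)
                .Distinct(StringComparer.OrdinalIgnoreCase)
                .ToList();
}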

#9


How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...

List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase)).ToList();

This uses a classic predicate to implement the FindAll method, so it should be quicker than LINQ:

static bool Match(string checkItem)
{
  // bigTextToSearchForTerms is assumed to be a static field visible here
  return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}

static void Main(string[] args)
{
  List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
  List<String> matches = allTerms.FindAll(Match);
}

Or this uses the lambda syntax to implement the classic predicate, which again should be faster than the LINQ, but is more readable than the previous syntax:

List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.FindAll(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));

I haven't tested any of them for performance, but they all implement your idea of iterating through the search list with a regex. They're just different ways of implementing it.
