Fastest way in VBScript to check whether a string contains any word/phrase from a list of words/phrases

Date: 2022-06-17 00:26:22

I am implementing a function that checks a blurb (e.g. a message or forum post) against a (potentially long) list of banned words/phrases and simply returns true if one or more of them is found in the blurb, and false otherwise.

This is to be done in VBScript.

The previous developer left a very large If statement using InStr(), e.g.:

    ' The real condition continues with many more terms (elided here)
    If InStr(UCase(contactname), "KORS") > 0 Or _
       InStr(UCase(contactname), "D&G") > 0 Or _
       InStr(UCase(contactname), "DOLCE") > 0 Or _
       InStr(UCase(contactname), "GABBANA") > 0 Or _
       InStr(UCase(contactname), "TIFFANY") > 0 Then
        ' ...
    End If

I am trying to decide between two solutions to replace the above code:

  1. Using a regular expression to find matches, where the pattern would be a simple (but potentially long) alternation such as "KORS|D&G|DOLCE|GABBANA|TIFFANY" and so on; a single regex test would return true if one or more of the words is found.
  2. Using an array where each item contains a banned word, and looping through the items, checking each one against the blurb. Once a match is found, the loop would terminate and a variable would be set to TRUE.

It seems to me that the regular expression option is the best, since it is a single "check", i.e. the blurb tested against the pattern. But would a potentially very long regex pattern add enough processing overhead to negate the simplicity and benefit of doing one "check" versus the many "checks" of the array-looping scenario?
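
For concreteness, the regular expression option might look roughly like the sketch below. The helper names EscapeForRegex and ContainsBannedRegex are hypothetical, and the escaping step is my own assumption: it is only needed if a banned phrase can contain regex metacharacters such as "." or "+".

```vbscript
' Escape characters that have a special meaning in a RegExp pattern.
Function EscapeForRegex(ByVal text)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = "([\\\^\$\.\|\?\*\+\(\)\[\]\{\}])"
    EscapeForRegex = re.Replace(text, "\$1")
End Function

' True if any entry of bannedWords occurs in blurb (case-insensitive).
Function ContainsBannedRegex(ByVal blurb, ByVal bannedWords)
    Dim i, re, parts()
    ContainsBannedRegex = False
    ' An empty list would produce an empty pattern, which matches everything.
    If UBound(bannedWords) < 0 Then Exit Function

    ReDim parts(UBound(bannedWords))
    For i = 0 To UBound(bannedWords)
        parts(i) = EscapeForRegex(bannedWords(i))
    Next

    Set re = New RegExp
    re.IgnoreCase = True
    re.Pattern = Join(parts, "|")   ' e.g. KORS|D&G|DOLCE|GABBANA|TIFFANY
    ContainsBannedRegex = re.Test(blurb)
End Function

Dim banned, isBanned
banned = Array("KORS", "D&G", "DOLCE", "GABBANA", "TIFFANY")
isBanned = ContainsBannedRegex("Cheap Tiffany rings", banned)   ' True
```

With IgnoreCase set, the UCase() call on the blurb is no longer needed.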

I am also open to additional options which I may have overlooked.

Thanks in advance.

EDIT - to clarify, this is for a SINGLE test of one "blurb" e.g. a comment, a forum post, etc. against the banned word list. It only runs one time during a web request. The benchmarking should test size of the word list and NOT the number of executions of the use case.

2 Solutions

#1


2  

It seems to me (without checking) that such a complex regexp would be slower. Evaluating such a long 'Or' expression would also be slow, because VBScript does not short-circuit: it evaluates every alternative even after one is already true, which is not necessary to know the value of the expression.
What I would do is populate an array with the banned words and iterate through it, checking whether each word occurs in the text being searched, and stop iterating as soon as a word is found.
You could store the most 'popular' banned words at the top of the array (a kind of ranking), so that you are most likely to find them in the first few steps.
Another benefit of using an array is that its values are easier to manage than values hardcoded in an If statement.
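
A minimal sketch of that loop, assuming the banned words live in a plain array; the function name ContainsBannedWord is hypothetical:

```vbscript
' True if any entry of bannedWords occurs in blurb (case-insensitive).
' The loop stops at the first hit, so frequently matched words
' placed at the front of the array are found sooner.
Function ContainsBannedWord(ByVal blurb, ByVal bannedWords)
    Dim word
    ContainsBannedWord = False
    For Each word In bannedWords
        If InStr(1, blurb, word, vbTextCompare) > 0 Then
            ContainsBannedWord = True
            Exit For
        End If
    Next
End Function

Dim banned, isBanned
banned = Array("KORS", "D&G", "DOLCE", "GABBANA", "TIFFANY")
isBanned = ContainsBannedWord("Cheap Tiffany rings", banned)   ' True
```

Passing vbTextCompare to InStr makes the comparison case-insensitive, so the text does not have to be upper-cased first.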

I just tested 1 000 000 checks with a regexp ("word|anotherword") against one InStr call per word, and it seems I was wrong.
The regex check took 13 seconds, while InStr took 71 seconds.
Edited: Checking each word separately with a regexp took 78 seconds.
I still think that if you have many banned words, checking them one by one and breaking as soon as one is found would be faster (after this last check I would consider joining them in groups (of 5? 10?) and testing a less complex regexp each time).
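
To repeat this kind of measurement while varying the size of the word list, something like the rough sketch below could be used. Everything in it is an assumption for illustration: the placeholder words, the list sizes, the repeat count, and the use of WScript.Echo (which requires running under Windows Script Host rather than ASP). The blurb contains none of the words, so both approaches scan the whole list. Each pass is repeated only because Timer is too coarse to time a single one.

```vbscript
' Build a list of n placeholder words ("word0", "word1", ...); hypothetical data.
Function MakeWordList(ByVal n)
    Dim i, words()
    ReDim words(n - 1)
    For i = 0 To n - 1
        words(i) = "word" & i
    Next
    MakeWordList = words
End Function

' Seconds spent running the early-exit InStr loop `repeats` times.
Function TimeInStrLoop(ByVal blurb, ByVal words, ByVal repeats)
    Dim started, r, word, found
    started = Timer   ' Timer resets at midnight; avoid runs that span it
    For r = 1 To repeats
        found = False
        For Each word In words
            If InStr(1, blurb, word, vbTextCompare) > 0 Then
                found = True
                Exit For
            End If
        Next
    Next
    TimeInStrLoop = Timer - started
End Function

' Seconds spent building and testing one alternation pattern `repeats` times.
Function TimeRegex(ByVal blurb, ByVal words, ByVal repeats)
    Dim started, r, re, found
    started = Timer
    For r = 1 To repeats
        Set re = New RegExp   ' rebuilt on every pass, as it would be per request
        re.IgnoreCase = True
        re.Pattern = Join(words, "|")
        found = re.Test(blurb)
    Next
    TimeRegex = Timer - started
End Function

Dim blurb, n
blurb = "A harmless sample comment that contains none of the listed terms."
For Each n In Array(10, 100, 1000)
    WScript.Echo n & " words: InStr loop " & _
        TimeInStrLoop(blurb, MakeWordList(n), 200) & " s, RegExp " & _
        TimeRegex(blurb, MakeWordList(n), 200) & " s"
Next
```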

#2


1  

You could create a string that contains all of your words. Surround each word with a delimiter.

    Const TEST_WORDS = "|KORS|D&G|DOLCE|GABBANA|TIFFANY|"

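Then test whether your word, wrapped in the same delimiters, is contained within this string (this matches when contactname as a whole is one of the listed words, not when it merely contains one):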

    If InStr(1, TEST_WORDS, "|" & contactname & "|", vbTextCompare) > 0 Then
        ' Found word
    End If

No need for array loops or regular expressions.
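
If the input is a longer blurb rather than a single word, one way to keep using this lookup is to split the blurb into tokens first. The sketch below is an assumption on my part, not something tested against real posts, and the function name ContainsListedToken is hypothetical. It splits on spaces only, so a token with attached punctuation ("Tiffany,") or a multi-word phrase would not be matched.

```vbscript
Const TEST_WORDS = "|KORS|D&G|DOLCE|GABBANA|TIFFANY|"

' True if any space-separated token of blurb is one of the listed words.
Function ContainsListedToken(ByVal blurb)
    Dim token
    ContainsListedToken = False
    For Each token In Split(blurb, " ")
        If Len(token) > 0 Then   ' skip empty tokens from repeated spaces
            If InStr(1, TEST_WORDS, "|" & token & "|", vbTextCompare) > 0 Then
                ContainsListedToken = True
                Exit For
            End If
        End If
    Next
End Function
```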
