How can I use a regular expression to check whether a string contains restricted words?

Time: 2023-02-10 01:33:04

These are the strings that I should not allow in my address:

"PO BOX","P0 DRAWER","POSTOFFICE", " PO ", " BOX ",
 "C/O","C.O."," ICO "," C/O "," C\0 ","C/0","P O BOX",
 "P 0 BOX","P 0 B0X","P0 B0X","P0 BOX","P0BOX","P0B0X",
 "POBX","P0BX","POBOX","P.0.","P.O","P O "," P 0 ",
 "P.O.BOX","P.O.B","POB ","P0B","P 0 B","P O B",
 " CARE ","IN CARE"," APO "," CPO "," UPO ", "GENDEL",
 "GEN DEL", "GENDELIVERY","GEN DELIVERY","GENERALDEL",
 "GENERAL DEL","GENERALDELIVERY","GENERAL DELIVERY"

I wrote the regular expression below, but it only validates the PO BOX part. Please correct it so that none of the strings above are allowed in my address field.

"([\\w\\s*\\W]*((P(O|OST)?.?\\s*((O(FF(ICE)?)?)?.?\\s*(B(IN|OX|.?))|B(IN|OX))+))[\\w\\s*\\W]*)+
|([\\w\\s*\\W]* (IN \s*(CARE)?\\s*)|\s*[\\w\\s*\\W]*((.?(APO)?|.?(cPO)?|.?(uPO))?.?\s*) [\\w\\s*\\W]*|([\\w\\s*\\W]*(GEN(ERAL)?)?.?\s*(DEL(IVERY)?)?.?\s* [\\w\\s*\\W]*))";

3 solutions

#1


2  

I'm guessing you're trying to see if an address string contains any restricted phrases.

Please do not do this in one single regex.

A single massive regex is hard to understand once it is written, hard to extend when more restrictions pop up, and generally not good coding practice.

Here's a (hopefully) more sane approach:

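import java.util.regex.Pattern;

// Each entry is a regex; plain strings work too if they contain no regex metacharacters
public static final String[] RESTRICTIONS = { " P[0O] ", " B[0O]X ", "etc, etc" };

// Returns true as soon as any restriction matches somewhere in testString
public static boolean containsRestrictions(String testString) {
    for (String expression : RESTRICTIONS) {
        if (Pattern.compile(expression).matcher(testString).find())
            return true;
    }
    return false;
}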

You're still doing regex matching, so you can put your fancy-schmancy regexes into the restrictions list, but plain old strings work too (as long as they contain no regex metacharacters). Now you only need to verify that each individual regex works, instead of verifying one giant regex against all possible cases. If you want to add a new restriction, just add it to the list. If you're really fancy, you can load the restrictions from a configuration file or inject them using Spring, so your pesky product people can add address restrictions without touching a line of code.
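To make the configuration-file idea a bit more concrete, here is a minimal sketch. The class name RestrictionList and the one-regex-per-line format (with '#' comment lines) are my own assumptions, not part of the answer above. It compiles each pattern once instead of on every call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: holds precompiled restriction patterns.
final class RestrictionList {
    private final List<Pattern> patterns = new ArrayList<>();

    // Assumed format: one regex per line; blank lines and lines starting with '#' are skipped.
    // Lines are NOT trimmed, because leading/trailing spaces are significant in " P[0O] ".
    RestrictionList(List<String> configLines) {
        for (String line : configLines) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(line));
        }
    }

    // Returns true if any precompiled restriction matches somewhere in testString.
    boolean containsRestrictions(String testString) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(testString).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // In a real setup the lines could come from a file, e.g. via Files.readAllLines(...).
        RestrictionList list = new RestrictionList(
                Arrays.asList("# PO variants", " P[0O] ", " B[0O]X "));
        System.out.println(list.containsRestrictions("12 MAIN ST P0 44"));
        System.out.println(list.containsRestrictions("12 MAIN ST"));
    }
}
```

Compiling up front also means a broken regex in the configuration fails at load time rather than in the middle of validating an address.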

Edit: To make this even easier to read, and to do what you really want (restricting words that are separated from the rest of the text by whitespace), you can remove regexes from the restrictions list altogether and do the basic matching work in your method.

import java.util.regex.Pattern;

// No regexes here, just the words you want to restrict
public static final String[] RESTRICTIONS = { "PO", "PO BOX", "etc, etc" };

public static boolean containsRestrictions(String testString) {
    for (String word : RESTRICTIONS) {
        // Pattern.quote keeps characters such as '.' in "P.O." literal
        String expression = "(^|\\s)" + Pattern.quote(word) + "(\\s|$)";
        if (Pattern.compile(expression).matcher(testString).find())
            return true;
    }
    return false;
}
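If you want to go one step further and compare the restricted words as plain text, something like the sketch below might do. The class name, the sample word list, and the upper-casing step are my assumptions; the only regex left is the one that collapses whitespace in the input.

```java
import java.util.Locale;

// Hypothetical helper: whole-word matching without putting regexes in the restriction list.
final class WordRestrictions {
    // Sample entries only; the full list from the question would go here.
    static final String[] RESTRICTIONS = { "PO", "PO BOX", "P.O.", "C/O" };

    static boolean containsRestrictions(String address) {
        // Collapse whitespace runs and pad both ends, so every word in the
        // address is surrounded by exactly one space on each side.
        String padded = " "
                + address.trim().replaceAll("\\s+", " ").toUpperCase(Locale.ROOT)
                + " ";
        for (String word : RESTRICTIONS) {
            // Plain substring test: '.' and '/' have no special meaning here.
            if (padded.contains(" " + word + " ")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsRestrictions("PO BOX 123"));
        System.out.println(containsRestrictions("12 POPLAR ST"));
    }
}
```

Because the comparison is a substring test on padded text, "POPLAR" does not trip the "PO" restriction, while "c/o   John" does trip "C/O".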

#2


1  

So, you want to search substrings like a pro? I'd suggest using the Aho-Corasick algorithm, which solves exactly the kind of problem you have.

Selling point:

It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns simultaneously.

Luckily, a Java implementation exists. You can get it here.

Here's how to use it:

// This is the part you have to do only once: build the search tree
AhoCorasick tree = new AhoCorasick();

String[] terms = {"PO BOX","P0 DRAWER",...};

for (String term : terms) {
    tree.add(term.getBytes(), term);
}
tree.prepare();

// This is the part you run for every address you want to check
String text = "The ga3 mutant of Arabidopsis is a gibberellin-responsive. In UPO, that is...";

@SuppressWarnings("unchecked")
Iterator<SearchResult> search = (Iterator<SearchResult>) tree.search(text.getBytes());

// Any result at all means a restricted term occurs in the text
boolean restrictedWordFound = search.hasNext();

If a match has been found, restrictedWordFound will be true.

Note: this search is case-sensitive. Since your strings are all upper case, I'd suggest first converting the address to a temporary upper-case copy and matching against that. That way, you cover all possible case combinations.
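A minimal sketch of that upper-casing step, using a plain contains loop so it stands on its own. The class name and the two sample terms are my assumptions. Locale.ROOT is used because String.toUpperCase() without a locale follows the default locale, and in a Turkish locale "i" upper-cases to a dotted "İ", which would make "delivery" miss "DELIVERY".

```java
import java.util.Locale;

// Hypothetical helper: case-insensitive check via a temporary upper-case copy.
final class CaseNormalization {
    // Sample terms only; all of them are stored in upper case.
    static final String[] TERMS = { "PO BOX", "GENERAL DELIVERY" };

    static boolean containsRestrictedTerm(String address) {
        // Locale.ROOT gives locale-independent case mapping.
        String upper = address.toUpperCase(Locale.ROOT);
        for (String term : TERMS) {
            if (upper.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsRestrictedTerm("General Delivery, Springfield"));
        System.out.println(containsRestrictedTerm("12 Main St"));
    }
}
```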

In my tests, Aho-Corasick was faster than regex-based search and, in most cases, faster than naive string searching with contains and other String-based methods. You can add even more filter words; Aho-Corasick is the way to go.
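If you can't pull in a library, the idea is small enough to sketch by hand. The following is my own compact illustration of Aho-Corasick (a trie plus failure links), not the API of the implementation mentioned above. The class name SimpleAhoCorasick and its methods are made up for this example, and it only answers "does any term occur?" rather than reporting match positions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical minimal Aho-Corasick matcher; node 0 is the trie root.
final class SimpleAhoCorasick {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // trie edges per node
    private final List<Integer> fail = new ArrayList<>();                 // failure link per node
    private final List<Boolean> terminal = new ArrayList<>();             // some term ends here

    SimpleAhoCorasick(String... terms) {
        newNode(); // create the root
        for (String term : terms) {
            int node = 0;
            for (char c : term.toCharArray()) {
                Integer child = next.get(node).get(c);
                if (child == null) {
                    child = newNode();
                    next.get(node).put(c, child);
                }
                node = child;
            }
            terminal.set(node, true);
        }
        buildFailureLinks();
    }

    private int newNode() {
        next.add(new HashMap<>());
        fail.add(0);
        terminal.add(false);
        return next.size() - 1;
    }

    // Breadth-first pass: each node's failure link points to the node for the
    // longest proper suffix of its path that is also a path in the trie.
    private void buildFailureLinks() {
        Queue<Integer> queue = new ArrayDeque<>(next.get(0).values());
        while (!queue.isEmpty()) {
            int node = queue.remove();
            for (Map.Entry<Character, Integer> edge : next.get(node).entrySet()) {
                int child = edge.getValue();
                int f = fail.get(node);
                while (f != 0 && !next.get(f).containsKey(edge.getKey())) {
                    f = fail.get(f);
                }
                Integer target = next.get(f).get(edge.getKey());
                fail.set(child, target != null ? target : 0);
                // A node also counts as a match if its suffix node is one.
                if (terminal.get(fail.get(child))) {
                    terminal.set(child, true);
                }
                queue.add(child);
            }
        }
    }

    // Scans the text once, following failure links on mismatches.
    boolean containsAny(String text) {
        int node = 0;
        for (char c : text.toCharArray()) {
            while (node != 0 && !next.get(node).containsKey(c)) {
                node = fail.get(node);
            }
            node = next.get(node).getOrDefault(c, 0);
            if (terminal.get(node)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleAhoCorasick matcher =
                new SimpleAhoCorasick("PO BOX", "P0 DRAWER", " UPO ", "GENDEL");
        System.out.println(matcher.containsAny("SEND TO PO BOX 12"));
        System.out.println(matcher.containsAny("12 MAPLE AVENUE"));
    }
}
```

The text is scanned once, character by character, no matter how many terms are in the dictionary, which is where the "matches all patterns simultaneously" selling point comes from.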

#3


0  

Instead of using such a complicated regular expression, you can simply list the restricted strings as alternatives in one regex (written here as a Java string literal, with the literal dots and the backslash escaped):

"PO BOX|P0 DRAWER|POSTOFFICE| PO | BOX |C/O|C\\.O\\.| ICO | C/O | C\\\\0 |C/0|P O BOX|P 0 BOX|P 0 B0X|P0 B0X|P0 BOX|P0BOX|P0B0X|POBX|P0BX|POBOX|P\\.0\\.|P\\.O|P O | P 0 |P\\.O\\.BOX|P\\.O\\.B|POB |P0B|P 0 B|P O B| CARE |IN CARE| APO | CPO | UPO |GENDEL|GEN DEL|GENDELIVERY|GEN DELIVERY|GENERALDEL|GENERAL DEL|GENERALDELIVERY|GENERAL DELIVERY"

Then negate the result: the address is acceptable only if the regex finds no match.
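One way to avoid hand-escaping such a long alternation is to build it from the raw strings. This is a rough sketch under my own assumptions: the class name, the shortened sample list, and the isAllowed method are all hypothetical. Pattern.quote makes every term literal, so "P.O" no longer matches "PRO".

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: one precompiled alternation built from literal terms.
final class AlternationCheck {
    // Shortened sample of the list from the question; " C\\0 " is C, backslash, zero.
    static final String[] TERMS = { "PO BOX", "P.O", "C/O", " C\\0 ", "GENERAL DELIVERY" };

    // Quote each term so regex metacharacters are matched literally, then join with '|'.
    static final Pattern RESTRICTED = Pattern.compile(
            Arrays.stream(TERMS).map(Pattern::quote).collect(Collectors.joining("|")));

    // Negated result: the address is allowed only when no term is found.
    static boolean isAllowed(String address) {
        return !RESTRICTED.matcher(address).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("PO BOX 7"));
        System.out.println(isAllowed("PROSPECT AVE"));
    }
}
```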

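If you compile the regex once (in Java, via Pattern.compile) and reuse it for every address, matching becomes more efficient because the pattern is not re-parsed each time. Note that java.util.regex is a backtracking engine and does not build a minimized DFA, so a long alternation is still tried alternative by alternative.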
