Java Regular Expression: Escaping Special Characters

I am trying to create a regular expression that accepts almost every character on an American keyboard except for a select few characters. This is what I currently have (not everything is included):

^[a-zA-Z0-9!~`@#$%\\^]

Now I know the ^ is the first character I have come across that needs an escape in front of it. When I put one \ I get a compilation error (invalid escape sequence). When I run this against a String it completely ignores the ^ rule. Anyone know what I'm doing wrong?

2 Solutions

#1

You only need to escape ^ when you want to match it literally, that is, you want to look for text containing the ^ character.

If you intend to use the ^ with its special meaning (the start of a line/string) then there is no need to escape it. Simply type

"^[a-zA-Z0-9!~`@#$%\\^]"

in your source code. The backslashes towards the end of this regular expression do not matter. You need to type two backslashes because of the special meaning of the backslash in Java string literals, but that has nothing to do with how regular expressions treat it. The regular expression engine receives a single backslash, which it uses to read the following character as a literal, but a ^ that is not the first character inside the brackets is a literal anyway.
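The two layers of escaping can be checked directly. The following is a minimal sketch; the class and method names are made up for illustration. It shows that the Java literal "\\^" is only two characters long once compiled, and that the character class matches a caret with or without the backslash.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class BackslashLayers {
    // The regex engine sees [%\^] here: Java consumed one of the two backslashes.
    static final String ESCAPED = "[%\\^]";
    // The regex engine sees [%^] here: the caret is not first, so it is a literal.
    static final String UNESCAPED = "[%^]";

    // Returns true if the given regex matches the one-character string "^".
    static boolean matchesCaret(String regex) {
        return Pattern.matches(regex, "^");
    }

    public static void main(String[] args) {
        System.out.println("\\^".length());          // 2 (a backslash and a caret)
        System.out.println(matchesCaret(ESCAPED));   // true
        System.out.println(matchesCaret(UNESCAPED)); // true
    }
}
```

Both variants behave the same, which is why the extra backslashes are harmless but unnecessary.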

To elaborate on your comment about [ and ]:

The brackets have a special meaning in regular expressions: they form the boundaries of the list of characters a pattern accepts at that position (the listed characters form a so-called character class). Let's decompose the regular expression from above to make things clear.

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lowercase letters from a to z
A-Z Uppercase letters from A to Z
0-9 Digits from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\ Backslash. The regular expression engine receives only a single backslash, as the other one is consumed by Java's syntax for string literals. It would mark the following character as a literal, but ^ is already a literal at this position in a character class, so the backslash has no effect.
^ Caret, literally
] Closing boundary of your character class

The order of the items within the character class definition is largely irrelevant, except that a ^ placed directly after the opening [ negates the class. The expression above matches if the first character of the examined text is part of your character class definition. Whether the other characters in the examined text matter depends on how you use the regular expression.
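To illustrate how the way the expression is applied changes the result, here is a small sketch (class and method names are hypothetical). Matcher.find() only needs the pattern to match somewhere, which with the leading ^ means the first character, while Matcher.matches() requires the pattern to cover the entire input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class AnchoredVsFullMatch {
    // The single-character pattern from the question, without the redundant backslashes.
    static final Pattern FIRST_CHAR = Pattern.compile("^[a-zA-Z0-9!~`@#$%^]");

    // find(): only the first character of the input is checked, the rest is ignored.
    static boolean firstCharAllowed(String s) {
        return FIRST_CHAR.matcher(s).find();
    }

    // matches(): the pattern must consume the whole input, so only one-character inputs can pass.
    static boolean wholeInputMatches(String s) {
        return FIRST_CHAR.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(firstCharAllowed("a[]"));  // true
        System.out.println(wholeInputMatches("a[]")); // false
        System.out.println(firstCharAllowed("[a"));   // false
        System.out.println(wholeInputMatches("a"));   // true
    }
}
```

String.matches(...) behaves like the second method, so a validation that should cover every character needs a quantifier such as + after the character class.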

When you start out with regular expressions, you should always match against multiple test texts and verify the behaviour. It is also advisable to turn these test cases into unit tests to gain high confidence that your program behaves correctly.

A simple code sample to test the expression is as follows:

public class Test {
    public static void main(String[] args) {
        String regexp = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";
        String[] testdata = new String[] {
                "abc",
                "2332",
                "some@test",
                "test [ and ] test end",
                // Following sample will not match the pattern.
                "äöüßµøł"
        };
        for (String toExamine : testdata) {
            if (toExamine.matches(regexp)) {
                System.out.println("Match: " + toExamine);
            } else {
                System.out.println("No match: " + toExamine);
            }
        }
    }
}

Note that I use a modified pattern here. It ensures that all characters in the examined string match your character class. I also extended the character class to allow a \, a space, [ and ]. The decomposed description is:

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lowercase letters from a to z
A-Z Uppercase letters from A to Z
0-9 Digits from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\\\ Backslash, literally. The regular expression engine receives only 2 backslashes, as every other backslash is consumed by Java's syntax for string literals. The first backslash marks the second backslash as occurring literally in the string.
^ Caret, literally
\\[ Opening bracket, literally. The backslash makes the bracket lose its meaning as opening a character class definition.
\\] Closing bracket, literally. The backslash makes the bracket lose its meaning as closing a character class definition.
] Closing boundary of your character class
+ Means any number of characters matching your character class definition can occur, but at least 1 such character needs to be present for a match
$ Matches the end of the text
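As a quick check of the decomposition, the sketch below wraps the modified pattern in a helper and tries a few inputs. The class name, the method name and the sample strings are assumptions made up for this illustration.

```java
// Hypothetical demo class; the names and sample inputs are for illustration only.
class AllowedCharsCheck {
    // Same pattern as in the test program: the regex engine receives
    // ^[ a-zA-Z0-9!~`@#$%\\^\[\]]+$ after Java has processed the string literal.
    static final String REGEXP = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";

    // Returns true only if every character of s is in the character class
    // and s contains at least one character.
    static boolean isAllowed(String s) {
        return s.matches(REGEXP);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("a\\b"));     // true: the input is a, backslash, b
        System.out.println(isAllowed("[x] ^ y"));  // true: brackets, spaces and caret are allowed
        System.out.println(isAllowed(""));         // false: + requires at least one character
        System.out.println(isAllowed("a:b"));      // false: the colon is not in the class
    }
}
```

Note that the Java literal "a\\b" is a three-character string, so a single backslash in the input is matched by the four backslashes in the pattern literal.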

One thing I don't get though is why one would use the characters of American keyboards as criteria for validation.

#2

You don't have to escape ^ since you are using a character class, just use:

^[a-zA-Z0-9!~`@#$%^]

The character class formed by [...] lets you list the characters you want, and most special characters lose their special meaning within square brackets (the exceptions include the backslash, ], a ^ in first position and a - between two characters). The other case where you need a backslash is a predefined character class such as \d or \w; since the backslash is also special in Java string literals, you need to write it as \\d or \\w (but only because of Java, not the regex engine).
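A minimal sketch of how characters behave inside square brackets follows; the class and helper names are invented for illustration. It shows characters that become literals, the position-dependent meaning of ^ and -, and the doubled backslash needed for \d in Java source.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class CharClassSpecials {
    // Thin wrapper: true if the whole input matches the regex.
    static boolean m(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(m("[$.*+]", "+"));   // true: +, ., * and $ are literals in a class
        System.out.println(m("[$.*+]", "x"));   // false: the dot does not mean "any character" here
        System.out.println(m("[%^]", "^"));     // true: ^ is a literal when it is not first
        System.out.println(m("[^a]", "a"));     // false: a leading ^ negates the class
        System.out.println(m("[a-c]", "b"));    // true: - between two characters forms a range
        System.out.println(m("[a\\-c]", "b"));  // false: the escaped - is a literal hyphen
        System.out.println(m("\\d", "7"));      // true: the regex engine receives \d
    }
}
```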

For example:

"a".matches("^[a-zA-Z0-9!~`@#$%^]");
"asdf".matches("^[a-zA-Z0-9!~`@#$%^]+"); // for multiple characters
