Regular expression for matching escaped characters (quotes)

Time: 2022-09-15 16:22:36

I want to build a simple regex that covers quoted strings, including any escaped quotes within them. For instance,

"This is valid"
"This is \" also \" valid"

Obviously, something like

"([^"]*)"

does not work, because it matches up to the first escaped quote.
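
To see the failure concretely, here is a minimal Java sketch (the class and method names are illustrative and not part of the question). It runs the naive pattern against the second example and shows that the match ends at the first escaped quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveQuoteDemo {
    // The naive pattern from the question: a quote, a run of non-quote characters, a quote.
    static final Pattern NAIVE = Pattern.compile("\"([^\"]*)\"");

    // Returns the first match of the naive pattern, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = NAIVE.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The Java literal below holds the second example from the question,
        // a quoted string containing two escaped quotes.
        String input = "\"This is \\\" also \\\" valid\"";
        // The match stops at the quote that follows the first backslash.
        System.out.println(firstMatch(input));
    }
}
```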

What is the correct version?

I suppose the answer would be the same for other escaped characters (by just replacing the respective character).

By the way, I am aware of the "catch-all" regex

"(.*?)"

but I try to avoid it whenever possible, because, not surprisingly, it runs somewhat slower than a more specific one.

5 answers

#1


1  

The problem with the other answers is that they pass the obvious initial tests but fall short under further scrutiny. For example, they all assume that the very first quote is not escaped. More importantly, escaping involves more than a single backslash, because the backslash itself can be escaped. Imagine trying to match a string that ends with a backslash: how would that be possible?

This would be the pattern you are looking for. It doesn't assume that the first quote is the working one, and it will allow for backslashes to be escaped.

(?<!\\)(?:\\{2})*"(?:(?<!\\)(?:\\{2})*\\"|[^"])+(?<!\\)(?:\\{2})*"
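
As a rough check of this pattern, the following Java sketch (class and method names are illustrative assumptions) writes it as a Java string literal and tries it on two inputs: one whose first quote is escaped, and one whose content ends with an escaped backslash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindQuoteDemo {
    // The pattern from this answer as a Java literal:
    // every regex backslash is doubled and every quote is escaped.
    static final Pattern QUOTED = Pattern.compile(
        "(?<!\\\\)(?:\\\\{2})*\"(?:(?<!\\\\)(?:\\\\{2})*\\\\\"|[^\"])+(?<!\\\\)(?:\\\\{2})*\"");

    // Collects every match found in the input, in order.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // First input: an escaped quote, the word "skipped", then a real quoted string.
        // Only the real quoted string is reported.
        System.out.println(findAll("\\\"skipped \"kept \\\" here\""));
        // Second input: a quoted string whose content ends with an escaped backslash.
        // The whole input is reported as one match.
        System.out.println(findAll("\"this\\\\\""));
    }
}
```

Note that the body of the pattern uses `+`, so an empty string `""` is not matched by it.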

#2


13  

Here is one that I've used in the past:

("[^"\\]*(?:\\.[^"\\]*)*")

This will capture quoted strings, along with any escaped quote characters, and exclude anything that doesn't appear in enclosing quotes.

For example, the pattern will capture "This is valid" and "This is \" also \" valid" from this string:

"This is valid" this won't be captured "This is \" also \" valid"

This pattern will not match the string "I don't \"have\" a closing quote, and will allow for additional escape codes in the string (e.g., it will match "hello world!\n").
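
The behaviour described so far can be checked with a small Java sketch. The class and method names are illustrative assumptions, and the pattern is written as a Java string literal, so its backslashes and quotes are escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnrolledQuoteDemo {
    // Opening quote, a run of plain characters, then any number of
    // "backslash + any character, followed by another plain run", then the closing quote.
    static final Pattern QUOTED = Pattern.compile("(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");

    // Collects every quoted string found in the input, in order.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        // Two quoted strings with unquoted text between them: both are captured.
        System.out.println(findAll(
            "\"This is valid\" this won't be captured \"This is \\\" also \\\" valid\""));
        // No closing quote: nothing is captured.
        System.out.println(findAll("\"I don't \\\"have\\\" a closing quote"));
        // Other escape codes (here a backslash followed by n) are accepted too.
        System.out.println(findAll("\"hello world!\\n\""));
    }
}
```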

Of course, you'll have to escape the pattern to use it in your code, like so:

"(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")"

#3


2  

Try this one. The alternation tries the escaped quote \" first; if that does not match, it falls back to any single character other than ".

"((?:\\"|[^"])*)"

Once you have matched the string, you'll need to take the first captured group's value and replace \" with ".

Edit: Fixed grouping logic.
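
The match-then-unescape step described in this answer might look like the following in Java. This is only a sketch: the class and method names are made up for illustration, and only the \" escape is undone, as the answer describes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnescapeDemo {
    // The pattern from this answer as a Java literal; group 1 is the text between the quotes.
    static final Pattern QUOTED = Pattern.compile("\"((?:\\\\\"|[^\"])*)\"");

    // Returns the content of the first quoted string with each \" turned into ",
    // or null if the input contains no quoted string.
    static String extract(String input) {
        Matcher m = QUOTED.matcher(input);
        if (!m.find()) {
            return null;
        }
        // String.replace performs a literal (non-regex) replacement.
        return m.group(1).replace("\\\"", "\"");
    }

    public static void main(String[] args) {
        // The input is the second example from the question.
        System.out.println(extract("\"This is \\\" also \\\" valid\""));
    }
}
```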

#4


1  

The code below validates comma-separated lists of quoted strings, decimals, and integers.

public class CommaSeparatedValidators {

    // One single-quoted string; backslash escapes such as \' are allowed inside.
    private static final String QUOTED = "'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'";

    // Validates single-quoted strings separated by "," or ", ".
    public static void commaSeparatedStrings() {
        String value = "'It\\'s my world', 'Hello World', 'What\\'s up', 'It\\'s just what I expected.'";

        if (value.matches(QUOTED + "(?:,\\s?" + QUOTED + ")*")) {
            System.out.println("Valid...");
        } else {
            System.out.println("Invalid...");
        }
    }

    // Validates signed decimals (1 to 10 fractional digits) separated by "," or ", ".
    public static void commaSeparatedDecimals() {
        String value = "-111.00, 22111.00, -1.00";
        // String.matches() must match the whole input, so no ^ or $ anchors are needed.
        if (value.matches("-?\\d+\\.\\d{1,10}(?:,\\s?-?\\d+\\.\\d{1,10})*")) {
            System.out.println("Valid...");
        } else {
            System.out.println("Invalid...");
        }
    }

    // Validates signed integers separated by "," or ", ".
    public static void commaSeparatedNumbers() {
        String value = "-11, 22, -31";
        if (value.matches("-?\\d+(?:,\\s?-?\\d+)*")) {
            System.out.println("Valid...");
        } else {
            System.out.println("Invalid...");
        }
    }
}

#5


1  

This

("((?:[^"\\])*(?:\\\")*(?:\\\\)*)*")

will capture all strings (within double quotes), including \" and \\ escape sequences. (Note that this answer assumes that the only escape sequences in your string are \" or \\ sequences -- no other backslash characters or escape sequences will be captured.)

("(?:         # begin with a quote and capture...
  (?:[^"\\])* # any non-\, non-" characters
  (?:\\\")*   # any combined \" sequences
  (?:\\\\)*   # and any combined \\ sequences
  )*          # any number of times
")            # then, close the string with a quote

Also, note that maksymiuk's accepted answer contains an "edge case" ("Imagine trying to actually match a string which ends with a backslash") which is actually just a malformed string. Something like

"this\"

...is not a "string ending on a backslash", but an unclosed string ending on an escaped quotation mark. A string which truly ends on a backslash would look like

"this\\"

...and the above solution handles this case.
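
The two cases above, plus the stated limitation on other escape sequences, can be checked with a short Java sketch. The class and method names are illustrative assumptions, and the pattern is this answer's one-line form transcribed into a Java string literal.

```java
import java.util.regex.Pattern;

final class TwoEscapeDemo {
    // This answer's pattern as a Java literal: only \" and \\ are accepted as escapes.
    static final Pattern QUOTED =
        Pattern.compile("(\"((?:[^\"\\\\])*(?:\\\\\\\")*(?:\\\\\\\\)*)*\")");

    // True when the whole input is one well-formed quoted string under this pattern.
    static boolean isQuotedString(String input) {
        return QUOTED.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Content ends with an escaped backslash: accepted.
        System.out.println(isQuotedString("\"this\\\\\""));
        // Final quote is escaped, so the string is never closed: rejected.
        System.out.println(isQuotedString("\"this\\\""));
        // Escaped quotes inside the string: accepted.
        System.out.println(isQuotedString("\"say \\\"hi\\\"\""));
        // A backslash followed by n is not one of the two supported escapes: rejected.
        System.out.println(isQuotedString("\"line\\nbreak\""));
    }
}
```

Because the pattern nests several starred groups inside another starred group, it may backtrack heavily on long inputs that fail to match; the inputs here are short enough that this does not matter.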

If you want to expand a bit, this...

(\\(?:b|t|n|f|r|\"|\\)|\\(?:(?:[0-3][0-7]{2}|[0-7]{1,2}))|\\(?:u(?:[0-9a-fA-F]{4})))

...captures all common escape sequences (including escaped quotes):

(\\                       # get the preceding slash (for each section)
  (?:b|t|n|f|r|\"|\\)     # capture common sequences like \n and \t

  |\\                     # OR (get the preceding slash and)...
  # capture variable-width octal escape sequences like \02, \13, or \377
  (?:(?:[0-3][0-7]{2}|[0-7]{1,2}))

  |\\                     # OR (get the preceding slash and)...
  (?:u(?:[0-9a-fA-F]{4})) # capture fixed-width Unicode sequences like \u0242 or \uFFAD
)

See this Gist for more information on the second point.
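
As an illustration of an escape-sequence pattern of this shape, the Java sketch below lists the escape sequences found in a piece of text. The class and method names are hypothetical, and the octal branch is written as `[0-3][0-7]{2}|[0-7]{1,2}`, which limits the digits to 0-7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeSequenceDemo {
    // Three alternatives, each starting with a backslash:
    // a simple escape, an octal escape, or a four-digit Unicode escape.
    static final Pattern ESCAPE = Pattern.compile(
        "(\\\\(?:b|t|n|f|r|\"|\\\\)"
        + "|\\\\(?:[0-3][0-7]{2}|[0-7]{1,2})"
        + "|\\\\u[0-9a-fA-F]{4})");

    // Collects every escape sequence found in the input, in order.
    static List<String> findEscapes(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ESCAPE.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The text contains four escapes written out as plain characters:
        // a tab escape, an escaped quote, a Unicode escape and an octal escape.
        String text = "tab\\t quote\\\" unicode\\u0242 octal\\377";
        for (String escape : findEscapes(text)) {
            System.out.println(escape);
        }
    }
}
```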
