Regular expression confused by double and single quotes

Time: 2022-09-15 16:38:36

I have this JavaScript (running in Chrome 48.0.2564.103 m):

var s1 = 'label1="abc" label2=\'def\' ';
var s2 = 'label1="abc" label2=\'def\' label3="ghi"';
var re = /\b(\w+)\b=(['"]).*?def.*?\2/;

re.exec(s1); // --> ["label2='def'", "label2", "'"]
re.exec(s2); // --> ["label1=\"abc\" label2='def' label3=\"", "label1", "\""]

The first exec() matches label2, as I intended. However, the second gets confused by the double quote after 'label3=' and matches label1 instead.

I had expected the use of .*? to tell the regular expression to make the match as tightly as possible, but clearly it doesn't always. Is there a way to tighten up my regular expression?

2 answers

#1


3  

Just exclude what was seen as a quote

/\b(\w+)\b=(['"])(?:.(?!\2))*def(?:.(?!\2))*.?\2/

So the change was replacing your .*? with (?:.(?!\2))*.

Break down:

  • (?!...) is a negative lookahead; it consumes no characters and captures nothing.
  • (?:...) is a non-capturing group.
  • Because each . must not be followed by the quote, the group can never consume the last character before the closing quote. The extra .? picks up that character when the value does not end in def.
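
One thing worth checking with this pattern: because the lookahead comes after the dot, the dot itself is still free to consume a quote character, so the match can run past the closing quote when the same quote type appears again later in the string. A common variant (not part of the original answer) puts the lookahead before the dot, which also makes the extra .? unnecessary. A minimal sketch; the names reAfter and reBefore and the sample string s3 are made up for illustration:

```javascript
// Pattern from the answer: the lookahead is checked after each character.
var reAfter = /\b(\w+)\b=(['"])(?:.(?!\2))*def(?:.(?!\2))*.?\2/;
// Variant (an assumption, not from the answer): check the lookahead first,
// so the dot can never consume the quote captured in group 2.
var reBefore = /\b(\w+)\b=(['"])(?:(?!\2).)*def(?:(?!\2).)*\2/;

var s2 = 'label1="abc" label2=\'def\' label3="ghi"';
// Hypothetical input with a second single-quoted value after the one we want.
var s3 = 'label2=\'def\' label3=\'ghi\'';

console.log(reAfter.exec(s2)[0]);  // label2='def'
console.log(reBefore.exec(s2)[0]); // label2='def'
console.log(reAfter.exec(s3)[0]);  // label2='def' label3='
console.log(reBefore.exec(s3)[0]); // label2='def'
```

On the question's s2 both patterns agree; they differ only once a second quote of the same type follows the value.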

This allows you to combine other rules when you want to allow a='\'' or a="\"" or further a="\\\"":

/\b(\w+)\b=(['"])(?:\\\\|\\\2|.(?!\2))*def(?:\\\\|\\\2|.(?!\2))*.?\2/

#2


3  

The reason s2 gives a different result is that you added a " to the right of def (in label3="ghi"), which lets the pattern match everything from the first double quote in the string up to the first double quote that follows def.

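The lazy quantifier (.*?) doesn't help here because by that point the regex engine has already committed to " rather than ' as the quote. The engine works left to right: it tries the earliest start position first (label1), and a lazy quantifier only keeps the match as short as possible from that start. It never makes the engine skip ahead to a later start that would give a shorter match.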
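
To see that the starting position, rather than the quantifier, decides the result, one can run the question's pattern again from a later position. A small sketch; the variable names m and fromLabel2 are illustrative only:

```javascript
var s2 = 'label1="abc" label2=\'def\' label3="ghi"';
var re = /\b(\w+)\b=(['"]).*?def.*?\2/;

// The engine tries start positions left to right and stops at the first
// one that can produce any match at all; lazy quantifiers only shorten
// the match from that fixed start.
var m = re.exec(s2);
console.log(m.index); // 0  (the match starts at label1)

// Starting the search at label2 instead gives the intended result.
var fromLabel2 = s2.slice(s2.indexOf('label2'));
console.log(re.exec(fromLabel2)[0]); // label2='def'
```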

The "simplest" way of solving this is to match only non-quotes, rather than using ., between the quotes:

var re = /\b(\w+)\b=(['"])[^'"]*def[^'"]*\2/;

re.exec(s1); // --> ["label2='def'", "label2", "'"]
re.exec(s2); // --> ["label2='def'", "label2", "'"]

The problem with this is that now you can't put any kind of quotes in the value, even if they are perfectly legal:

// This won't match because of the " after def
var s2 = 'label1="abc" label2=\'def"\' label3="ghi"'

// This matches, but only up to the escaped quote (label2='def\'), so the value is cut short
var s2 = 'label1="abc" label2=\'def\\\'\' label3="ghi"'

But basically, regex isn't made for parsing HTML, so if these limitations are a problem you should look into proper parsing.
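
As a rough idea of what "proper parsing" could look like for this input, here is a minimal hand-written scanner. It is only a sketch under stated assumptions: parsePairs is a hypothetical helper, and it assumes a backslash escapes the next character inside a quoted value, which the question does not specify.

```javascript
// Hypothetical helper: scans name=quoted-value pairs one character at a time.
// Assumes a backslash escapes the next character inside a quoted value.
function parsePairs(input) {
  var pairs = [];
  var i = 0;
  while (i < input.length) {
    // Read a name made of word characters.
    var start = i;
    while (i < input.length && /\w/.test(input[i])) i++;
    var name = input.slice(start, i);
    var quote = input[i + 1];
    if (name && input[i] === '=' && (quote === '"' || quote === "'")) {
      i += 2; // skip "=" and the opening quote
      var value = '';
      while (i < input.length && input[i] !== quote) {
        // On a backslash, step over it and keep the escaped character.
        if (input[i] === '\\' && i + 1 < input.length) i++;
        value += input[i++];
      }
      // Record the pair only if a closing quote was actually found.
      if (i < input.length) {
        pairs.push({ name: name, quote: quote, value: value });
      }
      i++; // skip the closing quote
    } else if (i === start) {
      i++; // not a word character: skip it
    }
  }
  return pairs;
}

// Usage: find the pair whose value contains "def", even with an escaped quote.
var pairs = parsePairs('label1="abc" label2=\'def\\\'\' label3="ghi"');
var hit = pairs.filter(function (p) { return p.value.indexOf('def') !== -1; })[0];
console.log(hit.name, hit.value); // label2 def'
```

Scanning value by value means a quote inside one value can never be mistaken for the delimiter of another, which is the failure mode the regex versions run into.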
