Python 3 Regex and Unicode characters

Using Python 3, a simple script like the following should run as intended, but it appears to choke on Unicode emote strings:

import re

phrase = "(╯°□°)╯ ︵ ┻━┻"
pattern = r'\b{0}\b'.format(phrase)

text = "The quick brown fox got tired of jumping over dogs and flipped a table: (╯°□°)╯ ︵ ┻━┻"

if re.search(pattern, text, re.IGNORECASE) is not None:
    print("Matched!")

If I substitute the word "fox" for the contents of the phrase variable, the pattern does indeed match. I've been puzzled as to why it doesn't like this particular string though, and my expeditions into the manual and Stack Overflow haven't illuminated the issue. From all I can tell, Python 3 should handle this without issue.

Am I missing something painfully obvious?

Edit: Also, dropping the word boundaries (\b) doesn't help; the pattern still fails to match the string.

1 Answer

#1

(╯°□°)╯ ︵ ┻━┻

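This expression contains parentheses, which you need to escape. Otherwise they are interpreted as a capturing group, so the pattern looks for ╯°□° followed directly by ╯ and never matches the literal ( and ) in the text.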

In [24]: re.search(r'\(╯°□°\)╯ ︵ ┻━┻', text, re.IGNORECASE)
Out[24]: <_sre.SRE_Match object; span=(72, 85), match='(╯°□°)╯ ︵ ┻━┻'>

In [25]: re.findall(r'\(╯°□°\)╯ ︵ ┻━┻', text, re.IGNORECASE)
Out[25]: ['(╯°□°)╯ ︵ ┻━┻']
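
To make the group interpretation concrete, here is a minimal sketch (the variable names are only illustrative) comparing the unescaped phrase, the escaped phrase, and a string that lacks the parentheses:

```python
import re

phrase = "(╯°□°)╯ ︵ ┻━┻"
text = "The quick brown fox got tired of jumping over dogs and flipped a table: (╯°□°)╯ ︵ ┻━┻"

# Unescaped: "(" and ")" delimit a capturing group, so the regex looks for
# the characters "╯°□°╯ ︵ ┻━┻" with no literal parentheses at all.
unescaped = re.search(phrase, text)

# Escaped: re.escape() backslash-escapes the metacharacters, so "(" and ")"
# are matched literally.
escaped = re.search(re.escape(phrase), text)

# The unescaped pattern does match a string that has no parentheses,
# and the group captures the part that was between them.
no_parens = re.search(phrase, "╯°□°╯ ︵ ┻━┻")

print(unescaped)           # None
print(escaped.group(0))    # (╯°□°)╯ ︵ ┻━┻
print(no_parens.group(1))  # ╯°□°
```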

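Escape the regex string properly with re.escape and change your code to the version below. Note that the \b anchors are dropped as well: \b only matches between a word character and a non-word character, and this phrase begins with ( after a space and ends with ┻ at the end of the string, none of which are word characters.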

import re

phrase = "(╯°□°)╯ ︵ ┻━┻"
pattern = re.escape(phrase)

text = "The quick brown fox got tired of jumping over dogs and flipped a table: (╯°□°)╯ ︵ ┻━┻"

if re.search(pattern, text, re.IGNORECASE) is not None:
    print("Matched!")

And then it will work as expected:

$ python3 a.py
Matched!
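
If whole-phrase matching is still wanted, putting \b around the escaped phrase will not work here, because \b needs a word character on one side and this phrase starts and ends with non-word characters. One possible workaround, sketched below, is to use lookarounds that only require that no word character touches the phrase. The helper name contains_phrase is hypothetical and not part of the re module:

```python
import re

def contains_phrase(phrase, text):
    """Hypothetical helper: True if `phrase` occurs in `text` without a
    word character directly before or after it."""
    # (?<!\w) and (?!\w) assert "no word character here" instead of
    # requiring a word/non-word transition the way \b does.
    pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
    return re.search(pattern, text, re.IGNORECASE) is not None

text = "The quick brown fox got tired of jumping over dogs and flipped a table: (╯°□°)╯ ︵ ┻━┻"
emote = "(╯°□°)╯ ︵ ┻━┻"

# With \b the escaped phrase still fails, since "(" follows a space.
with_b = re.search(r'\b' + re.escape(emote) + r'\b', text)

print(with_b)                        # None
print(contains_phrase(emote, text))  # True
print(contains_phrase("fox", text))  # True
print(contains_phrase("fo", text))   # False, "fo" is only part of "fox"
```

For ordinary words this behaves like the \b version, so the same helper can be used for both kinds of phrase.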
