How to correctly use unicode strings in Python regular expressions

Date: 2021-10-17 01:39:00

I am getting a regular expression from a user as input, and it is saved as a unicode string. Do I have to turn the input string into a raw string before compiling it as a regex object, or is that unnecessary? And am I converting it to a raw string properly?

import re

# Pattern as received from the user, stored as a unicode string
input_regex_as_unicode = u"^(.){1,36}$"
string_to_check = "342342dedsfs"

# Option 1: leave as unicode
compiled_regex = re.compile(input_regex_as_unicode)
match_string = compiled_regex.match(string_to_check)

# Option 2: "convert to raw" by prepending an empty raw string literal
compiled_regex = re.compile(r'' + input_regex_as_unicode)
match_string = compiled_regex.match(string_to_check)

@Ahsanul Haque, my question is more specific to regular expressions: does the re module handle a unicode string properly when compiling it into a regex object?

1 answer

#1


The re module handles both unicode strings and normal strings properly, so you do not need to convert them to anything (but you should be consistent in which string type you use for the pattern and for the text you match against).
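As a minimal sketch of what "being consistent" means in practice, the snippet below assumes Python 3, where the two string types are str and bytes. The helper name can_match is made up for this illustration and is not part of the re module. Matching works when the pattern and the text share a type, and a mixed pair raises TypeError.

```python
import re


def can_match(pattern, text):
    """Hypothetical helper: True/False for a match, None if the types are mixed."""
    try:
        return re.compile(pattern).match(text) is not None
    except TypeError:
        # Python 3 refuses to apply a str pattern to bytes (and vice versa).
        return None


# Same type on both sides: no conversion needed.
str_result = can_match("^(.){1,36}$", "342342dedsfs")
bytes_result = can_match(b"^(.){1,36}$", b"342342dedsfs")

# Mixed types: this is the inconsistency to avoid.
mixed_result = can_match("^(.){1,36}$", b"342342dedsfs")
```

Under Python 2, mixing a unicode pattern with a plain ASCII string happens to work, which is why the code in the question runs either way.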

There is no such thing as a "raw string" object. Raw string notation is only a way of writing string literals in source code, and you can use it if it helps you with strings containing backslashes. For instance, to match a newline character you could use '\\n', u'\\n', r'\n' or ur'\n' (the ur prefix is Python 2 only).
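The equivalence of these spellings can be checked directly. This is a small sketch for Python 3 (so the ur'' form is left out); the variable names are chosen only for the illustration.

```python
import re

escaped = '\\n'  # backslash escaped by hand
raw = r'\n'      # raw string notation for the same two characters

# Both literals produce the identical two-character string: backslash + "n".
same_string = (escaped == raw)
raw_length = len(raw)

# A non-raw '\n' is a different string: one actual newline character.
literal_newline = '\n'
newline_length = len(literal_newline)

# As regex patterns, all three match a newline character.
all_match = all(
    re.match(pattern, "\n") is not None
    for pattern in (escaped, raw, literal_newline)
)
```

Once the literal has been evaluated, nothing marks the resulting string as "raw", so a pattern read from user input has nothing to be converted to.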

Your use of the raw string notation in your example does nothing, since r'' and '' evaluate to the same empty string, and concatenating an empty string leaves the pattern unchanged.
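A short sketch of that point, with one addition that is not in the original answer: because the pattern comes from a user, it may be worth catching re.error for invalid input. The helper name compile_user_pattern is hypothetical.

```python
import re


def compile_user_pattern(user_pattern):
    """Hypothetical helper: compile a user-supplied pattern, or return None if invalid."""
    try:
        return re.compile(user_pattern)
    except re.error:
        # The user typed something that is not a valid regular expression.
        return None


user_pattern = "^(.){1,36}$"

# Prepending an empty raw literal gives back an equal string.
unchanged = (r'' + user_pattern == user_pattern)

# The pattern can be compiled directly, with no "raw" conversion step.
compiled = compile_user_pattern(user_pattern)
matched = compiled.match("342342dedsfs") is not None

# An unbalanced parenthesis is rejected instead of raising to the caller.
invalid = compile_user_pattern("(")
```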
