Is it possible to make the [a-zA-Z] Python regex pattern match and replace non-ASCII Unicode characters?

Time: 2022-06-08 01:10:51

In the following regular expression, I would like each character in the string replaced with an 'X', but it isn't working.

In Python 2.7:

>>> import re
>>> re.sub(u"[a-zA-Z]","X","dfäg")
'XX\xc3\xa4X'

or

>>> re.sub("[a-zA-Z]","X",u"dfäg",re.UNICODE)
u'XX\xe4X'

In Python 3.4:

>>> re.sub("[a-zA-Z]","X","dfäg")
'XXäX'

Is it possible to somehow 'configure' the [a-zA-Z] pattern to match 'ä', 'ü', etc.? If this can't be done, how can I create a similar character range pattern between square brackets that would include Unicode characters in the usual 'full alphabet' range? I mean, in a language like German, for instance, 'ä' would be placed somewhere close to 'a' in the alphabet, so one would expect it to be included in the 'a-z' range.

1 solution

#1


4  

You may use

(?![\d_])\w

with the Unicode flag enabled (re.UNICODE in Python 2; str patterns in Python 3 are Unicode-aware by default). The (?![\d_]) negative lookahead restricts the \w shorthand class so that it cannot match digits (\d) or underscores.
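To make the effect of the lookahead concrete, here is a minimal Python 3 sketch. The sample string and the name LETTER are made up for illustration and are not part of the original answer; only the pattern itself comes from it.

```python
import re

# Pattern from the answer: a word character that is not a digit or an underscore.
LETTER = re.compile(r"(?![\d_])\w")

# Hypothetical input mixing letters, an underscore, digits, a space and punctuation.
sample = "dfäg_42 Straße!"

# Only the letters (including the non-ASCII ä and ß) are replaced.
masked = LETTER.sub("X", sample)
print(masked)  # XXXX_42 XXXXXX!
```

The underscore, the digits, the space and the exclamation mark are left alone, because the lookahead rejects the first two and \w never matches the last two.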

A Python 3 demo:

import re
print(re.sub(r"(?![\d_])\w", "X", "dfäg"))
# => XXXX

As for Python 2:

# -*- coding: utf-8 -*-
import re
s = "dfäg"
w = re.sub(ur'(?![\d_])\w', u'X', s.decode('utf8'), 0, re.UNICODE).encode("utf8")
print(w)
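A variation that may be easier to read is the negated character class [^\W\d_], which expresses the same idea ("a word character that is neither a digit nor an underscore") without a lookahead. The sketch below is Python 3 and is not from the original answer; mask_letters is a hypothetical helper name. It also passes flags by keyword, since the fourth positional argument of re.sub is count, not flags.

```python
import re


def mask_letters(text, mask="X"):
    """Replace every letter in text with mask (hypothetical helper)."""
    # [^\W\d_] = not a non-word character, not a digit, not an underscore.
    # re.UNICODE is redundant for str patterns in Python 3 but harmless.
    return re.sub(r"[^\W\d_]", mask, text, flags=re.UNICODE)


print(mask_letters("dfäg"))  # XXXX
print(mask_letters("a_1ü"))  # X_1X
```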
