Python: An Introduction to Regular Expressions

Date: 2022-06-19 21:53:23

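Regular expressions are genuinely powerful and well worth learning properly. Their syntax is varied, though, and covering it completely would take a whole book. For an introduction, to flatten the learning curve and build confidence, I plan to start by summarizing the most common and most practical points, mainly those useful for sequence processing.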

Commonly used symbols

Standard library reference documentation
HOW-TO

.

In the default mode, this matches any character except a newline.

^

Matches the start of the string

$

Matches the end of the string

*

Matches 0 or more repetitions of the preceding RE, as many as possible.

+

Matches 1 or more repetitions of the preceding RE, as many as possible.

?

Matches 0 or 1 repetitions of the preceding RE.

{m}

Specifies that exactly m copies of the previous RE should be matched

{m,n}

Causes the resulting RE to match from m to n repetitions of the preceding RE
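
The anchors and quantifiers above can be tried out on a short string. The following is a minimal sketch; the sequence `seq` and all variable names are made up for illustration.

```python
import re

seq = 'ATGAAACCCGGGTAA'  # made-up sequence, purely for illustration

starts_with_atg = re.search(r'^ATG', seq) is not None  # '^' anchors at the start
ends_with_taa = re.search(r'TAA$', seq) is not None    # '$' anchors at the end

any_char = re.search(r'G.A', seq).group()         # '.' is any one character -> 'GAA'
zero_or_more = re.search(r'GT*A', seq).group()    # 'T*' may match nothing   -> 'GA'
one_or_more = re.search(r'GT+A', seq).group()     # 'T+' needs at least one  -> 'GTA'
optional = re.search(r'CG?G', seq).group()        # 'G?' is optional         -> 'CGG'
exactly_three = re.search(r'C{3}', seq).group()   # exactly three 'C'        -> 'CCC'
two_to_three = re.search(r'G{2,3}', seq).group()  # two to three 'G', greedy -> 'GGG'

print(any_char, zero_or_more, one_or_more, optional, exactly_three, two_to_three)
```

Note how `GT*A` and `GT+A` find different parts of the same string: the first is satisfied by a `G` directly followed by `A`, the second is not.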

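\

The escape character. It has two very important roles. The first is escaping: for example, '\*' matches a plain literal '*'. The second is introducing special sequences: for example, '\d' stands for any digit and has the same effect as [0-9].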
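
Both roles of the backslash can be seen in a short sketch. The sample text and variable names are made up; writing patterns as raw strings (`r'...'`) keeps Python itself from interpreting the backslash before the `re` module sees it.

```python
import re

text = 'price: 3*4 = 12'  # made-up sample text

# Role 1: escaping. r'\*' matches a literal '*' instead of acting as a quantifier.
literal_star = re.search(r'\*', text).group()

# Role 2: special sequences. r'\d' matches one digit, like [0-9] on ASCII text.
digits = re.findall(r'\d', text)
same_digits = re.findall(r'[0-9]', text)

# re.escape() escapes metacharacters for you when the text should be taken literally.
escaped = re.escape('3*4')
found = re.search(escaped, text).group()

print(literal_star, digits, found)
```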

[ ]

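Used to indicate a set of characters. Any character listed inside the brackets can take part in the match, and '-' can be used to denote a range, such as [a-z]. The metacharacters listed above are treated as literal characters inside a set: [(+*)] will match any of the literal characters '(', '+', '*', or ')'. As a special case, [^5] will match any character except '5'; here a leading '^' negates the set.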
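
A small sketch of the three uses of a set described above: a range, literal metacharacters, and negation. The sample text and variable names are made up for illustration.

```python
import re

text = 'f(x) = a+b*c - 5'  # made-up sample text

lowercase = re.findall(r'[a-z]', text)   # a range: any lowercase ASCII letter
operators = re.findall(r'[(+*)]', text)  # metacharacters are literal inside a set
only_five = re.sub(r'[^5]', '', text)    # negated set: delete everything except '5'

print(lowercase)  # ['f', 'x', 'a', 'b', 'c']
print(operators)  # ['(', ')', '+', '*']
print(only_five)  # 5
```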

|

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B
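
As a quick illustration of alternation, the sketch below looks for any one of three alternatives in a made-up sequence; the variable names are assumptions for this example only.

```python
import re

seq = 'ATGCCCTAGGGGTGA'  # made-up sequence, purely for illustration

# Each alternative is tried at every position, scanning left to right.
stops = re.findall(r'TAA|TAG|TGA', seq)

print(stops)  # ['TAG', 'TGA']
```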

( )

Matches whatever regular expression is inside the parentheses. Parentheses mainly serve to form a group; for readability, try to wrap each sub-expression in ().
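
Besides readability, a group also captures what it matched and lets a quantifier apply to a whole sub-expression. A minimal sketch, with a made-up address and made-up variable names:

```python
import re

m = re.match(r'(\w+)@(\w+)\.com', 'alice@example.com')  # made-up address

whole = m.group(0)   # the entire match
user = m.group(1)    # text captured by the first pair of parentheses
domain = m.group(2)  # text captured by the second pair
both = m.groups()    # all captured groups as a tuple

# A quantifier after ')' repeats the whole group, not just the last character.
repeated = re.match(r'(ab)+', 'ababab!').group()

print(user, domain, repeated)  # alice example ababab
```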
***

special sequences

The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character '$'.

\d

Matches any decimal digit; this is equivalent to the class [0-9]

\D

Matches any non-digit character; this is equivalent to the class [^0-9]

\s

Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]

\S

Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]

\w

Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W

Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]
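
The special sequences are easiest to remember by running them against one line of text. The sketch below uses a made-up record and made-up variable names. It also shows one caveat: for str patterns in Python 3, `\d` matches non-ASCII digits as well unless the `re.ASCII` flag is given.

```python
import re

line = 'sample_01\t42 reads\n'  # made-up record

numbers = re.findall(r'\d+', line)         # runs of digits
words = re.findall(r'\w+', line)           # runs of letters, digits, underscore
fields = re.split(r'\s+', line.strip())    # split on any run of whitespace
digits_only = re.sub(r'\D', '', line)      # delete every non-digit

# '\u0663' is ARABIC-INDIC DIGIT THREE: a digit, but not in [0-9].
unicode_digits = re.findall(r'\d', '3\u0663')
ascii_digits = re.findall(r'\d', '3\u0663', flags=re.ASCII)

print(numbers, words, fields, digits_only)
print(len(unicode_digits), len(ascii_digits))  # 2 1
```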

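These are the most commonly used symbols; fluency comes gradually with practice. Note that for str patterns in Python 3 the equivalences given above for \d, \s, and \w hold exactly only with the re.ASCII flag; by default they also match the corresponding Unicode characters.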


The following covers the commonly used functions of the re module
Standard library reference documentation
HOW-TO

re.compile(pattern, flags=0)

Documentation:

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods.
Usage:

    prog = re.compile(pattern)
    result = prog.match(string)

This is equivalent to:

    result = re.match(pattern, string)

However, using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
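
A minimal sketch of the reuse pattern: compile once, then apply the same pattern object to many strings. The list `lines` and the names `number` and `first_numbers` are made up for illustration.

```python
import re

number = re.compile(r'\d+')  # compiled once, reused below

lines = ['id 12', 'no number', 'id 7 and 8']  # made-up input

first_numbers = []
for line in lines:
    m = number.search(line)  # same call shape as re.search(pattern, line)
    first_numbers.append(m.group() if m else None)

all_numbers = [number.findall(line) for line in lines]

print(first_numbers)  # ['12', None, '7']
print(all_numbers)    # [['12'], [], ['7', '8']]
```

The compiled object offers the same operations as the module-level functions (`search`, `match`, `findall`, `sub`, and so on), just without the pattern argument.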

re.search(pattern, string, flags=0)

Documentation:

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern.
For example:

    >>> import re
    >>> a = 'abcddddbcdbcda'
    >>> re.search('d', a)
    <_sre.SRE_Match object; span=(3, 4), match='d'>
    >>> RE = re.search('d', a)
    >>> RE.start()
    3
    >>> RE.end()
    4
    >>> RE.span()
    (3, 4)
    >>> RE.group()
    'd'
    

re.match(pattern, string, flags=0)

Documentation:

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern. If you want to locate a match anywhere in string, use search() instead.

For example:

    >>> import re
    >>> a = 'abcddddbcdbcda'
    >>> re.match('d', a)
    >>> re.match('a', a)
    <_sre.SRE_Match object; span=(0, 1), match='a'>
    >>> re.match('ab', a)
    <_sre.SRE_Match object; span=(0, 2), match='ab'>
    

Note: re.search() and re.match() both return a match object (or None when there is no match). The main operations on a match object are:

  1. group() Return the string matched by the RE
  2. start() Return the starting position of the match
  3. end() Return the ending position of the match
  4. span() Return a tuple containing the (start, end) positions of the match
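
Because both functions return None when nothing matches, calling group() on the result without checking raises an AttributeError. The sketch below wraps the check in a small helper; `describe` is a hypothetical name introduced only for this example.

```python
import re

a = 'abcddddbcdbcda'

def describe(m):
    """Return (text, start, end) for a match object, or None if nothing matched."""
    if m is None:
        return None
    return (m.group(), m.start(), m.end())

at_start = describe(re.match(r'd+', a))       # match() only looks at position 0
first_run = describe(re.search(r'd+', a))     # search() scans the whole string
whole_string = describe(re.fullmatch(r'd+', a))  # fullmatch() needs the entire string

print(at_start)      # None
print(first_run)     # ('dddd', 3, 7)
print(whole_string)  # None
print(a[3:7])        # start() and end() work directly as slice bounds: dddd
```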

re.split(pattern, string, maxsplit=0, flags=0)

Documentation:

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

For example:

    >>> import re
    >>> a = ' Words, words: words;   words!  '
    >>> re.split(',', a)
    [' Words', ' words: words;   words!  ']
    >>> re.split(r'\W', a)
    ['', 'Words', '', 'words', '', 'words', '', '', '', 'words', '', '', '']
    >>> re.split(r'\W+', a)
    ['', 'Words', 'words', 'words', 'words', '']
    >>> re.split(r'(\W+)', a)
    ['', ' ', 'Words', ', ', 'words', ': ', 'words', ';   ', 'words', '!  ', '']
    >>> re.split(r'\W+', a, maxsplit=2)
    ['', 'Words', 'words: words;   words!  ']
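
The empty strings at both ends of the results above appear whenever the string starts or ends with a separator. One common way to deal with them, sketched here with made-up variable names, is to filter them out or to make the pattern absorb surrounding whitespace.

```python
import re

a = ' Words, words: words;   words!  '

# Drop the empty strings produced by leading and trailing separators.
words = [w for w in re.split(r'\W+', a) if w]

# Split on ',' or ';' and swallow any whitespace around the delimiter.
fields = re.split(r'\s*[,;]\s*', 'a , b;c ;  d')  # made-up input

print(words)   # ['Words', 'words', 'words', 'words']
print(fields)  # ['a', 'b', 'c', 'd']
```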

re.findall(pattern, string, flags=0)

Documentation:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.

For example:

    >>> import re
    >>> a = 'ABAACAAADAAAAE'
    >>> re.findall('A+', a)
    ['A', 'AA', 'AAA', 'AAAA']
    >>> re.findall('AA', a)
    ['AA', 'AA', 'AA', 'AA']
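
Two details of findall() are worth a sketch. When the pattern contains groups, the list holds the group contents instead of the whole match. And "non-overlapping" means 'AAAA' yields only two 'AA' matches; if overlapping occurrences are wanted, one workaround (not covered above) is a lookahead, `(?=...)`, which matches without consuming characters. The sample strings and names below are made up.

```python
import re

a = 'ABAACAAADAAAAE'

# With two groups, each result is a tuple of the captured parts.
pairs = re.findall(r'(\w)=(\d)', 'a=1 b=2')

# Non-overlapping: the scan resumes after the end of each match.
plain = re.findall(r'AA', a)

# Overlapping: the lookahead consumes nothing, so every start position is tried.
overlapping = re.findall(r'(?=(AA))', a)

print(pairs)             # [('a', '1'), ('b', '2')]
print(len(plain))        # 4
print(len(overlapping))  # 6
```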

re.finditer(pattern, string, flags=0)

Documentation:

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found.

For example:

    >>> import re
    >>> a = 'ABAACAAADAAAAE'
    >>> re.finditer('A', a)
    <callable_iterator object at 0xb7277bec>
    >>> RE = re.finditer('A', a)
    >>> for match in RE:
    ...     print(match)
    ... 
    <_sre.SRE_Match object; span=(0, 1), match='A'>
    <_sre.SRE_Match object; span=(2, 3), match='A'>
    <_sre.SRE_Match object; span=(3, 4), match='A'>
    <_sre.SRE_Match object; span=(5, 6), match='A'>
    <_sre.SRE_Match object; span=(6, 7), match='A'>
    <_sre.SRE_Match object; span=(7, 8), match='A'>
    <_sre.SRE_Match object; span=(9, 10), match='A'>
    <_sre.SRE_Match object; span=(10, 11), match='A'>
    <_sre.SRE_Match object; span=(11, 12), match='A'>
    <_sre.SRE_Match object; span=(12, 13), match='A'>
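
Since finditer() yields match objects, it is the natural choice when positions matter, which is often the case in sequence processing. The sketch below collects every run of 'A' with its coordinates and picks the longest one; the variable names are made up for illustration.

```python
import re

a = 'ABAACAAADAAAAE'

# (start, end, text) for every run of one or more 'A'.
runs = [(m.start(), m.end(), m.group()) for m in re.finditer(r'A+', a)]

# The longest run, chosen by span length.
longest = max(re.finditer(r'A+', a), key=lambda m: m.end() - m.start())

print(runs)            # [(0, 1, 'A'), (2, 4, 'AA'), (5, 8, 'AAA'), (9, 13, 'AAAA')]
print(longest.span())  # (9, 13)
```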
    

re.sub(pattern, repl, string, count=0, flags=0)

Documentation:

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth.

For example:

    >>> import re
    >>> a = 'ABAACAAADAAAAE'
    >>> re.sub('A', 'a', a)
    'aBaaCaaaDaaaaE'
    >>> re.sub('A', 'a', a, count=2)
    'aBaACAAADAAAAE'
    >>> re.sub('C', '\n', a)
    'ABAA\nAAADAAAAE'
    >>> c = re.sub('C', '\n', a)
    >>> print(c)
    ABAA
    AAADAAAAE
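
The documentation quoted above says repl can also be a function. In that case it is called with each match object and its return value becomes the replacement. The following is a minimal sketch; `run_length` is a hypothetical helper written for this example. It also shows a backreference (`\1`) in a string replacement and re.subn(), which additionally reports how many replacements were made.

```python
import re

a = 'ABAACAAADAAAAE'

def run_length(m):
    """Replace a run of 'A' with 'A' followed by the length of the run."""
    return 'A' + str(len(m.group()))

encoded = re.sub(r'A+', run_length, a)   # function as repl
bracketed = re.sub(r'(A+)', r'[\1]', a)  # '\1' refers back to group 1
lowered, n = re.subn('A', 'a', a)        # new string plus replacement count

print(encoded)     # A1BA2CA3DA4E
print(bracketed)   # [A]B[AA]C[AAA]D[AAAA]E
print(lowered, n)  # aBaaCaaaDaaaaE 10
```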