A Detailed Guide to Each Parameter of Python's re.sub

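I. re.sub(pattern, repl, string, count=0, flags=0)
II. Parameter details
- 1. The pattern parameter
- 2. The repl parameter
  - 2.1 repl as a string
  - 2.2 repl as a function
- 3. The string parameter
- 4. The count parameter
- 5. The flags parameter
  - 5.1 IGNORECASE (short form I)
  - 5.2 LOCALE (short form L)
  - 5.3 MULTILINE (short form M)
  - 5.4 DOTALL (short form S)
  - 5.5 VERBOSE (short form X)
Supplement: using a function as repl
Summary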

I. re.sub(pattern, repl, string, count=0, flags=0)

re is Python's regular expression module, and sub is short for substitute.

re.sub takes five parameters:

re.sub(pattern, repl, string, count=0, flags=0)

Three are required: pattern, repl, string.

Two are optional: count, flags.

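II. Parameter details

1. The pattern parameter

pattern is the regular expression to search for. It can be a pattern string or a compiled pattern object.

One thing worth knowing:

A backslash followed by a number, \N, is a backreference to a matched group.

For example, \6 matches the same text that the 6th group in the pattern matched.

This means the group must already appear earlier in the pattern before it can be referenced.

For example, given:

hello xinfa, nihao xinfa

Suppose we want to replace the whole sentence with linxinfa, but only when the same name appears in both places. We can write: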

    import re

    inputStr = "hello xinfa, nihao xinfa"

    # \1 must match the same text that group 1, (\w+), captured
    replacedStr = re.sub(r"hello (\w+), nihao \1", "linxinfa", inputStr)

    print("replacedStr =", replacedStr)

    # Output: replacedStr = linxinfa

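Note that in (\w+) above, the parentheses define a group.

\w matches letters, digits, and the underscore. For ASCII text it is equivalent to [A-Za-z0-9_]; for str patterns in Python 3 it also matches Unicode word characters unless re.ASCII is set.

+ matches the preceding subexpression one or more times.

So (\w+) matches a run of one or more word characters. The \1 in the pattern refers to the first group, which is (\w+).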
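As a further illustration of backreferences inside pattern, the sketch below collapses immediately repeated words. The sample sentence and the variable names are made up for this example.

```python
import re

# Hypothetical sample text with accidental doubled words
text = "the the cat sat sat down"

# (\w+) captures a word; \1 then requires the same word again.
# \b keeps the match on whole-word boundaries.
deduped = re.sub(r"\b(\w+) \1\b", r"\1", text)

print(deduped)  # the cat sat down
```

Note that \1 appears twice here with two roles: in pattern it must match the captured word again, and in repl it inserts that word.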

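2. The repl parameter

repl stands for replacement: the text that each match is replaced with.

repl can be a string or a function.

2.1 repl as a string

If repl is a string, backslash escapes in it are processed.

For example:

\n is converted to a newline character;

\r is converted to a carriage return;

Unknown escapes made of a backslash and an ASCII letter, such as \j, are not reduced to the plain letter: since Python 3.7 they raise re.error. Other unknown escapes such as \& are left as they are;

\g<n> is special: \g refers to a matched group and n is the group number, so \g<1> is the first group. \g<name> refers to a named group in the same way.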
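The escape handling in a string repl can be checked with a short sketch. It assumes Python 3.7 or later, where an unknown escape made of a backslash and an ASCII letter is rejected; the inputs are made up for illustration.

```python
import re

# r"\n" is two characters (backslash, n); re.sub turns it into a newline
lines = re.sub(r"\s*,\s*", r"\n", "a , b,c")
print(repr(lines))  # 'a\nb\nc'

# A backslash followed by an unknown ASCII letter is rejected (Python 3.7+)
try:
    re.sub(r"x", r"\j", "x")
    bad_escape_rejected = False
except re.error:
    bad_escape_rejected = True
print(bad_escape_rejected)
```

To insert a literal backslash followed by a letter, double the backslash in repl or use a function as repl, whose return value is used as is.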

Using the same example, suppose we want to extract xinfa so that only xinfa remains:

hello xinfa, nihao xinfa

We can write:

    import re

    inputStr = "hello xinfa, nihao xinfa"

    # Use a raw string for repl so that \g reaches re.sub unchanged
    replacedStr = re.sub(r"hello (\w+), nihao \1", r"\g<1>", inputStr)

    print("replacedStr =", replacedStr)

    # Output: replacedStr = xinfa
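Group references in repl also work with named groups, and the \g<...> form avoids ambiguity when a digit follows the reference. The sketch below shows both; the date string and group names are hypothetical.

```python
import re

# Hypothetical ISO-style date used only for illustration
date = "2024-01-31"

# \g<name> refers to a named group in the replacement string
swapped = re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    r"\g<day>/\g<month>/\g<year>",
    date,
)
print(swapped)  # 31/01/2024

# \g<1>0 means "group 1, then a literal 0"; \10 would mean group 10
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
print(padded)  # a10b20
```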

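2.2 repl as a function

If repl is a function, it is called once for each match with the match object as its only argument, and the string it returns is used as the replacement.

Suppose the input is:

hello 123 world 456

and we want to add 111 to each number to get:

hello 234 world 567

We can write: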

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re

    def pythonReSubDemo():
        """Demo of Python re.sub with a function as repl."""
        inputStr = "hello 123 world 456"

        def _add111(matched):
            # matched is the match object; read the group named "number"
            intValue = int(matched.group("number"))
            return str(intValue + 111)

        replacedStr = re.sub(r"(?P<number>\d+)", _add111, inputStr)
        print("replacedStr =", replacedStr)
        # Output: replacedStr = hello 234 world 567

    if __name__ == "__main__":
        pythonReSubDemo()

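Note the (?P<number>...) syntax used above.

(?P<number>\d+) defines a group named number whose content must match \d+. Inside the function, matched.group("number") returns the text captured by that group.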
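A replacement function is not limited to arithmetic. As a sketch under made-up assumptions (the values table, the fill helper, and the {key} placeholder syntax are all hypothetical), a named group can drive a dictionary lookup:

```python
import re

# Hypothetical lookup table for placeholder values
values = {"name": "xinfa", "age": "18"}

def fill(matched):
    key = matched.group("key")
    # Fall back to the original text (group 0) when the key is unknown
    return values.get(key, matched.group(0))

template = "Hello {name}, you are {age}; {unknown} stays"
filled = re.sub(r"\{(?P<key>\w+)\}", fill, template)
print(filled)  # Hello xinfa, you are 18; {unknown} stays
```

Returning matched.group(0) leaves a match untouched, which is a simple way to replace only some of the matches.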

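3. The string parameter

string is the input string to search and replace in. re.sub does not modify it; it returns a new string, and returns string unchanged if the pattern does not match.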

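4. The count parameter

count is the maximum number of matches to replace, counted from the left. The default, 0, replaces every match.

Continuing the previous example, suppose we only want to process some of the matches.

For example:

hello 123 world 456 nihao 789

We only want to add 111 to the first two numbers, 123 and 456, and leave 789 alone.

We can write: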

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re

    def pythonReSubDemo():
        """Demo of Python re.sub with the count parameter."""
        inputStr = "hello 123 world 456 nihao 789"

        def _add111(matched):
            intValue = int(matched.group("number"))
            return str(intValue + 111)

        # count=2: replace only the first two matches.
        # Passing count by keyword is preferred; positional use is deprecated since Python 3.13.
        replacedStr = re.sub(r"(?P<number>\d+)", _add111, inputStr, count=2)
        print("replacedStr =", replacedStr)
        # Output: replacedStr = hello 234 world 567 nihao 789

    if __name__ == "__main__":
        pythonReSubDemo()
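The effect of different count values can be compared side by side. This is a minimal sketch with a made-up input string:

```python
import re

text = "a1 b2 c3"

first_only = re.sub(r"\d", "#", text, count=1)
first_two = re.sub(r"\d", "#", text, count=2)
all_of_them = re.sub(r"\d", "#", text, count=0)  # 0 is the default

print(first_only)   # a# b2 c3
print(first_two)    # a# b# c3
print(all_of_them)  # a# b# c#
```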

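5. The flags parameter

flags holds compilation flags, which modify how the regular expression behaves.

Each flag is available in the re module under two names: a long name such as IGNORECASE and a short one-letter form such as I. (If you are familiar with Perl's pattern modifiers, the one-letter forms use the same letters; for example, the short form of re.VERBOSE is re.X.)

Multiple flags can be combined with bitwise OR. For example, re.I | re.M sets both the I and M flags.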
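To see what combining flags with | changes in practice, the sketch below runs the same substitution with one flag and with two. The input text is made up for illustration.

```python
import re

text = "hello\nHELLO\nhello"

# re.I alone: ^ still matches only at the very start of the string
only_i = re.sub(r"^hello", "Hi", text, flags=re.I)
# re.I | re.M: case-insensitive, and ^ matches at the start of every line
both = re.sub(r"^hello", "Hi", text, flags=re.I | re.M)

print(repr(only_i))  # 'Hi\nHELLO\nhello'
print(repr(both))    # 'Hi\nHi\nHi'
```

Passing flags by keyword, as here, avoids accidentally handing a flag value to the count parameter, which comes before flags in the signature.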

The commonly used flags are listed below.

5.1 IGNORECASE (short form I)

Makes matching case-insensitive.

For example, [A-Z] will also match lowercase letters, and Spam will match Spam, spam, or spAM.
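Applied to re.sub, the flag looks like this. It is a small sketch with a made-up input:

```python
import re

text = "Spam spam SPAM"

case_sensitive = re.sub(r"spam", "eggs", text)
case_insensitive = re.sub(r"spam", "eggs", text, flags=re.IGNORECASE)

print(case_sensitive)    # Spam eggs SPAM
print(case_insensitive)  # eggs eggs eggs
```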

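5.2 LOCALE (short form L)

Locales are a feature of the C library intended to help write programs that take language differences into account.

For example, if you are processing encoded French text, you would want to use \w+ to match words, but in bytes patterns \w only matches the character class [A-Za-z0-9_]; it will not match the bytes for é.

If your system is configured properly and a French locale is selected, C library functions will tell the program that the byte for é should also be considered a letter.

Setting the LOCALE flag when compiling a regular expression makes the compiled object use these C functions for \w; this is slower, but it lets \w+ match French words as you would expect. In Python 3 the flag can only be used with bytes patterns and is discouraged, because str patterns already match Unicode word characters such as é by default.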
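As a point of comparison, and assuming Python 3 with str patterns, the sketch below shows that \w handles accented letters without LOCALE, and that re.ASCII restricts it to ASCII word characters. The sample text is made up.

```python
import re

text = "café été"

unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['café', 'été']
print(ascii_words)    # ['caf', 't']
```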

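5.3 MULTILINE (short form M)

MULTILINE changes the behavior of ^ and $.

By default, ^ matches only at the start of the string, and $ matches only at the end of the string and just before a newline (if any) at the end of the string.

When this flag is set, ^ matches at the start of the string and at the start of each line within it, immediately after each newline. Similarly, $ matches at the end of the string and at the end of each line, immediately before each newline.

For example: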

    import re

    s = 'hello \nworld \nxinfa'
    print(s)

    # Default: ^ matches only at the start of the string
    pattern = re.compile(r'^\w+')
    print(pattern.findall(s))

    # With flags=re.M, ^ also matches at the start of each line
    pattern = re.compile(r'^\w+', flags=re.M)
    print(pattern.findall(s))

Output:

hello
world
xinfa
['hello']
['hello', 'world', 'xinfa']
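The same flag matters for re.sub. As a sketch that reuses the sample string from the example, the code below strips trailing spaces from every line, which only works when $ is allowed to match before each newline:

```python
import re

s = 'hello \nworld \nxinfa'

# Without re.M, $ only matches at the end of the whole string
no_flag = re.sub(r"[ \t]+$", "", s)
# With re.M, $ also matches just before each newline
with_flag = re.sub(r"[ \t]+$", "", s, flags=re.M)

print(repr(no_flag))    # 'hello \nworld \nxinfa'
print(repr(with_flag))  # 'hello\nworld\nxinfa'
```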

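5.4 DOTALL (short form S)

By default, . matches any character except a newline. With this flag set, . matches any character at all, including a newline.

For example: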

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re

    s = 'first line\nsecond line\nthird line'

    # Default: . stops at each newline, so every line is a separate match
    regex = re.compile('.+')
    print(regex.findall(s))

    # With re.S, . also matches newlines, so the whole string is one match
    regex = re.compile('.+', re.S)
    print(regex.findall(s))

Output:

['first line', 'second line', 'third line']
['first line\nsecond line\nthird line']
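With re.sub, DOTALL is what lets a single match span several lines. The sketch below removes a block comment from a made-up input; the comment syntax is only an illustration.

```python
import re

# Hypothetical input containing a comment that spans two lines
source = "a /* x\ny */ b"

no_flag = re.sub(r"/\*.*?\*/", "", source)
with_flag = re.sub(r"/\*.*?\*/", "", source, flags=re.S)

print(repr(no_flag))    # 'a /* x\ny */ b'
print(repr(with_flag))  # 'a  b'
```

Without the flag, .*? cannot cross the newline inside the comment, so nothing matches and the input comes back unchanged.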

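5.5 VERBOSE (short form X)

Verbose mode. Whitespace in the pattern is ignored, except inside a character class or when escaped with a backslash, and everything from an unescaped # to the end of the line is treated as a comment.

For example, the two patterns below are equivalent: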

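    import re

    # Compact form
    email_regex = re.compile(r"[\w+\.]+@[a-zA-Z\d]+\.(com|cn)")

    # The same pattern in verbose form, with comments
    email_regex = re.compile(r"""[\w+\.]+      # part before the @ sign
                                 @             # the @ sign
                                 [a-zA-Z\d]+   # mail provider
                                 \.(com|cn)    # domain suffix
                              """, re.X)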
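A verbose pattern can be passed to re.sub like any other. The sketch below masks the part before the @ sign; the EMAIL pattern is a deliberately simplified assumption, not a complete email validator, and the addresses are made up.

```python
import re

# Simplified, hypothetical email pattern written in verbose mode
EMAIL = re.compile(r"""
    (?P<user>[\w.+-]+)                      # part before the @ sign
    @                                       # the @ sign
    (?P<domain>[A-Za-z\d-]+\.(?:com|cn))    # provider and suffix
""", re.X)

text = "contact: tom@example.com, li@test.cn"

# Keep the domain, hide the user part
masked = EMAIL.sub(r"***@\g<domain>", text)
print(masked)  # contact: ***@example.com, ***@test.cn
```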

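Supplement: using a function as repl

Using a function as repl makes replacement more flexible: the function can inspect each match and decide what to replace it with.

Example: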

    import re

    # Multiply each matched number by 2
    def double(matched):
        print('matched: ', matched)
        print("matched.group('value'): ", matched.group('value'))
        value = int(matched.group('value'))
        return str(value * 2)

    string = 'A23G4HFD567'
    print(re.sub(r'(?P<value>\d+)', double, string))
    # The last line prints: A46G8HFD1134
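To illustrate replacing differently depending on the match, here is a variation on the example that doubles only numbers below 100. The threshold and the function name double_if_small are made up for this sketch.

```python
import re

def double_if_small(matched):
    value = int(matched.group(0))  # group(0) is the whole match
    # Only double numbers below 100; leave larger ones as they are
    return str(value * 2) if value < 100 else matched.group(0)

result = re.sub(r"\d+", double_if_small, "A23G4HFD567")
print(result)  # A46G8HFD567
```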