Python Modules: re (Regular Expressions)

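Matching flags

re.ASCII
Same as re.A; the inline flag is (?a). Makes the escapes \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only. It is meaningful only for str patterns and is ignored for bytes patterns. (The flag that exists purely for backward compatibility is re.UNICODE, because str patterns already match Unicode by default in Python 3.)

re.DEBUG
Displays debug information about the compiled expression. It has no inline flag.

re.IGNORECASE
Same as re.I; the inline flag is (?i). Performs case-insensitive matching, so an expression such as [A-Z] also matches the lowercase letters a-z. This applies to Unicode characters as well (for example 'Ü' matches 'ü') unless re.ASCII is also given to disable non-ASCII matching. The current locale does not change the effect of this flag unless re.LOCALE is also given. For str patterns, [a-z] or [A-Z] combined with IGNORECASE matches the 52 ASCII letters plus 4 non-ASCII letters: 'İ' (U+0130), 'ı' (U+0131), 'ſ' (U+017F) and 'K' (U+212A).

re.LOCALE
Same as re.L; the inline flag is (?L). Makes \w, \W, \b, \B and case-insensitive matching depend on the current locale; it can only be used with bytes patterns. Its use is discouraged.

re.MULTILINE
Same as re.M; the inline flag is (?m). Multi-line mode changes the behavior of ^ and $. By default ^ matches only at the start of the string; with this flag it also matches at the start of each line (right after each newline). By default $ matches only at the end of the string (and just before a trailing newline); with this flag it also matches at the end of each line (right before each newline).

re.DOTALL
Same as re.S; the inline flag is (?s). In this mode the metacharacter . matches any character, including a newline.

re.VERBOSE
Same as re.X; the inline flag is (?x). Verbose mode lets you add whitespace and comments to an expression to make it more readable; that extra whitespace and the comments are ignored when the pattern is compiled.

Module-level functions

re.compile(pattern, flags=0)
Compiles the regular expression pattern and returns a Pattern object (re.Pattern, displayed as SRE_Pattern before Python 3.7). The flags argument selects the matching flags.

re.search(pattern, string, flags=0)
Scans string for the first location where pattern produces a match and returns a Match object (re.Match, displayed as SRE_Match before Python 3.7). Returns None if nothing matches.

re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match pattern, returns a Match object. Returns None if there is no match. Even in MULTILINE mode, match() only matches at the beginning of the string, not at the beginning of each line.

re.fullmatch(pattern, string, flags=0)
If the whole string matches pattern, returns a Match object. Returns None if it does not.

re.split(pattern, string, maxsplit=0, flags=0)
Splits string using pattern as the separator and returns the resulting list. If maxsplit is nonzero, at most maxsplit splits are made and the remainder of string is returned as the last element of the list. If pattern contains capturing groups (...), the text of every group in the separator is also included in the list.

    >>> re.split(r'\W+', 'Words, words, words.')
    ['Words', 'words', 'words', '']
    >>> re.split(r'(\W+)', 'Words, words, words.')
    ['Words', ', ', 'words', ', ', 'words', '.', '']
    >>> re.split(r'\W+', 'Words, words, words.', 1)
    ['Words', 'words, words.']
    >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
    ['0', '3', '9']

If the pattern matches at the start of the string, the first element of the list is an empty string; likewise, if it matches at the end of the string, the last element is an empty string:

    >>> re.split(r'(\W+)', '...words, words...')
    ['', '...', 'words', ', ', 'words', '...', '']

re.findall(pattern, string, flags=0)
Returns a list of all matches, in the order found. If pattern contains exactly one group, the list holds the strings matched by that group; if it contains more than one group, each element is a tuple with one item per group. An empty list means nothing matched.

    >>> content = '333STR1666STR299'
    >>> regex = r'([A-Z]+(\d))'
    >>> re.findall(regex, content)
    [('STR1', '1'), ('STR2', '2')]
    >>> regex1 = r'[A-Z]+(\d)'
    >>> re.findall(regex1, content)
    ['1', '2']
    # If the pattern has no groups, the whole match is treated as the group
    >>> regex2 = r'[A-Z]+\d'
    >>> re.findall(regex2, content)
    ['STR1', 'STR2']
    >>> regex3 = r'([A-Z]+\d)'
    >>> re.findall(regex3, content)
    ['STR1', 'STR2']

re.finditer(pattern, string, flags=0)
Finds all matches and returns an iterator that yields Match objects. An empty iterator means nothing matched.

    content = '333STR1666STR299'
    regex = r'([A-Z]+(\d))'
    result = re.finditer(regex, content)
    for i in result:
        print(i.group(0))
    # STR1
    # STR2

re.sub(pattern, repl, string, count=0, flags=0)
Matches pattern against string, replaces each match with repl, and returns the new string. If nothing matches, the original string is returned unchanged. count is a non-negative integer giving the maximum number of replacements; the default 0 replaces every match. repl can be a string or a function. If it is a string, backslash escapes in it are processed: \n becomes a newline, the backreference \6 is replaced by the sixth group of the match, and an escape with no meaning is either left as-is or raises an error:

    >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
    ...        r'static PyObject*\npy_\1(void)\n{',
    ...        'def myfunc():')
    'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it takes a single Match object as its argument, is called once per match, and returns the replacement string:

    >>> def dashrepl(matchobj):
    ...     if matchobj.group(0) == '-': return ' '
    ...     else: return '-'
    >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
    'pro--gram files'
    >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
    'Baked Beans & Spam'

re.subn(pattern, repl, string, count=0, flags=0)
Same as sub(), but returns the tuple (new_string, number_of_subs_made).

re.escape(pattern)
Escapes the special characters in pattern.

re.purge()
Clears the regular expression cache.

Exceptions

exception re.error(msg, pattern=None, pos=None)

Attributes:
msg: the unformatted error message
pattern: the regular expression pattern
pos: the index in pattern where the error occurred; may be None
lineno: the line corresponding to pos; may be None
colno: the column corresponding to pos; may be None

Pattern objects

Methods

Pattern.search(string[, pos[, endpos]])
Similar to the module-level search(). pos and endpos restrict the search to the first endpos characters of string, starting at index pos. If endpos is less than pos, no match is found.

Pattern.match(string[, pos[, endpos]])
Similar to the module-level match(). pos and endpos have the same meaning as in search().

    >>> pattern = re.compile("o")
    >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
    >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
    <re.Match object; span=(1, 2), match='o'>

Pattern.fullmatch(string[, pos[, endpos]])
Similar to the module-level fullmatch(). pos and endpos have the same meaning as in search().

    >>> pattern = re.compile("o[gh]")
    >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
    >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
    >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
    <re.Match object; span=(1, 3), match='og'>

Pattern.split(string, maxsplit=0)
Identical to the module-level split().

Pattern.findall(string[, pos[, endpos]])
Similar to the module-level findall(). pos and endpos have the same meaning as in search().

Pattern.finditer(string[, pos[, endpos]])
Similar to the module-level finditer(). pos and endpos have the same meaning as in search().

Pattern.sub(repl, string, count=0)
Identical to the module-level sub().

Pattern.subn(repl, string, count=0)
Identical to the module-level subn().

Attributes

Pattern.flags: an integer describing the matching flags. It combines (bitwise OR) the flags argument given to compile(), any inline flags in the pattern, and the implicit re.UNICODE flag when the pattern is a str.

    >>> re.UNICODE
    <RegexFlag.UNICODE: 32>
    >>> re.IGNORECASE
    <RegexFlag.IGNORECASE: 2>
    # 32 + 2
    >>> re.compile("", flags=re.IGNORECASE).flags
    34

Pattern.groups: the number of capturing groups in the pattern.

Pattern.groupindex: a mapping from every named group in the expression to its group number; an empty mapping if no named groups are used.

    >>> pattern = re.compile(r"(?P<first_name>\w+) (?P<last_name>\w+)")
    >>> pattern.groupindex
    mappingproxy({'first_name': 1, 'last_name': 2})

Pattern.pattern: the pattern string from which the Pattern object was compiled.

Match objects

Methods

Match.expand(template)
Returns the string obtained by doing backslash substitution on template. For example, \n is converted to a newline, and \1 or \g<name> is replaced by the corresponding group of the Match object:

    >>> m = re.search("(b)+(z)?", "cba")
    >>> m
    <re.Match object; span=(1, 2), match='b'>
    >>> m.expand(r'ab\1')
    'abb'
    >>> m.expand(r'ab\2')
    'ab'
    >>> print(m.expand(r'ab\n'))
    ab

    >>>

Match.group([group1, ...])
Returns one or more subgroups of the match. With a single argument the result is a single string; with several arguments the result is a tuple with one item per argument.

(1) If the argument is 0, the return value is the entire matching string.
(2) If the argument is in the range 1-99, the return value is the string matched by the corresponding group.
(3) If the argument is negative or larger than the number of groups defined in the pattern, IndexError is raised.
(4) If the corresponding group did not take part in the match, the return value is None.
(5) If a group matched several times, only the last match is returned.

    >>> m = re.match(r"(\w+) (\w+)(\d+)?", "Isaac Newton, physicist")
    >>> m.group(0)          # (1)
    'Isaac Newton'
    >>> m.group(1)          # (2)
    'Isaac'
    >>> m.group(2)          # (2)
    'Newton'
    >>> m.group(1, 2)       # Multiple arguments give us a tuple.
    ('Isaac', 'Newton')
    >>> type(m.group(3))    # (4)
    <class 'NoneType'>
    >>> m = re.match(r"(..)+", "a1b2c3")   # Matches 3 times.
    >>> m.group(1)          # (5)
    'c3'

If the regular expression uses (?P<name>...), group() also accepts group names; a name that does not exist raises IndexError:

    >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    >>> m.group('first_name')
    'Malcolm'
    >>> m.group('last_name')
    'Reynolds'
    # Access by index still works
    >>> m.group(1)
    'Malcolm'
    >>> m.group(2)
    'Reynolds'

Match.__getitem__(g)
Equivalent to group(g); provides a simpler way to access a single group:

    >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
    >>> m[0]    # The entire match
    'Isaac Newton'
    >>> m[1]    # The first parenthesized subgroup.
    'Isaac'
    >>> m[2]    # The second parenthesized subgroup.
    'Newton'
    >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    >>> m["first_name"]
    'Malcolm'

Match.groups(default=None)
Returns a tuple containing all subgroups; its length equals the number of groups in the pattern, and it is empty if the pattern has no groups. default is used for groups that did not take part in the match and defaults to None:

    >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
    >>> m.groups()
    ('24', '1632')
    >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
    >>> m.groups()      # Second group defaults to None.
    ('24', None)
    >>> m.groups('0')   # Now, the second group defaults to '0'.
    ('24', '0')

Match.groupdict(default=None)
Returns a dictionary whose keys are the group names defined in the pattern and whose values are the strings those groups matched; it is empty if no named groups are used. default is used for groups that did not take part in the match and defaults to None:

    >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    >>> m.groupdict()
    {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

Match.start([group])
Match.end([group])
Return the start and end indices, in the original string, of the substring matched by group.

(1) group defaults to 0, meaning the whole match.
(2) A return value of -1 means the group exists but did not take part in the match.
(3) If m.start(group) equals m.end(group), the group matched an empty string.

    >>> m = re.match(r"(\w+) (\w+)(\d)?", "Isaac Newton, physicist")
    >>> m
    <re.Match object; span=(0, 12), match='Isaac Newton'>
    # (1)
    >>> m.start()
    0
    >>> m.end()
    12
    # (2)
    >>> type(m[3])
    <class 'NoneType'>
    >>> m.start(3)
    -1
    >>> m.end(3)
    -1
    # (3) the third group is changed to (\d*) so that it matches an empty string
    >>> m = re.match(r"(\w+) (\w+)(\d*)", "Isaac Newton, physicist")
    >>> m[3]
    ''
    >>> m.start(3)
    12
    >>> m.end(3)
    12

Match.span([group])
Returns the tuple (m.start(group), m.end(group)); if group did not take part in the match, the result is (-1, -1). group defaults to 0, meaning the whole match.

Attributes

Match.pos: the pos value passed to the search(), match() or fullmatch() method of the Pattern object.

Match.endpos: the endpos value passed to the search(), match() or fullmatch() method of the Pattern object.

Match.lastindex: the index of the last group that matched, or None if no group matched at all.

    >>> m = re.search(r"a(z)?", "ab")
    >>> type(m.lastindex)
    <class 'NoneType'>
    >>> m = re.match(r"(\w+) (\w+)(\d)?", "Isaac Newton, physicist")
    >>> m.lastindex
    2

Match.lastgroup: the name of the last group that matched, or None if that group has no name or no group matched at all.

Match.re: the Pattern object whose method produced this Match object.

Match.string: the original string that was matched against.

New features of the re module in Python 3.7

Non-empty matches can now start just after a previous empty match:

    # before Python 3.7
    >>> re.sub('x*', '-', 'abxd')
    '-a-b-d-'
    # Python 3.7
    >>> re.sub('x*', '-', 'abxd')
    '-a-b--d-'

Unknown escapes in repl consisting of '\' and an ASCII letter now are errors:

    # before Python 3.7
    >>> print(re.sub(r'\w+', r'\d', 'ab&xd&'))
    \d&\d&
    # Python 3.7
    >>> print(re.sub(r'\w+', r'\d', 'ab&xd&'))
    ...
    re.error: bad escape \d at position 0

Only characters that can have special meaning in a regular expression are escaped:

    # before Python 3.7
    >>> print(re.escape("!#$%&"))
    \!\#\$\%\&
    # Python 3.7
    >>> print(re.escape("!#$%&"))
    !\#\$%\&

Added support of splitting on a pattern that could match an empty string:

    # before Python 3.7
    >>> re.split(r'\b', 'Words, words, words.')
    ...
    ValueError: split() requires a non-empty pattern match.
    >>> re.split(r'\W*', '...words...')
    ['', 'words', '']
    >>> re.split(r'(\W*)', '...words...')
    ['', '...', 'words', '...', '']
    # Python 3.7
    >>> re.split(r'\b', 'Words, words, words.')
    ['', 'Words', ', ', 'words', ', ', 'words', '.']
    >>> re.split(r'\W*', '...words...')
    ['', '', 'w', 'o', 'r', 'd', 's', '', '']
    >>> re.split(r'(\W*)', '...words...')
    ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

Added support of copy.copy() and copy.deepcopy(). Match objects are considered atomic.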
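The matching flags at the start of this article are described without examples. The following is a minimal sketch of how MULTILINE, DOTALL, VERBOSE and ASCII change matching; the sample strings and variable names are made up for illustration and do not come from the original text.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ matches only at the start of the whole string.
default_starts = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# By default "." does not match a newline; DOTALL lifts that restriction.
dot_default = re.search(r"line.second", text)
dot_all = re.search(r"line.second", text, flags=re.DOTALL)

# VERBOSE lets the pattern carry whitespace and comments.
version_re = re.compile(r"""
    (\d+)   # major version
    \.      # literal dot
    (\d+)   # minor version
""", re.VERBOSE)
version = version_re.fullmatch("3.7").groups()

# By default \w matches Unicode word characters; ASCII limits it to [a-zA-Z0-9_].
unicode_words = re.findall(r"\w+", "über")
ascii_words = re.findall(r"\w+", "über", flags=re.ASCII)

print(default_starts)     # ['first']
print(multiline_starts)   # ['first', 'second']
print(dot_default)        # None
print(dot_all.group())    # the two words joined across the newline
print(version)            # ('3', '7')
print(unicode_words)      # ['über']
print(ascii_words)        # ['ber']
```

The same flags can also be written inline, for example (?m) at the start of the pattern instead of flags=re.MULTILINE.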
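re.subn(), re.escape() and the pos/endpos arguments of Pattern methods are likewise described earlier without examples. A small sketch, again with made-up input strings and variable names:

```python
import re

# subn() returns the new string together with the number of substitutions made.
new_text, n_subs = re.subn(r"\d+", "#", "a1b22c333")

# escape() makes arbitrary text safe to embed in a pattern as a literal.
literal = "1+1=2"
sentence = "so 1+1=2 holds"
escaped_hit = re.search(re.escape(literal), sentence)
# Unescaped, "+" is a quantifier, so the pattern looks for "11=2" instead.
unescaped_hit = re.search(literal, sentence)

# pos/endpos limit where a compiled pattern may match without slicing the
# string, so the reported indices still refer to the original string.
digits = re.compile(r"\d+")
data = "abc123def456"
later = digits.search(data, 6)        # start searching at index 6
bounded = digits.findall(data, 0, 5)  # only look at the first 5 characters

print(new_text, n_subs)       # a#b#c# 3
print(escaped_hit.group())    # 1+1=2
print(unescaped_hit)          # None
print(later.span())           # (9, 12)
print(bounded)                # ['12']
```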