Regex Memo, by walker | 乐趣区

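The examples below use Python by default, with either the built-in re module or the third-party regex library. walker's guess: under Python 3's Unicode character set, \s in the re module matches \f, \n, \r, \t and \v plus the full-width and half-width spaces, 7 characters in all. The re documentation states that for str patterns \s matches every Unicode whitespace character, so the actual set is larger and also includes, for example, the non-breaking space U+00A0.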
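The guess about \s is easy to check by experiment. The following is a minimal sketch that only assumes the standard re module; the names guessed and extra are made up for illustration.

```python
import re

# The seven characters named in the guess:
# \f \n \r \t \v, the half-width space, and the full-width space (U+3000).
guessed = ['\f', '\n', '\r', '\t', '\v', ' ', '\u3000']
print(all(re.fullmatch(r'\s', ch) for ch in guessed))

# Probe the Basic Multilingual Plane for other characters that \s accepts.
extra = [ch for ch in map(chr, range(0x10000))
         if re.fullmatch(r'\s', ch) and ch not in guessed]
print(len(extra) > 0, '\xa0' in extra)

# With re.ASCII, \s is restricted to ASCII whitespace, so U+3000 no longer matches.
print(re.fullmatch(r'\s', '\u3000', re.ASCII))
```

If only ASCII whitespace should count, passing re.ASCII (or writing an explicit character class) is probably the safer choice.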

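Regular expression documentation

Regular expressions: a 30-minute introductory tutorial
Another good introductory tutorial
Demystifying regular expressions. walker finds this article's explanation of Multiline especially clear.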
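As a small illustration of what Multiline mode changes, here is a sketch using re.M from the standard library. The sample string content is made up for this example.

```python
import re

content = "first line\nsecond line\nthird line"

# Without re.M, ^ anchors only at the start of the whole string.
print(re.findall(r'^\w+', content))        # ['first']
# With re.M, ^ also matches right after each newline.
print(re.findall(r'^\w+', content, re.M))  # ['first', 'second', 'third']
# Likewise, $ matches just before each newline as well as at the very end.
print(re.findall(r'\w+$', content, re.M))  # ['line', 'line', 'line']
```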

Extract the double quotes together with the content between them

Using re.findall.

import re
text = '''abc"def"ghi'''
re.findall(r'"[^"]+"', text)
# Result: ['"def"']

Using re.search.

>>> text = '''abc"def"ghi'''
>>> re.search(r'"([^"]+)"', text).group(0)
'"def"'

Extract only the content between the double quotes

Using re.findall.

text = '''abc"def"ghi'''
re.findall(r'"([^"]+)"', text)
# Result: ['def']

Using re.search.

>>> text = '''abc"def"ghi'''
>>> re.search(r'"([^"]+)"', text).group(1)
'def'

Lookaround: (?<=pattern) and (?=pattern)

text = '''abc"def"ghi'''
re.findall(r'(?<=")[^"]+(?=")', text)
# Result: ['def']

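Find lines that start with certain strings

# For example, find the lines that start with +++, ---, or index.
# Here lst is a list of lines and content is the full text.
# Method 1: match line by line
for i in lst:
    if re.match(r"(---|\+\+\+|index).*", i):
        print(i)
# Method 2: match everything in one pass
re.findall(r'^(?:\+\+\+|---|index).*$', content, re.M)
# Method 2, compact version (note that [-\+]{3} also accepts mixed runs such as +-+)
re.findall(r'^(?:[-\+]{3}|index).*$', content, re.M)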
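The two methods can be tried on concrete input. The sketch below uses a made-up fragment shaped like diff output; the file name and hashes are invented for illustration.

```python
import re

# A made-up fragment shaped like diff output, for illustration only.
content = """diff --git a/demo.txt b/demo.txt
index 1111111..2222222 100644
--- a/demo.txt
+++ b/demo.txt
@@ -1 +1 @@
-old line
+new line"""

# Method 2: one pass over the whole text with re.M.
pattern = re.compile(r'^(?:\+\+\+|---|index).*$', re.M)
headers = pattern.findall(content)
print(headers)

# Method 1 on the same data: re.match anchors at the start of each line.
by_line = [line for line in content.splitlines()
           if re.match(r'---|\+\+\+|index', line)]
print(by_line == headers)
```

Lines such as "-old line" are skipped because they begin with a single dash rather than three.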

Contains / does not contain

(Reference: Using regular expressions to exclude specific strings)

Sample text

>>> print(text)
www.sina.com.cn
www.educ.org
www.hao.cc
www.baidu.com
www.123.com

sina.com.cn
educ.org
hao.cc
baidu.com
123.com

Match lines that start with www

>>> re.findall(r'^www.*$', text, re.M)
['www.sina.com.cn', 'www.educ.org', 'www.hao.cc', 'www.baidu.com', 'www.123.com']

Match lines that do not start with www

>>> re.findall(r'^(?!www).*$', text, re.M)
['', 'sina.com.cn', 'educ.org', 'hao.cc', 'baidu.com', '123.com']

Match lines that end with cn

>>> re.findall(r'^.*?cn$', text, re.M)
['www.sina.com.cn', 'sina.com.cn']

Match lines that do not end with com

>>> re.findall(r'^.*?(?<!com)$', text, re.M)
['www.sina.com.cn', 'www.educ.org', 'www.hao.cc', '', 'sina.com.cn', 'educ.org', 'hao.cc']

Match lines that contain com

>>> re.findall(r'^.*?com.*?$', text, re.M)
['www.sina.com.cn', 'www.baidu.com', 'www.123.com', 'sina.com.cn', 'baidu.com', '123.com']

Match lines that do not contain com

>>> re.findall(r'^(?!.*com).*$', text, re.M)
['www.educ.org', 'www.hao.cc', '', 'educ.org', 'hao.cc']
>>> re.findall(r'^(?:(?!com).)*?$', text, re.M)
['www.educ.org', 'www.hao.cc', '', 'educ.org', 'hao.cc']
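The negative patterns in this section can be rerun as one script. This is a sketch that assumes the sample text contains exactly one empty line in the middle, which is what the empty strings in the results suggest; the variable names are made up for illustration.

```python
import re

text = """www.sina.com.cn
www.educ.org
www.hao.cc
www.baidu.com
www.123.com

sina.com.cn
educ.org
hao.cc
baidu.com
123.com"""

# Negative lookahead at the start of the line: does not start with www.
no_www = re.findall(r'^(?!www).*$', text, re.M)
# Negative lookbehind right before the end: does not end with com.
no_com_end = re.findall(r'^.*?(?<!com)$', text, re.M)
# Lookahead that scans the rest of the line: no com anywhere.
no_com = re.findall(r'^(?!.*com).*$', text, re.M)

# Requiring at least one character (.+ instead of .*) drops the empty line.
no_com_nonempty = re.findall(r'^(?!.*com).+$', text, re.M)
print(no_com_nonempty)  # ['www.educ.org', 'www.hao.cc', 'educ.org', 'hao.cc']
```

The empty string in the original results comes from the blank line, which trivially satisfies every "does not ..." condition.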

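Match everything, drop a part

Use a group to keep the first level of a URL (scheme and host) and drop the path levels after it.

# Method 1
>>> strr = 'http://www.baidu.com/abc/d.html'
>>> re.findall(r'(http://.+?)/.*', strr)
['http://www.baidu.com']
# Method 2
>>> re.sub(r'(http://.+?)/.*', r'\1', strr)
'http://www.baidu.com'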
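One limitation of the pattern above is that it needs a slash after the host, so a URL with no path gives no match. The sketch below shows one possible variant; site_root is a hypothetical helper name, and accepting https is an added assumption.

```python
import re

def site_root(url):
    """Return scheme plus host, or None when the URL does not look like http(s).

    Hypothetical helper for illustration: the host is taken to be
    everything up to the first '/', '?' or '#'.
    """
    m = re.match(r'https?://[^/?#]+', url)
    return m.group(0) if m else None

print(site_root('http://www.baidu.com/abc/d.html'))  # http://www.baidu.com
print(site_root('http://www.baidu.com'))             # http://www.baidu.com

# The original pattern finds nothing when there is no path.
print(re.findall(r'(http://.+?)/.*', 'http://www.baidu.com'))  # []
```

For anything beyond quick text munging, urllib.parse.urlsplit from the standard library is probably the more robust tool.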

Two examples that help in understanding regex groups

# Example 1
>>> strr = 'A/B/C'
>>> re.sub(r'(.)/(.)/(.)', r'xx', strr)
'xx'
>>> re.sub(r'(.)/(.)/(.)', r'\1xx', strr)
'Axx'
>>> re.sub(r'(.)/(.)/(.)', r'\2xx', strr)
'Bxx'
>>> re.sub(r'(.)/(.)/(.)', r'\3xx', strr)
'Cxx'
# Example 2
>>> text = 'AA,BB:222'
>>> re.search(r'(.+),(.+):(\d+)', text).group(0)
'AA,BB:222'
>>> re.search(r'(.+),(.+):(\d+)', text).group(1)
'AA'
>>> re.search(r'(.+),(.+):(\d+)', text).group(2)
'BB'
>>> re.search(r'(.+),(.+):(\d+)', text).group(3)
'222'

Extract div tags that contain the string hello

>>> content
'<div id="abc"><div id="hello1"><div id="def"><div id="hello2"><div id="hij">'
>>>
>>> p = r'<div((?!div).)+hello.+?>'
>>> re.search(p, content).group()
'<div id="hello1">'
>>> re.findall(p, content)
['"', '"']
>>> for m in re.finditer(p, content):
...     print(m.group())
...
<div id="hello1">
<div id="hello2">
>>>
>>> p = r'<div[^>]+hello.+?>'
>>> re.search(p, content).group()
'<div id="hello1">'
>>> re.findall(p, content)
['<div id="hello1">', '<div id="hello2">']
>>> for m in re.finditer(p, content):
...     print(m.group())
...
<div id="hello1">
<div id="hello2">
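The ['"', '"'] result from re.findall with the first pattern is explained by the capturing group: when a pattern has one group, findall returns that group's last capture instead of the whole match. A minimal sketch of the usual workaround, a non-capturing group, follows.

```python
import re

content = ('<div id="abc"><div id="hello1"><div id="def">'
           '<div id="hello2"><div id="hij">')

# Capturing group: findall returns the last text captured by the group.
captured = re.findall(r'<div((?!div).)+hello.+?>', content)
print(captured)   # ['"', '"']

# Non-capturing group (?:...): findall returns the whole match.
whole = re.findall(r'<div(?:(?!div).)+hello.+?>', content)
print(whole)      # ['<div id="hello1">', '<div id="hello2">']
```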

If the tool in use supports positive lookahead and also allows capturing parentheses inside the lookahead, then atomic grouping and possessive quantifiers can be emulated.
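The emulation mentioned here can be sketched in Python as (?=(X))\1: the lookahead captures what X matches, and the backreference then consumes exactly that text without giving any of it back. The digit patterns below are made-up illustrations.

```python
import re

# Ordinary greedy quantifier: \d+ gives back one digit so the final \d can match.
plain = re.match(r'\d+\d', '123')
print(plain.group(0))    # 123

# Emulated atomic group: the lookahead captures the longest digit run,
# \1 consumes exactly that text, and nothing is left for the trailing \d.
atomic = re.match(r'(?=(\d+))\1\d', '123')
print(atomic)            # None

# When the following token does not compete for digits, the match succeeds.
ok = re.match(r'(?=(\d+))\1x', '123x')
print(ok.group(0))       # 123x
```

Python 3.11 and later also accept atomic groups, (?>...), and possessive quantifiers such as \d++ directly in the re module, so the emulation is mainly useful on older versions or in other tools.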

Thousands separators

Python

>>> format(23456789, ',')
'23,456,789'
# Using positive lookbehind together with positive lookahead
>>> re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', '2345678')
'2,345,678'
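The $ anchor in the pattern above means it only works when the number sits at the very end of the string. A possible variation, sketched here with a hypothetical helper add_commas, swaps $ for (?!\d) so that digit runs anywhere in the text are grouped.

```python
import re

def add_commas(s):
    # (?!\d) replaces $ so that digit runs in the middle of a string are grouped too.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+(?!\d))', ',', s)

print(add_commas('2345678'))                  # 2,345,678
print(add_commas('total 1234567 of 89000'))   # total 1,234,567 of 89,000
```

This sketch treats every digit run the same way, so digits after a decimal point would be grouped as well; for real numbers, format(value, ',') is the simpler choice.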

JavaScript

// Using positive lookahead only (older JavaScript engines do not support lookbehind; it was added in ES2018)
// Result: "23,456,789"
"23456789".replace(/(\d)(?=(?:\d{3})+$)/g, "$1,")

Parentheses nested one level deep (balanced groups)

>>> import re
>>> line = r'盖层(汽油) 塔里木盆地(学科: 盖层(油气) 学科: 评价) 塔里木盆地'
>>> re.findall(r'\([^()]*(\([^()]*\)[^()]*)*\)', line)
['', '(油气) 学科: 评价']
>>> re.findall(r'\([^()]*(?:\([^()]*\)[^()]*)*\)', line)
['(汽油)', '(学科: 盖层(油气) 学科: 评价)']

Match Chinese characters

>>> import regex  # third-party regex library, not the standard re module
>>> regex.findall(r'\p{Han}', '孔子/现代价值/Theory of "Knowing"')
['孔', '子', '现', '代', '价', '值']
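When the third-party regex library is not available, a rough approximation with the standard re module is an explicit code point range. The sketch below assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) matters; unlike \p{Han}, it misses the extension blocks.

```python
import re

sample = '孔子/现代价值/Theory of "Knowing"'

# Approximation of \p{Han}: the basic CJK Unified Ideographs block only.
han = re.findall(r'[\u4e00-\u9fff]', sample)
print(han)   # ['孔', '子', '现', '代', '价', '值']
```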

A fun combination of regex and lambda

import re
dic = {'user': 'walker', 'domain': '163.com'}
rule = r'%user%@%domain%'
# Each %name% placeholder is replaced by dic[name].
email = re.sub('%[^%]*%', lambda matchobj: dic[matchobj.group(0).strip('%')], rule)
print('email: %s' % email)      # email: walker@163.com
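The lambda above raises KeyError when a placeholder has no entry in the dictionary. One possible variation, sketched with a hypothetical helper fill, captures the name with a group and leaves unknown placeholders untouched.

```python
import re

dic = {'user': 'walker', 'domain': '163.com'}

def fill(rule, values):
    """Replace %name% placeholders; unknown names are left as they are."""
    return re.sub(r'%([^%]*)%',
                  lambda m: values.get(m.group(1), m.group(0)),
                  rule)

print(fill('%user%@%domain%', dic))   # walker@163.com
print(fill('%user%@%host%', dic))     # walker@%host%
```

Using group(1) avoids the strip('%') call, and dict.get with the full match as the default keeps the template readable when a key is missing.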