On web scraping: Python Web Scraping, Day 7

Regular Expressions (Part 2)

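Position matching and non-greedy matching

Position matching

Sometimes a match must occur at a particular position, for example at the start of the string, at the end of the string, or at a word boundary.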

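Expression	Matches
^	Matches at the start of the string; the symbol itself consumes no characters
$	Matches at the end of the string (or just before a trailing newline); the symbol itself consumes no characters
\b	Matches a word boundary, i.e. the position between a \w character and a non-\w character (or the start/end of the string); the symbol itself consumes no characters
\B	Matches a non-word boundary, i.e. the gap between two characters that are both \w or both non-\w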
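
The anchors in the table above can be tried out with a short sketch. The sample string `text` is invented for illustration, and `re.finditer` is used only to report where each match starts.

```python
import re

# Hypothetical sample text, chosen so that "cat" appears as a whole word
# and also inside longer words.
text = "cat concatenate bobcat cat"

# ^ anchors at the start of the string, $ at the end.
starts_with_cat = re.search(r"^cat", text) is not None
ends_with_cat = re.search(r"cat$", text) is not None

# \bcat\b: "cat" with a word boundary on both sides (a whole word).
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat\B: "cat" with word characters on both sides (inside "concatenate").
inner = [m.start() for m in re.finditer(r"\Bcat\B", text)]

print(starts_with_cat, ends_with_cat)  # True True
print(whole_words)                     # [0, 23]
print(inner)                           # [7]
```

Note that "bobcat" matches neither pattern: it has a word character before "cat" but a boundary after it.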

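Non-greedy matching

1. Greedy matching: definition
When a quantifier repeats, a regular expression by default matches as many characters as possible. This is called greedy matching.
2. Non-greedy matching: definition
Matches as few characters as possible; it is enabled by appending ? to a quantifier (*?, +?, {m,n}?).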
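
As a minimal sketch of the two definitions, the snippet below compares each quantifier with its ?-suffixed form. The input strings are made up for illustration.

```python
import re

digits = "1234567"

# Greedy: each match takes as many digits as allowed (up to 4).
greedy = re.findall(r"\d{2,4}", digits)

# Non-greedy: each match stops at the minimum of 2 digits.
lazy = re.findall(r"\d{2,4}?", digits)

# The same applies to + and *.
plus_greedy = re.match(r"a+", "aaa").group()
plus_lazy = re.match(r"a+?", "aaa").group()
star_lazy = re.match(r"a*?", "aaa").group()

print(greedy)                                   # ['1234', '567']
print(lazy)                                     # ['12', '34', '56']
print(plus_greedy, plus_lazy, repr(star_lazy))  # aaa a ''
```

The trailing 7 is dropped in the non-greedy case because a single digit cannot satisfy the minimum of two.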

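Common re module methods

Method	Description	Return value
compile	Builds a pattern object from a string containing a regular expression	Compiled pattern object
search	Scans the string for the pattern	The first match object, or None
match	Matches the pattern at the start of the string	The match object found at the start of the string, or None
split	Splits the string at each match of the pattern	List of the resulting substrings
findall	Lists every match of the pattern in the string	List of all matched strings
sub	Replaces every match of pat in the string with repl	The new string after replacement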

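Flags (matching modes)

The default is flags=0.

Flag	Description
re.A	ASCII-only matching
re.I	Case-insensitive matching
re.L	Locale-aware matching
re.M	Multi-line mode; changes ^ and $ so that they also match at the start and end of each line
re.S	Makes the . wildcard match every character, including newline; useful when a match must span several lines
re.U	Interprets characters according to the Unicode character set; affects \w, \W, \b, \B
re.X	Verbose mode: allows a more flexible layout (whitespace and comments) so that the pattern is easier to read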

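Grouping

1. Definition
Grouping means picking the needed parts out of what has already been matched, which amounts to a second round of filtering.
2. Process
(1) Groups are created with parentheses ().
(2) Group contents are retrieved with group() and groups().
3. Note
Several important re methods behave differently when the pattern contains groups, so each case needs to be handled separately.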
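
The note above can be made concrete with a small sketch showing how findall, split, and sub each react to parentheses in the pattern. The sample string is only an illustration.

```python
import re

s = "apple price is $22, banana price is $33"

# findall without a group returns the whole matches.
whole = re.findall(r"\$\d+", s)

# With one group, findall returns only the group contents.
one_group = re.findall(r"\$(\d+)", s)

# With several groups, findall returns a list of tuples.
pairs = re.findall(r"(\w+) price is \$(\d+)", s)

# split drops the separator, unless the separator is captured in a group.
plain_split = re.split(r", ", s)
kept_split = re.split(r"(, )", s)

# sub can refer back to a group with \1 in the replacement.
rewritten = re.sub(r"\$(\d+)", r"\1 USD", s)

print(whole)        # ['$22', '$33']
print(one_group)    # ['22', '33']
print(pairs)        # [('apple', '22'), ('banana', '33')]
print(plain_split)  # ['apple price is $22', 'banana price is $33']
print(kept_split)   # ['apple price is $22', ', ', 'banana price is $33']
print(rewritten)    # apple price is 22 USD, banana price is 33 USD
```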

Code

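(1) Code: greedy

import re

# s is the data to match against
s = '<div>abc</div><div>def</div>'

# Goal: <div>abc</div>
# . matches any single character except a newline; * repeats it zero or more times
ptn1 = r'<div>.*</div>'
result1 = re.match(ptn1, s)
print(result1.group())  # <div>abc</div><div>def</div>

# Greedy matching keeps going and returns the longest possible match.
# Non-greedy matching takes as few characters as possible. To get it,
# append ? to the quantifier: .*?  .+?  {m,n}?
ptn2 = r'<div>.*?</div>'
result2 = re.match(ptn2, s)
print(result2.group())  # <div>abc</div>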

(2) Code: re_examples_1

import re

# Position matching
# result1 = re.match(r'\d{11}', '12345678910')
# print(result1.group())
# result2 = re.match(r'1[2-9]\d{9}', '12345678910')
# print(result2.group())
# result3 = re.match(r'1[2-9]\d{9}', '11345678910')
# print(result3.group())   # AttributeError: there is no match, so result3 is None
# result4 = re.match(r'^1[2-9]\d{9}$', '12345678910')
# print(result4.group())
# result5 = re.match(r'^1[2-9]\d{9}', '123456789101')
# print(result5.group())   # 12345678910 (only the first 11 digits are matched)
# result6 = re.match(r'^1[2-9]\d{9}$', '123456789101')
# print(result6.group())   # AttributeError: $ requires the string to end after exactly 11 digits
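
Several of the calls above fail because match() returns None and .group() is then called on it. One way to avoid that, sketched here under the assumption that the whole string must be an 11-digit number, is to test the result before using it. The helper name `is_phone` is hypothetical; `re.fullmatch` is a standard-library function that behaves like wrapping the pattern in ^ and $.

```python
import re

# Same pattern as in the examples above: 1, then 2-9, then nine digits.
PHONE = re.compile(r"1[2-9]\d{9}")

def is_phone(text):
    """Return True only if the entire string matches the pattern."""
    # fullmatch returns a match object or None, so compare against None
    # instead of calling .group() unconditionally.
    return PHONE.fullmatch(text) is not None

for candidate in ["12345678910", "11345678910", "123456789101"]:
    print(candidate, is_phone(candidate))
# 12345678910 True
# 11345678910 False
# 123456789101 False
```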

(3) Code: re_examples_2

import re

def fn(ptn, lst):
    for i in lst:
        result = re.match(ptn, i)
        if result:
            print('Match succeeded! Result:', result.group())
        else:
            print('Match failed!')

# . matches any character except a newline
# lst = ['abc', 'abbb', 'other', 'add', 'aba']
# ptn = r'ab.'
# fn(ptn, lst)

# [] matches any one of the characters listed inside
# lst = ['man', 'mbn', 'mdn', 'mun', 'nba']
# ptn = r'm[abcf]n'
# fn(ptn, lst)

# ^ inside [] negates the set
# lst = ['man', 'mbn', 'mdn', 'mun', 'nba']
# ptn = r'm[^abcf]n'
# fn(ptn, lst)

# \d matches one digit [0-9]
# lst = ['py5', 'py24', 'pyxxx', 'other']
# ptn = r'py\d'
# fn(ptn, lst)
# lst = ['7.1', '569', '4.51', '555']
# ptn = r'\d\.\d'
# fn(ptn, lst)

# \D matches a non-digit (the complement of \d)
# lst = ['py.5', 'py2', 'pyxxx', 'other']
# ptn = r'py\D'
# fn(ptn, lst)

# \s matches whitespace (space, \t, \n, \f, ...)
# lst = ['hello world', 'helloxxxx', 'hello,world', 'hello\fworld', 'hello\tworld', 'hello\nworld']
# ptn = r'hello\sworld'
# fn(ptn, lst)

# \S matches a non-whitespace character
# lst = ['hello world', 'helloxxxx', 'hello,world', 'helloxworld']
# ptn = r'hello\Sworld'
# fn(ptn, lst)

# \w matches a letter, digit, or underscore
# lst = ['1-age', 'a-age', '&-age', '_-age', ' -age']
# ptn = r'\w-age'
# fn(ptn, lst)

# \W matches any character that \w does not
# lst = ['1-age', 'a-age', '&-age', '_-age', ' -age']
# ptn = r'\W-age'
# fn(ptn, lst)

# * zero or more repetitions
# lst = ['hello', 'java', 'python', 'h', 'xxx']
# ptn = r'[hp][a-z]*'
# fn(ptn, lst)

# + at least one repetition
# lst = ['hello', 'java', 'python', 'h', 'xxx']
# ptn = r'[hp][a-z]+'
# fn(ptn, lst)

# {n} exactly n repetitions
# lst = ['hello', 'java', 'python', 'h', 'xxxxxxxx']
# ptn = r'\w{6}'
# fn(ptn, lst)

# {m,n} at least m and at most n repetitions.
# Because re.match only anchors at the start, a longer string still matches
# as long as its beginning fits the pattern.
# Quantifiers in Python are greedy by default.
# lst = ['hello', 'jav', 'python', 'h', 'xxxxxxxx']
# ptn = r'\w{3,4}'
# fn(ptn, lst)

# Position matching (the dot is escaped so that it matches a literal '.')
# lst = ['123@qq.com', 'abc@yy.com', 'xxx@qq.com.cn']
# ptn = r'\w+@qq\.com'
# fn(ptn, lst)
lst = ['123@qq.com', 'abc@yy.com', 'xxx@qq.com.cn']
ptn = r'\w+@qq\.com$'  # $ anchors the end, so 'xxx@qq.com.cn' no longer matches
fn(ptn, lst)

(4) Code: re_examples_3

import re

# compile(pattern, flags=0)
# pat = re.compile(r'abc')   # the regular expression
# print(pat)   # re.compile('abc')
# x1 = pat.match('abc123')
# print(x1)   # <re.Match object; span=(0, 3), match='abc'>
# x2 = pat.match('abc123').group()  # 'abc123' is the string to match against
# print(x2)   # abc
# x3 = pat.match('ABC123').group()
# print(x3)   # AttributeError: matching is case-sensitive, so match() returns None; use re.I to ignore case
# pat2 = re.compile(r'abc', re.I)
# x4 = pat2.match('ABC123').group()
# print(x4)    # ABC

# match(pattern, string, flags=0)
# match only tries at the start of the string; it returns a match object on success, otherwise None
# y1 = re.match(r'abc', '123abc456abc789').group()
# print(y1)  # AttributeError: no match at the start of the string
# y2 = re.match(r'abc', 'abc456abc789').group()
# print(y2)  # abc
# y3 = re.match(r'abc', 'abcabcabc789').group()
# print(y3)  # abc (only one 'abc': the pattern has no quantifier, so greediness does not apply)

# search(pattern, string, flags=0)
# Succeeds if the pattern occurs anywhere in the string; returns only the first match
# z1 = re.search(r'abc', '123abc456abc789').group()
# print(z1)
# ^ anchors the start, so the string must begin with abc
# z2 = re.search(r'^abc', '123abc456abc789').group()
# print(z2)   # AttributeError: no match
# z3 = re.search(r'^abc', 'abc456abc789').group()
# print(z3)

# findall(pattern, string, flags=0)
# Finds every substring that matches the pattern and returns them as a list
# r1 = re.findall(r'abc', '123abc456abc789').group()
# print(r1)   # AttributeError: a list has no group() method
# r2 = re.findall(r'abc', '123abc456abc789')
# print(r2)  # ['abc', 'abc']; iterate over it or index into it
# r3 = re.findall(r'abcd', '123abc456abc789')
# print(r3)  # []: no match returns an empty list

# split(pattern, string, maxsplit=0, flags=0)
# s = '1*2-3+4/5'   # Goal: pull out the digits; there are several ways
# findall
# e1 = re.findall(r'\d{1,}', s)
# print(e1)
# e2 = re.findall(r'\d+', s)
# print(e2)
# e3 = re.findall(r'\d', s)
# print(e3)
# e4 = re.findall(r'\d*', s)
# print(e4)  # the output contains empty strings, because \d* can match zero digits
# split
# s1 = re.split(r'[\+\-\*\/]', s)  # inside [] only - strictly needs escaping; the other backslashes are harmless
# print(s1)
# maxsplit: maximum number of splits
# u1 = re.split(r'[\+\-\*\/]', s, maxsplit=2)
# print(u1)  # ['1', '2', '3+4/5'] (two cuts)
# u2 = re.split(r'[\+\-\*\/]', s, maxsplit=0)
# print(u2)  # maxsplit=0 means no limit, same as omitting it
# u3 = re.split(r'[\+\-\*\/]', s, maxsplit=3)
# print(u3)  # ['1', '2', '3', '4/5']; any maxsplit >= 4 gives the same result as no limit

# sub(pattern, repl, string, count=0, flags=0)
s = "i can't stand my poor python, i like python"
l1 = re.sub(r'i', 'I', s)
print(l1)  # every 'i' becomes 'I', including the one in "like": I can't stand my poor python, I lIke python

(5) Code: re_group

import re

# s is the data to match against
s = 'apple price is $22, banana price is $33'

# Goal: extract $22 and $33
# Grouping filters the matched data a second time
result1 = re.search(r'.*\$\d+.*\$\d+', s)
print(result1)  # <re.Match object; span=(0, 39), match='apple price is $22, banana price is $33'>

result2 = re.search(r'.*(\$\d+).*(\$\d+)', s)  # group with ()
print(result2.group())
print(result2.group(0))
print(result2.group(1))
print(result2.group(2))
print(result2.groups())
print(result2.groups(0))
print(result2.group()[0])  # the character at index 0 of the whole match ('a'), not the same as group(0)

"""
result.group() / result.group(0)    the entire match
result.group(1)                     $22, the first group
result.group(2)                     $33, the second group
result.groups() / result.groups(0)  ('$22', '$33'), all groups as a tuple;
                                    the argument is only a default for groups that did not take part in the match
"""
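
Because groups() and groups(0) print the same tuple above, it may help to see a case where they differ. The sketch below uses a made-up pattern with an optional second group (a decimal part after the dollar amount); this pattern is an illustration, not part of the original example.

```python
import re

# Hypothetical pattern: group 1 is the dollar amount, group 2 is an
# optional decimal part that may not take part in the match.
ptn = r'(\$\d+)(?:\.(\d+))?'

m1 = re.search(ptn, 'price is $22')
m2 = re.search(ptn, 'price is $22.50')

# Group 2 did not participate in m1, so groups() fills it with None ...
print(m1.groups())      # ('$22', None)
# ... and the argument of groups() replaces that None with a default.
print(m1.groups(0))     # ('$22', 0)
print(m1.groups('00'))  # ('$22', '00')

# When every group participates, the default is never used.
print(m2.groups(0))     # ('$22', '50')
```

So group(0) selects the whole match, while the 0 in groups(0) is only a fallback value.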