Beautifulsoup

[CTF] A Breadth-First BeautifulSoup Site Crawler

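I am used to programming in Python 2, so I also use BeautifulSoup under Python 2. Python 2 is reportedly losing support next year, which is rather sad.

0x01 The challenge

As the challenge screenshot showed, five flags are hidden in the page source of a site; they have to be found and concatenated, and an example for flagA was given. The site turns out to be a demo instance of ctf-wiki. Anyone who knows that site knows how large it is, so searching page by page by hand is not realistic. A crawler is needed (the challenge name hints at this too).

0x02 Approach

I decided to write the crawler as a breadth-first search (BFS). Readers who are not familiar with BFS can look it up. The steps are:

1. Keep two lists (they can also be seen as queues): visiting_urls for links still to be requested and visited_urls for links already requested.
2. Take one link from visiting_urls and fetch the page source with requests.get.
3. Search the source for the flag marker with a regular expression.
4. Use BeautifulSoup to collect every a tag on the page and append the links that qualify to visiting_urls.
5. If visiting_urls is not empty, go back to step 2.

Two problems need attention.

Deduplication: while crawling, URLs found in different places will often point to the same page, and the same page should not be requested twice. I handle this in two ways:

- Maintain the visiting_urls and visited_urls lists and compare every new URL against the ones already seen.
- Use a property of mkdocs sites: a link containing "../" points back up the tree, so such links never need to be requested again.

Regular expressions: I did not think hard about this part and just wrote rules that work. The challenge needs two of them:

- Matching flags: flag[ABCDE]. The goal is to match the flag marker, not the whole flag, because I do not know whether a flag contains unusual characters and I want to avoid missing one.
- Matching URLs: [\w/]+index\.html. This matches URLs whose path consists of letters, digits, underscores and slashes (so no "..") and ends in "index.html".

That completes the task.

0x03 Full script

    # coding=utf-8
    from __future__ import print_function  # print() then works on Python 2 and 3

    import re

    import requests
    from bs4 import BeautifulSoup

    s = requests.session()
    s.keep_alive = False  # kept from the original; newer requests releases may ignore it

    flagre = re.compile(r'flag[ABCDE]')
    urlre = re.compile(r'[\w/]+index\.html')
    base_url = 'http://23.236.125.55:1000/ctf-wiki/'
    flagA_url = 'http://23.236.125.55:1000/ctf-wiki/assembly/mips/readme/index.html'
    visiting_urls = ['http://23.236.125.55:1000/ctf-wiki/index.html']
    visited_urls = []

    def find_flag(url, html):
        # Report every page whose source contains a flag marker.
        flist = flagre.findall(html)
        if flist:
            print(flist, url)

    def BFS():
        # Take the oldest pending link, so pages are visited breadth-first.
        url = visiting_urls.pop(0)
        visited_urls.append(url)
        r = s.get(url)
        # r.encoding = 'utf-8'
        find_flag(url, r.text)
        soup = BeautifulSoup(r.text, 'lxml')
        # href=True skips anchors without an href, which would raise KeyError.
        for a in soup.find_all('a', href=True):
            link = a['href']
            if urlre.findall(link) and '..' not in link:
                new_url = base_url + link
                if new_url not in visited_urls and new_url not in visiting_urls:
                    visiting_urls.append(new_url)

    if __name__ == '__main__':
        while visiting_urls:
            BFS()

As mentioned above, the script only reports the pages that contain a flag marker, not the flags themselves, so those pages still have to be opened by hand to read the flags. Printing the flags directly would require a better regular expression.

Note 1: setting r.encoding = 'utf-8' after fetching the page caused an EOFError for reasons I could not determine. I added it hoping to match Chinese pages, and it only cost me a long time thinking the matching had failed.

Note 2: compared with calling requests.get() directly, requests.session() avoids opening too many HTTP connections, which can otherwise prevent new connections from being established. Reference: https://segmentfault.com/q/10…

The run produced the list of pages containing flags. Concatenate the flags at the end and the challenge is done.

...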
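The BFS steps described above can also be sketched in Python 3 with a deque as the queue and a set for deduplication. This is a minimal sketch, not the author's script: the names crawl, fetch and FLAG_RE are hypothetical, the flagX{...} flag format is an assumption (the post does not state the real format), and relative links are resolved with urljoin so that "../" back-links collapse onto already-seen URLs instead of being filtered by string matching.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin

from bs4 import BeautifulSoup

# Assumed flag format "flagA{...}"; adjust to the real challenge format.
FLAG_RE = re.compile(r"flag[A-E]\{[^}]*\}")


def crawl(start_url, fetch, flag_re=FLAG_RE):
    """Breadth-first crawl starting at start_url.

    fetch(url) must return the page source as a string.
    Returns a dict mapping each flag to the first URL where it was seen.
    """
    scope = urljoin(start_url, ".")   # only follow links below this directory
    queue = deque([start_url])        # pages waiting to be requested
    seen = {start_url}                # every URL ever queued (deduplication)
    found = {}
    while queue:
        url = queue.popleft()         # FIFO order gives breadth-first traversal
        html = fetch(url)
        for flag in flag_re.findall(html):
            found.setdefault(flag, url)
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            # Resolve relative links and drop #fragments before comparing.
            link, _fragment = urldefrag(urljoin(url, a["href"]))
            if (link.startswith(scope) and link.endswith("index.html")
                    and link not in seen):
                seen.add(link)
                queue.append(link)
    return found
```

For a live site, fetch could be something like lambda u: session.get(u, timeout=10).text with a requests session; passing the fetcher in as a parameter keeps the traversal logic testable without network access.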

Python Crawler Notes 4: Using BeautifulSoup

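Introduction to BeautifulSoup

Like lxml, BeautifulSoup is an HTML/XML parser. Its main job is likewise to parse HTML/XML and extract data from it.

Comparison of parsing tools

Tool                 Speed     Difficulty
Regular expressions  Fastest   Hard
BeautifulSoup        Slow      Easiest
lxml                 Fast      Easy

lxml only traverses part of the document, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory costs are much higher and its performance is lower than lxml's.

Installation

My environment is Python 3.6.5. On Windows, install with pip from cmd:

    pip3 install beautifulsoup4

Testing

Import BeautifulSoup in a Python shell. If no error appears, the installation succeeded.

    >>> from bs4 import BeautifulSoup
    >>>

BeautifulSoup objects

BeautifulSoup turns a complex HTML document into a tree in which every node is a Python object. All objects fall into four kinds:

- Tag
- NavigableString
- BeautifulSoup
- Comment

A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a Tag object; it is a special Tag. A Comment object is a special kind of NavigableString whose output does not include the comment markers.

Tag

A Tag can be understood simply as one of the tags in an HTML document, for example:

    <head><title>The Dormouse's story</title></head>
    <ul><li class="item-0"><a href="link1.html">first item</a></li></ul>

In the HTML above, head, title, ul and li are HTML tags (node names). A tag together with its content is a Tag.

Getting Tags

    # Import the module
    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create the BeautifulSoup object with the lxml parser
    soup = BeautifulSoup(html, 'lxml')
    # prettify() formats the content of soup
    print(soup.prettify())
    # soup.title selects the title node
    print(soup.title)
    # <title>The Dormouse's story</title>
    print(type(soup.title))
    # <class 'bs4.element.Tag'>
    print(soup.head)
    # <head><title>The Dormouse's story</title></head>
    print(soup.p)
    # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Note: soup followed by a node name returns that node, and the object is of type bs4.element.Tag. Only the first matching node in the document is returned. The code above contains several p tags, but soup.p finds only the first one.

A Tag has two important attributes, name and attrs. Once a node is selected, name gives the node's name and attrs gives its attributes as a dictionary.

    print(soup.name)
    # [document]   the soup object itself is special; its name is [document]
    print(soup.head.name)
    # head         for any inner tag, the value is the tag's own name
    print(soup.p.attrs)
    # {'class': ['title'], 'name': 'dromouse'}
    # All attributes of the p tag are printed here, as a dictionary.

    # The three calls below all read a value from that dictionary and are equivalent
    print(soup.p.get('class'))
    # ['title']
    print(soup.p['class'])
    # ['title']
    print(soup.p.attrs['class'])
    # ['title']

    # Attributes and content can also be modified
    soup.p['class'] = "newClass"
    print(soup.p)
    # <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

NavigableString

Getting a Tag gets the whole node. To get only the text inside the node, use .string.

    # Get the text inside the node
    print(soup.p.string)
    # The Dormouse's story
    print(type(soup.p.string))
    # <class 'bs4.element.NavigableString'>

Traversing the document tree

When selecting nodes, you can also pick one node first and then use it as a base to select its children, parent, descendants and so on. The commonly used ways are described below.

Direct children: the .contents and .children attributes

.contents

A tag's .contents attribute returns its direct children as a list. The example below uses the head node as the base; .contents selects its child title and returns it in a list.

    print(soup.head.contents)
    # [<title>The Dormouse's story</title>]

Because the result is a list, an element can be fetched by index.

    print(soup.head.contents[0])
    # <title>The Dormouse's story</title>

.children

Unlike contents, the children attribute does not return a list but an iterator, which can be consumed with a for loop.

    print(soup.head.children)
    # <list_iterator object at 0x0000017415655588>
    for i in soup.head.children:
        print(i)
    # <title>The Dormouse's story</title>

All descendants: the .descendants attribute

The two attributes above only reach the direct children of the base node. To get every descendant of a node, use descendants. It returns a generator.

    print(soup.descendants)
    # <generator object descendants at 0x0000028FFB17C4C0>

There are other attributes for finding the parent and ancestor nodes; I do not cover them here because they are rarely needed.

Searching the document tree

BeautifulSoup provides several search methods (find_all, find and others). Call a method with search arguments and it returns the content you want. It works like a search engine: Baidu/Google is the search method, the query is the search argument, and the returned pages are the content you want. The most commonly used method, find_all, is described below.

The find_all method

Purpose: find every element that matches the conditions and return them as a list.
API: find_all(name, attrs, recursive, text, **kwargs)

1. name

The name argument finds elements by node name.

A. Passing a string

The simplest filter is a string. When a string is passed to a search method, BeautifulSoup looks for tags whose name matches it exactly. The example below finds all p tags in the document.

    print(soup.find_all('p'))
    # Writing it with the keyword is usually clearer
    print(soup.find_all(name='p'))

B. Passing a regular expression

When a regular expression is passed, Beautiful Soup matches tag names with the expression's match() method. The example below finds all tags whose name starts with p.

    import re
    print(soup.find_all(re.compile('^p')))

C. Passing a list

When a list is passed, BeautifulSoup returns everything that matches any element of the list. The code below finds the head tag and the b tag.

    print(soup.find_all(['head', 'b']))
    # [<head><title>The Dormouse's story</title></head>, <b>The Dormouse's story</b>]

2. attrs

The attrs argument of find_all searches by node attributes. The value passed is a dictionary. For example, to find the node with id=link1:

    print(soup.find_all(attrs={'id': 'link1'}))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Common attributes do not have to go through attrs; they can be passed directly as keyword arguments, for example id and class_ (class is a Python keyword, so an underscore is appended):

    print(soup.find_all(id='link1'))
    print(soup.find_all(class_='sister'))

Output:

    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3. text

The text argument searches the strings in the document. Like name, it accepts a string, a regular expression or a list. The code below finds the strings that contain "story" and returns them.

    print(soup.find_all(text=re.compile('story')))
    # ["The Dormouse's story", "The Dormouse's story"]

The find method

The difference between find and find_all:

- find_all: finds all elements that match the conditions and returns a list.
- find: finds only the first match and returns that single element (a Tag, or a string when searching by text).

The arguments work much the same as for find_all. Example:

    print(soup.find(name='p'))  # the first p tag
    print(soup.find(text=re.compile('story')))  # the first string that contains "story"

Output:

    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    The Dormouse's story

That is all for using BeautifulSoup. In my own experience, knowing find_all well is enough for everyday use (=.=~)

Reference

崔庆才, [Python3网络爬虫开发实战]: 4.2 Using Beautiful Soup

...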
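The search methods described above can be tried on a short document. The following is a small sketch rather than part of the original notes: it uses the built-in html.parser instead of lxml so that no extra package is needed, and it passes string=, the newer name of the text argument in recent Beautiful Soup 4 releases. The variable names are only illustrative.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# name filter as a regular expression: tags whose name starts with "p"
tag_names = [tag.name for tag in soup.find_all(re.compile("^p"))]

# attribute filter: class_ avoids the Python keyword "class"
sister_ids = [a["id"] for a in soup.find_all("a", class_="sister")]

# find() returns only the first match (or None when nothing matches)
second_link = soup.find("a", id="link2")

# string filter: returns the matching strings, not the tags around them
story_strings = soup.find_all(string=re.compile("story"))

# a tag whose only child is a comment exposes it through .string
comment = soup.find(id="link1").string

print(tag_names, sister_ids, second_link.get_text())
print(story_strings, isinstance(comment, Comment))
```

Because find() returns None when nothing matches, real scraping code usually checks the result before reading attributes from it.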

weekly 2019-02-15

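I have started learning Python. I can use it as a backend language, for working through LeetCode problems, for learning web scraping and so on.

This week I studied:

- Basic Python syntax
- The BeautifulSoup and Requests libraries

My study notes are recorded here. I am not fluent with the syntax yet, so I will take it slowly and solve practice problems whenever I have time.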

Getting Started with BeautifulSoup4

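BeautifulSoup is one of the best-known HTML parsing packages for Python, and it is simple to use.

Installation:

    pip install beautifulsoup4

Mind the capitalization, and do not install the package named BeautifulSoup: that name refers to version 3.0, which is no longer updated.

Common syntax

See my earlier article: "BeautifulSoup: using and testing some common features".

    # Create an instance
    soup = BeautifulSoup(html, 'html5lib')

Selectors

Which selector to use depends a lot on the page:

- In the vast majority of cases the CSS selector select() is enough.
- When searching by an attribute whose name contains special characters such as "-", the name cannot be passed as a keyword argument, so use find() with an attrs dictionary.

    # Preferred selector: CSS selector (returns a list of tags)
    results = soup.select('div[class*=hello_world] ~ div')
    for tag in results:
        print(tag.string)  # the tag's text if it has a single child string, otherwise None
        # print(tag.get_text())  # all text inside the tag

    # Exact selector for one tag: returns a single tag (or None)
    tag = soup.find('div', attrs={'class': 'detail-block'})
    print(tag.get_text())

    # Exact selector for several tags: returns a list of tags, not text
    results = soup.find_all('div', attrs={'class': 'detail-block'})

    # Selector for tags with several classes; the key part is "class*="
    results = soup.select('div[class*=hello_world] ~ div')

Getting values

    tag = soup.find('a')
    # Only the text content of the tag
    text = tag.get_text()
    # The whole tag as markup (such as <a href='sdfj'> asdfa</a>)
    s = str(tag)
    # An attribute of the tag
    link = tag['href']

Changing values

Reference: "Beautiful Soup (4): modifying the document tree"

    tag = soup.find('a', attrs={'class': 'detail-block'})
    # Change an attribute
    tag['href'] = 'https://google.com'
    # Change the content between <tag> and </tag>
    tag.string = 'New Content'
    # Delete an attribute
    del tag['class']

Object types

When we search for tags with a selector, BeautifulSoup returns different types of objects depending on the function used, and each type has to be handled in its own way.

- Tag (<class 'bs4.element.Tag'>): use tag.string and tag.get_text()
- Navigable string (bs4.element.NavigableString):
- Comment (<class 'bs4.element.Comment'>):

Adding, deleting and changing tags

Reference: "Changing page content with BeautifulSoup"

    # Change the content of a tag
    tag = soup.find('title')
    tag.string = 'New Title'

...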
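The difference between .string, get_text() and the full markup of a tag is easy to mix up, so here is a short sketch that shows what each one returns. The HTML snippet, the data-id attribute and the variable names are made up for illustration, and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

html = '<div class="detail-block main" data-id="7"><a href="/x">Hello <b>world</b></a></div>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

only_string = link.string     # None: <a> has two children, a string and a <b> tag
inner_text = link.get_text()  # all text inside the tag, nested tags included
markup = str(link)            # the tag itself as HTML
bold_text = link.b.string     # <b> has exactly one child string, so this is its text

# Attribute names with "-" cannot be keyword arguments, so use an attrs dict.
by_data_attr = soup.find("div", attrs={"data-id": "7"})

# A tag carrying several classes: chain the classes, or match a substring.
by_classes = soup.select("div.detail-block.main")
by_substring = soup.select('div[class*="detail"]')

print(only_string, inner_text, markup, bold_text)
print(by_data_attr is by_classes[0], len(by_substring))
```

find_all() and select() both return lists of Tag objects, so get_text() or an attribute lookup is applied to each item afterwards.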

Scraping the Baidu Hot News Ranking

    import requests
    from bs4 import BeautifulSoup


    def get_html(url, headers):
        r = requests.get(url, headers=headers)
        r.encoding = r.apparent_encoding
        return r.text


    def get_pages(html):
        soup = BeautifulSoup(html, 'html.parser')
        all_topics = soup.find_all('tr')[1:]  # skip the table header row
        for each_topic in all_topics:
            topic_times = each_topic.find('td', class_='last')    # search index
            topic_rank = each_topic.find('td', class_='first')    # rank
            topic_name = each_topic.find('td', class_='keyword')  # title
            if topic_rank is not None and topic_name is not None and topic_times is not None:
                # Remove spaces and newlines from the cell text
                topic_rank = topic_rank.get_text().replace(' ', '').replace('\n', '')
                topic_name = topic_name.get_text().replace(' ', '').replace('\n', '')
                topic_times = topic_times.get_text().replace(' ', '').replace('\n', '')
                # print('排名：{}，标题：{}，热度：{}'.format(topic_rank, topic_name, topic_times))
                # chr(12288) is the full-width space, used as fill so Chinese titles line up
                tplt = "排名：{0:^4}\t标题：{1:{3}^15}\t热度：{2:^8}"
                print(tplt.format(topic_rank, topic_name, topic_times, chr(12288)))


    def main():
        # URL of the Baidu hot topics ranking
        url = 'http://top.baidu.com/buzz?b=1&fr=20811'
        headers = {'User-Agent': 'Mozilla/5.0'}
        html = get_html(url, headers)
        get_pages(html)


    if __name__ == '__main__':
        main()

...
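The format template in the script relies on a nested fill character: {1:{3}^15} centers argument 1 in 15 columns and pads it with argument 3, the full-width space chr(12288), which keeps Chinese titles aligned better than ASCII spaces do. A minimal sketch of that one technique, with a hypothetical helper name format_row, a narrower title column and made-up sample values:

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, as wide as one CJK character


def format_row(rank, name, heat, name_width=10):
    """Center each field; pad the name with full-width spaces."""
    template = "{0:^4}\t{1:{4}^{3}}\t{2:^8}"
    return template.format(rank, name, heat, name_width, FULLWIDTH_SPACE)


# Sample values are invented for illustration.
for row in [("1", "热点新闻", "4521"), ("2", "天气", "998")]:
    print(format_row(*row))
```

The width and the fill character are both ordinary format arguments here, so the column width can be changed without editing the template string.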

Scraping High-Resolution Honor of Kings Hero Skins with Python

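Preface

Just before leaving work I saw people in a group chat discussing using Honor of Kings skins as desktop wallpaper, some high-resolution and some less so. A quick search turned up images, but at inconsistent resolutions, so I decided to scrape the high-resolution skins straight from the official site. That is the subject of this article.

Scraping approach

Finding the hero list

On the official site, go to the hero introduction and choose to view more heroes; this shows every hero at the following link.

Hero list: https://pvp.qq.com/web201605/herolist.shtml

Hero details

Clicking a hero opens its detail page with the basic introduction and the skin showcase. The skins we want are in the lower right corner; hovering over each thumbnail shows that skin.

Details for Luban: https://pvp.qq.com/web201605/herodetail/112.shtml

Analyzing the skin image URL

Using F12 on Luban's page to locate a skin thumbnail shows an img element inside an li element, with the two attributes src and data-imgname. It is easy to see that src holds the small image, while data-imgname is the large image URL we need. However, the page source shows that this attribute is not present in the HTML, so we have to work out the URL pattern in order to get the skin images of the other heroes. The pattern is not hard to find: 112 is the hero's id, and the 2 in bigskin-2 is the number of that hero's skin image.

Writing the crawler

Step 1: define some common variables
Step 2: fetch the list of all heroes
Step 3: loop over the heroes and parse each hero's skin nodes
Step 4: download the images
Step 5: the crawler finishes

Full source

That was a lot of rambling, and showing the code would have been more useful. Fair enough, here is the source. I have also uploaded it to GitHub.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    """
    Scrape Honor of Kings skins
    author: gxcuizy
    date: 2018-11-06
    """

    import requests
    from bs4 import BeautifulSoup
    from urllib import parse
    import os


    class Skin(object):
        def __init__(self):
            # JSON data for all heroes
            self.hero_url = 'https://pvp.qq.com/web201605/js/herolist.json'
            # Common URL prefix of the hero detail pages
            self.base_url = 'https://pvp.qq.com/web201605/herodetail/'
            # URL suffix of a hero detail page
            self.detail_url = ''
            # Folder where images are stored
            self.img_folder = 'skin'
            # Common prefix of the image URLs
            self.skin_url = 'https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/'
            # Suffix of an image URL
            self.skin_detail_url = ''

        def get_hero(self):
            """Fetch the hero JSON data"""
            request = requests.get(self.hero_url)
            hero_list = request.json()
            return hero_list

        def get_hero_skin(self, hero_name, hero_no):
            """Read the skin showcase on the detail page and download the images"""
            url = parse.urljoin(self.base_url, self.detail_url)
            request = requests.get(url)
            request.encoding = 'gbk'
            html = request.text
            # Nodes that hold the skin information
            soup = BeautifulSoup(html, 'lxml')
            skip_list = soup.select('.pic-pf-list3')
            for skin_info in skip_list:
                # Skin names, separated by "|"
                img_names = skin_info.attrs['data-imgname']
                name_list = img_names.split('|')
                skin_no = 1
                # Download each skin image in turn
                for skin_name in name_list:
                    self.skin_detail_url = '%s/%s-bigskin-%s.jpg' % (hero_no, hero_no, skin_no)
                    skin_no += 1
                    img_name = hero_name + '-' + skin_name + '.jpg'
                    self.download_skin(img_name)

        def download_skin(self, img_name):
            """Download one skin image"""
            img_url = parse.urljoin(self.skin_url, self.skin_detail_url)
            request = requests.get(img_url)
            if request.status_code == 200:
                print('download-%s' % img_name)
                img_path = os.path.join(self.img_folder, img_name)
                with open(img_path, 'wb') as img:
                    img.write(request.content)
            else:
                print('img error!')

        def make_folder(self):
            """Create the image folder"""
            if not os.path.exists(self.img_folder):
                os.mkdir(self.img_folder)

        def run(self):
            """Entry point of the script"""
            self.make_folder()
            hero_list = self.get_hero()
            for hero in hero_list:
                hero_no = str(hero['ename'])
                self.detail_url = hero_no + '.shtml'
                hero_name = hero['cname']
                self.get_hero_skin(hero_name, hero_no)


    # Program entry point
    if __name__ == '__main__':
        skin = Skin()
        skin.run()

Closing

The idea really is that simple. If you have other approaches or ideas, feel free to leave a comment. One more thing: anyone interested could try scraping the high-resolution skins of every League of Legends hero. Questions are welcome in the comments as well.

...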
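The URL pattern described above (hero id plus a running skin number) can be isolated in a small pure function, which makes it easy to check without downloading anything. This is only a sketch: skin_urls is a hypothetical helper, and the skin names in the usage line are placeholders, not data taken from the site.

```python
from urllib.parse import urljoin

SKIN_BASE = "https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/"


def skin_urls(hero_no, img_names):
    """Pair each "|"-separated skin name with its image URL.

    Skins are numbered from 1 in the order in which the names appear.
    """
    names = img_names.split("|")
    return [
        (name, urljoin(SKIN_BASE, "{0}/{0}-bigskin-{1}.jpg".format(hero_no, number)))
        for number, name in enumerate(names, start=1)
    ]


# Placeholder skin names, for illustration only.
for name, url in skin_urls(112, "skin-one|skin-two"):
    print(name, url)
```

Because SKIN_BASE ends with a slash, urljoin appends the relative part instead of replacing the last path segment.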