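Sometimes a film can be more inspiring than a book. A very high rating suggests that a film has won over audiences around the world, and when life feels confusing, a highly rated film can help you find a way forward. So that nobody runs out of things to watch at home, I used Python to scrape the Douban Movie Top 250 list. First, open the URL: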
https://movie.douban.com/top250?start=0&filter=
Next, let's work out the pattern in the page links:
https://movie.douban.com/top250?start=25&filter=
……
https://movie.douban.com/top250?start=50&filter=
……
https://movie.douban.com/top250?start=225&filter=
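The listing URLs above differ only in the start query parameter. As a rough sketch, they could be generated rather than typed out (page_urls and its parameters are hypothetical names, not part of the original script):

```python
def page_urls(step=25, last_start=225):
    """Return the listing URLs for start = 0, step, ..., last_start."""
    base = "https://movie.douban.com/top250?start={}&filter="
    # range() excludes its stop value, so add 1 to include last_start itself.
    return [base.format(start) for start in range(0, last_start + 1, step)]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0&filter=
print(urls[-1])    # https://movie.douban.com/top250?start=225&filter=
```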
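To summarize, start takes the values 0, 25, 50, ..., 225, so the step is 25 and the maximum is 225. With the URL pattern worked out, let's look at how to get the elements. As usual, press F12 to inspect the page: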
[Image]
From this we can see that each movie is wrapped in an li tag under the ol tag, so we only need to grab the li elements. The tags are obtained as follows:
ol = soup.find('ol')
li = ol.find_all('li')
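To see what these two calls return, here is a small sketch run against a hand-written HTML fragment. The fragment is a simplified stand-in and an assumption on my part; the real Douban page carries much more markup per movie.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one listing page (not the real Douban markup).
html = """
<ol>
  <li><span>Movie A</span></li>
  <li><span>Movie B</span></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")        # first <ol> element in the document
li = ol.find_all("li")      # every <li> element beneath that <ol>
titles = [item.find("span").text for item in li]
print(titles)   # ['Movie A', 'Movie B']
```

Note that soup.find returns None when no ol tag is present (for example, if the site serves an error page instead of the list), so checking for None before calling find_all may be worthwhile.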
Now let's move on to the code:
import re
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
from openpyxl.styles import Alignment


def top250():
    wb = Workbook()
    ws = wb['Sheet']  # default sheet created by Workbook()
    num = 0   # value of the start query parameter
    num1 = 0  # number of pages scraped so far
    lst = []
    name_lst = []
    dy_lst = []
    zy_lst = []
    time_lst = []
    country_lst = []
    leixing_lst = []
    pj_lst = []
    people_lst = []
    quote_lst = []
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 Edg/81.0.416.68",
    }
    while num <= 225:
        url = 'https://movie.douban.com/top250?start=' + str(num) + '&filter='
        with requests.get(url=url, headers=headers) as r:
            if r.status_code == 200:
                r.encoding = r.apparent_encoding
                soup = BeautifulSoup(r.text, 'html.parser')
                ol = soup.find('ol')
                li = ol.find_all('li')
                for i in li:
                    name = i.find('span').text
                    # The first <p> holds "director / cast" before <br/>
                    # and "year / region / genre" after it.
                    p = i.find('p')
                    p = str(p).split('<br/>', 1)
                    fst = p[0].split('>', 1)[1]
                    sec = p[1].split('<', 1)[0]
                    # If director or cast come out wrong, check whether the
                    # page uses a fullwidth ':' or an ASCII ':' after the labels.
                    daoyan = fst.split('主', 1)[0]
                    daoyan = daoyan.replace('导演:', '')
                    daoyan = daoyan.replace('\xa0', '')
                    daoyan = daoyan.replace('\n', '')
                    try:
                        zhuyan = fst.split('主演:', 1)[1]
                    except IndexError:
                        # No cast label on this entry.
                        zhuyan = ''
                    time = sec.split(' / ', 2)[0]
                    time = time.replace('\xa0', '')
                    time = time.replace('\n', '')
                    time = re.findall(r'\d{4}', time)[-1]
                    country = sec.split('\xa0/\xa0', 2)[1]
                    country = country.replace('\xa0', '')
                    type = sec.split('\xa0/\xa0', 2)[2]
                    type = type.replace('\xa0', '')
                    type = type.replace('\n', '')
                    star = i.find('div', attrs={'class': "star"})
                    span = star.find_all('span')
                    pingjia = span[1].text
                    # The vote count text ends with "人评价"; keep the part before "评价".
                    people = span[3].text.split('评价', 1)[0]
                    try:
                        quote = i.find('p', attrs={'class': "quote"}).text
                        quote = quote.replace('\n', '')
                    except AttributeError:
                        # find() returned None: this movie has no one-line quote.
                        quote = ''
                    name_lst.append(name)
                    dy_lst.append(daoyan)
                    zy_lst.append(zhuyan)
                    time_lst.append(time)
                    country_lst.append(country)
                    leixing_lst.append(type)
                    pj_lst.append(pingjia)
                    people_lst.append(people)
                    quote_lst.append(quote)
                num1 += 1
                print('Page {} scraped!'.format(num1))
                if num == 225:
                    print('Scraping finished, writing to Excel...')
                    paiming = list(range(1, 251))
                    lst.append(paiming)
                    lst.append(name_lst)
                    lst.append(dy_lst)
                    lst.append(zy_lst)
                    lst.append(time_lst)
                    lst.append(country_lst)
                    lst.append(leixing_lst)
                    lst.append(pj_lst)
                    lst.append(people_lst)
                    lst.append(quote_lst)
                    # Columns: rank, title, director, cast, year, region,
                    # genre, rating, number of ratings, one-line quote.
                    head = ['排名', '电影名称', '导演', '主演', '年份', '地区', '类型', '评分', '评价人数', '一句简介']
                    ws.append(head)
                    # lst[i] is column i; data rows start at row 2, below the header.
                    for i in range(len(lst)):
                        for j in range(len(lst[0])):
                            ws.cell(j + 2, i + 1).value = lst[i][j]
                    print('Excel data written!')
                    # Center the header row and the short columns.
                    for key in ('1', 'A', 'B', 'E', 'H', 'I'):
                        for cell in ws[key]:
                            cell.alignment = Alignment(horizontal='center', vertical='center')
                    ws.column_dimensions['B'].width = 25
                    ws.column_dimensions['I'].width = 13
                    wb.save('豆瓣电影 top250.xlsx')
                num += 25
            else:
                print('Request failed!')
                # Stop here; otherwise num never advances and the loop never ends.
                break


if __name__ == '__main__':
    print('Starting scrape!')
    top250()
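The string handling inside the loop is the fiddly part of the script, so here it is isolated as a small sketch that can be tried without any network access. parse_info is a hypothetical helper, and the two sample strings are only an illustration of the layout the script expects (labels 导演 and 主演 on the first line, fields separated by non-breaking spaces around a slash on the second); the exact spacing and colon style on the live page are assumptions.

```python
import re

# Illustrative sample of the two text lines inside a movie's first <p>.
fst = "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /..."
sec = "\n1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n"


def parse_info(fst, sec):
    """Hypothetical helper: split the two lines into five fields."""
    # Accept either an ASCII or a fullwidth colon after the labels.
    head, *rest = re.split(r"主演[::]", fst, maxsplit=1)
    director = re.sub(r"导演[::]", "", head).replace("\xa0", "").strip()
    cast = rest[0].strip() if rest else ""  # empty when the cast label is missing
    # Raises ValueError if the second line does not have three parts.
    year_part, region, genre = (s.strip() for s in sec.split("\xa0/\xa0", 2))
    year = re.findall(r"\d{4}", year_part)[-1]
    return {
        "director": director,
        "cast": cast,
        "year": year,
        "region": region,
        "genre": genre,
    }


info = parse_info(fst, sec)
print(info["director"])  # 弗兰克·德拉邦特 Frank Darabont
print(info["year"])      # 1994
```

Pulling the parsing into one function like this makes it possible to test the splitting rules on saved strings before pointing the scraper at the site.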
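Right-click and run the code. When it finishes, a file named 豆瓣电影 top250.xlsx appears in the current folder, containing the scraped information for all the movies. Output like the screenshot below means the program ran successfully.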
[Image]
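The Excel step in the script fills the sheet cell by cell from parallel per-column lists. As a possible alternative, zip can turn those lists into rows, each of which could then be handed to ws.append. The sketch below uses plain lists and made-up placeholder values only; columns_to_rows is a hypothetical helper name.

```python
def columns_to_rows(columns):
    """Transpose parallel column lists into a list of row lists."""
    return [list(row) for row in zip(*columns)]


# Placeholder values for illustration only, not scraped data.
paiming = [1, 2, 3]
name_lst = ["Movie A", "Movie B", "Movie C"]
pj_lst = ["9.0", "8.9", "8.8"]

rows = columns_to_rows([paiming, name_lst, pj_lst])
for row in rows:
    print(row)
```

One thing to keep in mind: zip stops at the shortest list, so a page that failed to scrape would silently shorten the output instead of raising an IndexError as the nested loops do.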
That's all I have to share today.