共计 5506 个字符,预计需要花费 14 分钟才能阅读完成。
对于某博,大家都十分的相熟,大家日常的生存中都喜爱,刷刷某博,看看微博上的实时热点,看看明天产生了什么新鲜事,又有什么瓜能够吃。
往年注定是内娱不平庸的一年,作为一名吃瓜大众示意,吃得我忽然感觉“累了”!大家总说各种综艺也好、各种大瓜也好,原本认为去年十月份开始的各种爱豆的塌房已是极限了。
但没有想到,这一开年:郑某、张某、张某某、华某某,也算是刷新了塌房的底线了。
作为“热搜绝缘体”的汪某老师示意:“我太难了”。只是不晓得下一次汪峰老师什么时候会有新的动静,到时又给咱们这些吃瓜大众带来怎么样的惊喜。
ok,废话不多说了,当热搜爆进去的时候,吃瓜大众会有什么样的反馈呢?明天咱们通过 python 爬取热点下的评论,来看看吃瓜大众都说了些什么。
以该链接为例:
https://m.weibo.cn/detail/4633967301757583
通过下图的形式获取爬取链接的 cookie 值,因代码中应用的 cookie 被解决过,所以可能会不可能获取到相干数据,请依据下图的形式获取本人的 cookie 值进行替换。
接着咱们切到 NetWork,对数据进行筛选,发现数据是通过接口进行的返回,这样就简略了,只须要构建申请接口所须要的数据即可,那么接口须要哪些数据呢?请看下图:
依据上图的剖析,咱们就能够获取到构建接口的参数,至于参数怎么构建就不作过多赘言,接下来让咱们进行编码环节:
这是小编发现的一个很好的代理平台 大家有须要能够支付下
# coding:utf-8
# __auth__ = "maiz"
import sys
import requests
import random
import time
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import re
class Weibo(object):
url = 'https://m.weibo.cn/comments/hotflow?' # 评论主链
url_chi = 'https://m.weibo.cn/comments/hotFlowChild?' # 子评论主链
user_agent = ["Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"UCWEB7.0.2.37/28/999",
"NOKIA5700/ UCWEB7.0.2.37/28/999",
"Openwave/ UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 6.0;) Opera/UCWEB7.0.2.37/28/999",
"Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]
cookies = ['_T_WM=543292696212; WEIBOCN_FROM=1110006030; XSRF-TOKEN=82baa0; loginScene=102003; SUB=_2A25NtanXDeRhGeFI6lQZ8yrKyTyIHXVvWTefrDV6PUJbktB-LUSskW1NfVq9yQwFLX6ZFuL9fmFotKnrlgUo2uf2; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9Wh2DsSw6p8Yd5p9kFmFfS865JpX5KzhUgL.FoMceKqRe0Bceo52dJLoIp7LxKML1KBLBKnLxKqL1hnLBoMNSo2c1heXSoz7; SSOLoginState=1622268295; MLOGIN=1; M_WEIBOCN_PARAMS=oid%3D4635613573678655%26lfid%3D102803%26luicode%3D20000174']
headers = {'User-Agent': random.choice(user_agent),
'Cookie': random.choice(cookies)
}
params = {}
params_chi = {}
list_text = [] # 贮存评论文本
data_text = {}
cid_list = []
date_list = []
def get_max_id(self):
try:
try:
response = requests.get(url=self.url, headers=self.headers, params=self.params).json() # 爬取网页
# print(response)
except:
print('e')
response = requests.get(url=self.url, headers=self.headers, params=self.params).json() # 爬取网页
pass
max_id = response['data']['max_id'] # get 申请时的参数,从前一页 json 源码中获取
max_id_type = response['data']['max_id_type']
data = response['data']['data']
# print(len(data))
for i in range(0, len(data)):
text = data[i]['text']
date = data[i]['created_at']
self.data_text = date # 将评论日期贮存为字典的值,评论贮存为键
cid = data[i]['id']
# print(text)
self.list_text.append(text)
self.cid_list.append(cid)
# print(self.data_text)
except:
max_id = self.params['max_id']
max_id_type = self.params['max_id_type']
print('谬误!!!')
print('爬到:', max_id, max_id_type)
self.save_data()
sys.exit()
return max_id, max_id_type
def __init__(self):
# num_page = int(input('请输出你要爬的页数:'))
# ID = input('请输出你要爬取内容的 id:')
num_page = 5
ID = 4633967301757583
return_info = ('0', '0') # 给第一次爬取的页面初始参数, 若程序中途暂停,可输出返回的值从新爬取
for i in range(0, num_page):
print(f'正在爬取第 {i + 1} 页')
time.sleep(random.randint(0, 5)) # 反爬进展
self.params = {
'id': ID,
'mid': ID,
'max_id': return_info[0],
'max_id_type': return_info[1]
}
# print("----------------------")
# print(self.params)
# print("----------------------")
return_info = self.get_max_id()
print(f'第 {i+1} 页爬取结束')
self.save_data()
def save_data(self): # 保留数据
for text in self.data_text:
with open('weibo_10_9.txt', 'a', encoding='utf-8') as f:
f.write('\n')
date = self.data_text
# print(date)
f.write(text)
f.write('\t')
f.write(date)
f.write('\n')
f.write('\n')
m = Weibo()
# 导入文本数据并进行简略的文本处理
# 去掉换行符和空格
text = open("weibo_10_9.txt", encoding='utf8').read()
pattern = re.compile(r'[^\u4e00-\u9fa5|\n]') # \u4e00-\u9fa5 代表所有中文字符,\n 代表换行符。这里匹配所有的非中文与非换行符
text = re.sub(pattern, '', text)
text = text.replace('\n', "")
# 分词,返回后果为词的列表
text_cut = jieba.lcut(text)
# 将分好的词用某个符号宰割开连成字符串
text_cut = ' '.join(text_cut)
# 导入停词
# 用于去掉文本中的无用词
stop_words = open("stopwords.txt", encoding="utf8").read().split("\n")
# 应用 WordCloud 生成词云
word_cloud = WordCloud(
scale=12,
font_path=r"C:\Windows\Fonts\simhei.ttf", # 设置词云字体
background_color="white", # 词云图的背景色彩
stopwords=stop_words # 去掉的停词
)
wc = word_cloud.generate(text_cut)
# 使用 matplotlib 展示后果
plt.subplots(figsize=(12, 8))
plt.imshow(wc)
plt.axis("off")
plt.show()
plt.savefig('ci.png')
右击运行代码,即可在以后文件夹下,生成一个对于评论的 txt 文件,并杜绝 txt 内的内容生成一个词云图。代码获取回复:”某博评论爬取“。
以上就是明天给大家分享的内容.
正文完