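Sometimes you want to read the comments a particular user has written, only to find that their profile is set to be visible to themselves alone. A small Python program can help: the comments posted under each song are still returned by the server, and the script below collects them together with each author's nickname. Here is a worked example: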
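Looking at the requests the page makes, we can see that the comments are obtained by sending a POST request to
music.163.com/weapi/v1/resource/comments/R_SO_4_26075485?csrf_token=
and that the request carries two parameters, params and encSecKey.
In other words, all we have to do is imitate the browser and send the same POST request to the NetEase Cloud Music server, and the comments come back.
Two things deserve attention here. First, the string of digits after R_SO_4_ in the POST URL is the id of the song. Second, the two parameters that must be sent need a careful look of their own (covered later).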
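As a small illustration of the URL structure just described, the sketch below builds the comment endpoint for a given song id. The helper name `build_comment_url` is hypothetical and not part of the original script; the URL pattern itself is the one shown above.

```python
# URL pattern taken from the request shown above; the song id fills the placeholder.
COMMENT_API = 'https://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token='


def build_comment_url(song_id):
    """Return the comment endpoint for one song (hypothetical helper)."""
    return COMMENT_API.format(song_id)


print(build_comment_url(26075485))
```

Keeping the pattern in one constant means the song id is the only thing that changes from request to request.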
**Step 1**
The code is as follows:
import random
import re
import time

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}
baseUrl = 'https://music.163.com'


def getHtml(url):
    # Fetch a page with a browser-like User-Agent and return its HTML text
    r = requests.get(url, headers=headers)
    html = r.text
    return html


def getUrl():
    # Start from the newest playlists
    startUrl = 'https://music.163.com/discover/playlist/?order=new'
    html = getHtml(startUrl)
    # Each match is (playlist title, playlist href, author name, author href)
    pattern = re.compile('<li>.*?<p.*?class="dec">.*?<.*?title="(.*?)".*?href="(.*?)".*?>.*?span class="s-fc4".*?title="(.*?)".*?href="(.*?)".*?</li>', re.S)
    result = re.findall(pattern, html)
    # Total number of playlist pages (the closing '</a>' in this pattern is an assumption)
    pageNum = re.findall(r'<span class="zdot".*?class="zpgi">(.*?)</a>', html, re.S)[0]
    info = []
    for i in result:
        data = {}
        data['title'] = i[0]
        url = baseUrl + i[1]
        print(url)
        data['url'] = url
        data['author'] = i[2]
        data['authorUrl'] = baseUrl + i[3]
        info.append(data)
        getSongSheet(url)
        time.sleep(random.randint(1, 10))
        break  # stop after the first playlist
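This is one of the quirks of NetEase Cloud Music: the URL shown in the browser's address bar contains a `#`, and it has to be removed before crawling. With the `#` removed, the page source looks like this: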
![](https://upload-images.jianshu.io/upload_images/7933544-ba9a4003bde734ac?imageMogr2/auto-orient/strip|imageView2/2/w/951/format/webp)
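A minimal sketch of that URL clean-up, assuming the browser-style address places the `#` right after the host (as in `https://music.163.com/#/...`). The helper name `strip_hash` is hypothetical and does not appear in the original script.

```python
def strip_hash(url):
    """Turn a browser-style URL containing '/#/' into one that can be requested directly."""
    # Only the first '/#/' is replaced; the rest of the path and query stay untouched.
    return url.replace('/#/', '/', 1)


print(strip_hash('https://music.163.com/#/discover/playlist/?order=new'))
```

A URL that has no `#` passes through unchanged, so the helper is safe to apply to every address before calling `getHtml`.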
**Step 2**
def getSongSheet(url):
    # Collect the id of every song in the playlist; the id is the key to the POST request that follows
    html = getHtml(url)
    result = re.findall(r'<li><a.*?href="/song\?id=(.*?)">(.*?)</a></li>', html, re.S)
    # Discard the last match (presumably a non-song entry picked up by the pattern)
    result.pop()
    musicList = []
    for i in result:
        data = {}
        headers1 = {
            'Referer': 'https://music.163.com/song?id={}'.format(i[0]),
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
        }
        musicUrl = baseUrl + '/song?id=' + i[0]
        print(musicUrl)
        # Song URL
        data['musicUrl'] = musicUrl
        # Song title
        data['title'] = i[1]
        musicList.append(data)
        postUrl = 'https://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token='.format(i[0])
        param = {'params': get_params(1),
                 'encSecKey': get_encSecKey()}
        r = requests.post(postUrl, data=param, headers=headers1)
        total = r.json()
        # Total number of comments
        total = int(total['total'])
        # Base page count: 20 comments per page, using integer division
        comment_TatalPage = total // 20
        print(comment_TatalPage)
        # A remainder means one extra page; an exact multiple needs none
        if total % 20 != 0:
            comment_TatalPage = comment_TatalPage + 1
        comment_data, hotComment_data = getMusicComments(comment_TatalPage, postUrl, headers1)
        # If a duplicate-ID error occurs when saving to the database, check whether only one record was crawled
        saveToMongoDB(str(i[1]), comment_data, hotComment_data)
        print('End!')
        time.sleep(random.randint(1, 10))
        break  # stop after the first song
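Using the song id, we build postUrl and send a POST request for the first page (how to construct the POST so that it returns the information we want is covered later). The response gives the total number of comments, from which we work out the total number of pages.
We then call the function that fetches the song's comments.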
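The page-count arithmetic can be written as a single ceiling division. This is a minimal sketch; the function name `total_pages` is hypothetical, and the page size of 20 is the value used in the code above.

```python
def total_pages(total, page_size=20):
    """Number of pages needed to hold `total` comments (ceiling division)."""
    return (total + page_size - 1) // page_size


for n in (0, 20, 21, 45):
    print(n, total_pages(n))
```

Adding `page_size - 1` before the integer division rounds any partial page up, which replaces the separate remainder check.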
**Step 3**
def getMusicComments(comment_TatalPage, postUrl, headers1):
    commentinfo = []
    hotcommentinfo = []
    # Walk through every page of comments
    for j in range(1, comment_TatalPage + 1):
        # Fetch the page and decode the JSON once
        r = getPostApi(j, postUrl, headers1)
        page = r.json()
        # Regular comments
        for i in page['comments']:
            com_info = {}
            com_info['content'] = i['content']
            com_info['author'] = i['user']['nickname']
            com_info['likedCount'] = i['likedCount']
            commentinfo.append(com_info)
        # Hot comments can only be fetched from the first page
        if j == 1:
            for i in page['hotComments']:
                hot_info = {}
                hot_info['content'] = i['content']
                hot_info['author'] = i['user']['nickname']
                hot_info['likedCount'] = i['likedCount']
                hotcommentinfo.append(hot_info)
        print('Page ' + str(j) + ' crawled...')
        time.sleep(random.randint(1, 10))
    print(commentinfo)
    print('\n-----------------------------------------------------------\n')
    print(hotcommentinfo)
    return commentinfo, hotcommentinfo
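To see what the parsing step does without touching the network, here is a small sketch that applies the same field selection to a hand-written payload. The helper `extract_comments` and the sample data are made up for illustration; only the field names (`comments`, `hotComments`, `content`, `user.nickname`, `likedCount`) come from the code above.

```python
def extract_comments(payload, key):
    """Keep only content, author nickname and like count for each comment under `key`."""
    return [
        {
            'content': c['content'],
            'author': c['user']['nickname'],
            'likedCount': c['likedCount'],
        }
        # A missing key is treated as "no comments of this kind on this page"
        for c in payload.get(key, [])
    ]


# Made-up sample in the shape the code above expects, not real API output
sample_page = {
    'total': 1,
    'comments': [
        {'content': 'Nice song', 'user': {'nickname': 'alice'}, 'likedCount': 3},
    ],
    'hotComments': [],
}

print(extract_comments(sample_page, 'comments'))
print(extract_comments(sample_page, 'hotComments'))
```

Using one helper for both lists removes the duplicated loops, and `payload.get(key, [])` keeps the call safe on pages that may not carry a `hotComments` field.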