User reviews on Amazon give a direct sense of whether a product is worth buying, and the star ratings can be collected to build a weighted score.
An Amazon review consists of the user ID, the star rating and review title, the region and date, and the review body; those are exactly the fields we will scrape here.
Test link: https://www.amazon.it/product...
1. Analyzing Amazon's review requests
First, open developer tools, switch to the Network tab, click Clear, and reload the page to make a fresh request:
You will find a GET request under the Doc tab that contains some of the review information we want. But the full review data is not all here: scroll down the page and there is a pagination button:
Click it to request the next page. A new request appears under the Fetch/XHR tab, while no new GET request shows up under the Doc tab. So the complete review data is loaded through XHR requests.
From that request we can grab the POST URL and the payload, which contains the parameters controlling pagination. The real review request has been found.
The response is a blob of unprocessed data; within it, the chunks carrying data-hook=\"review\" are the ones that contain reviews. With the analysis done, we can start building the request step by step.
2. Fetching the Amazon review content
First assemble the POST parameters and the request URL (so that we can paginate automatically later), then send the POST request with those parameters:
```python
import requests

headers = {
    'authority': 'www.amazon.it',
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
}

page = 1
post_data = {
    "sortBy": "recent",
    "reviewerType": "all_reviews",
    "formatType": "",
    "mediaType": "",
    "filterByStar": "",
    "filterByLanguage": "",
    "filterByKeyword": "",
    "shouldAppend": "undefined",
    "deviceType": "desktop",
    "canShowIntHeader": "undefined",
    "pageSize": "10",
    "asin": "B08GHGTGQ2",
}
# Set the payload parameters that control pagination
post_data["pageNumber"] = page
post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{page}"
post_data["scope"] = f"reviewsAjax{page}"
# Build the paginated request URL
spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{page}'
res = requests.post(spiderurl, headers=headers, data=post_data)
if res and res.status_code == 200:
    res = res.content.decode('utf-8')
    print(res)
```
We now have the raw, unprocessed response; next, let's process this data.
3. Processing the Amazon review data
Looking at the response, you will notice that it is divided into segments by "&&&", and each segment is in turn delimited by '","':
So we use Python's split method to break the string into lists:
```python
# Split the response string
contents = res.split('&&&')
for content in contents:
    infos = content.split('","')
```
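To make the two-stage split concrete, here is a small offline demonstration on a hypothetical, hand-written stand-in for the response (the markup and escaping are illustrative only, not real Amazon output):

```python
# A simplified stand-in for Amazon's "&&&"-separated AJAX response:
# each chunk is a JSON-like UI instruction whose last element is escaped HTML.
sample = ('["append","#cm_cr-review_list","<div data-hook=\\"review\\">Great!</div>"]'
          '&&&'
          '["update","#cm_cr-state","done"]')

for chunk in sample.split('&&&'):        # stage 1: one chunk per UI instruction
    payload = chunk.split('","')[-1]     # stage 2: the HTML payload is the last element
    # Strip trailing "], literal \n sequences, and the escaping backslashes
    cleaned = payload.replace('"]', '').replace('\\n', '').replace('\\', '')
    if 'data-hook="review"' in cleaned:
        print(cleaned)                   # <div data-hook="review">Great!</div>
```

Only the first chunk survives the `data-hook="review"` filter; the second (a state-update instruction) is discarded, which is exactly the filtering the real scraper relies on.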
Each segment split on '","' yields a new list whose last element is the review HTML. After stripping the "\" and "\n" escapes and the leftover trailing characters, it can be parsed with CSS/XPath selectors:
```python
from scrapy import Selector

for content in contents:
    infos = content.split('","')
    info = infos[-1].replace('"]', '').replace('\\n', '').replace('\\', '')
    # Keep only the chunks that actually contain a review
    if 'data-hook="review"' in info:
        sel = Selector(text=info)
        data = {}
        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first()  # username
        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first()  # rating
        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first()  # date and location
        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first()  # review title
        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first()  # review body
        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
        data['image'] = image if image else "not image"  # image
        print(data)
```
4. Putting the code together
4.1 Proxy setup
A stable IP proxy is your most powerful tool for data collection. At the moment, access to Amazon from within China is unreliable and connections often fail. Here I use ipidea proxies to reach the Italian Amazon site; proxies can be obtained with username/password auth or via an API, and the speed is quite stable.
Address: http://www.ipidea.net/
The proxy-fetching method is as follows:
```python
# Fetch an IP from the proxy API
def getApiIp(self):
    # Fetch one (and only one) IP ------ Italy
    api_url = 'your proxy API URL'
    try:
        res = requests.get(api_url, timeout=5)
        if res.status_code == 200:
            api_data = res.json()['data'][0]
            proxies = {
                'http': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                'https': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
            }
            print(proxies)
            return proxies
        else:
            print('Failed to fetch proxy')
    except Exception:
        print('Failed to fetch proxy')
```
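The proxies-dict construction can be factored out and verified offline. This helper is my own refactoring sketch, reusing the 'ip'/'port' field names from the API response above; the address in the example is a placeholder:

```python
def build_proxies(api_data):
    """Build a requests-style proxies dict from one proxy-API record."""
    addr = 'http://{}:{}'.format(api_data['ip'], api_data['port'])
    # requests routes both schemes through the same HTTP proxy endpoint
    return {'http': addr, 'https': addr}

print(build_proxies({'ip': '203.0.113.10', 'port': 2333}))
# {'http': 'http://203.0.113.10:2333', 'https': 'http://203.0.113.10:2333'}
```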
4.2 Pagination with a while loop
A while loop drives the pagination; reviews go up to at most 99 pages, after which we break out of the loop:
```python
def getPLPage(self):
    while True:
        # Set the payload parameters that control pagination
        self.post_data["pageNumber"] = self.page
        self.post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{self.page}"
        self.post_data["scope"] = f"reviewsAjax{self.page}"
        # Build the paginated request URL
        spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{self.page}'
        res = self.getRes(spiderurl, self.headers, '', self.post_data, 'POST')  # our own request wrapper
        if res:
            res = res.content.decode('utf-8')
            # Split the response string
            contents = res.split('&&&')
            for content in contents:
                infos = content.split('","')
                info = infos[-1].replace('"]', '').replace('\\n', '').replace('\\', '')
                # Keep only the chunks that actually contain a review
                if 'data-hook="review"' in info:
                    sel = Selector(text=info)
                    data = {}
                    data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first()  # username
                    data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first()  # rating
                    data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first()  # date and location
                    data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first()  # review title
                    data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first()  # review body
                    image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
                    data['image'] = image if image else "not image"  # image
                    print(data)
        if self.page <= 99:
            print('Next Page')
            self.page += 1
        else:
            break
```
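The three page-dependent payload fields can also be factored into a small helper so they stay consistent with each other; this is my own refactoring sketch of the assignments above, not part of the original scraper:

```python
def build_paging_params(base, page):
    """Return a copy of the payload with the page-dependent fields set."""
    data = dict(base)  # don't mutate the shared base payload
    data["pageNumber"] = page
    data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{page}"
    data["scope"] = f"reviewsAjax{page}"
    return data

params = build_paging_params({"pageSize": "10"}, 2)
print(params["scope"])  # reviewsAjax2
```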
The final, integrated code:
```python
# coding=utf-8
import requests
from scrapy import Selector

class getReview():
    page = 1
    headers = {
        'authority': 'www.amazon.it',
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
    }
    post_data = {
        "sortBy": "recent",
        "reviewerType": "all_reviews",
        "formatType": "",
        "mediaType": "",
        "filterByStar": "",
        "filterByLanguage": "",
        "filterByKeyword": "",
        "shouldAppend": "undefined",
        "deviceType": "desktop",
        "canShowIntHeader": "undefined",
        "pageSize": "10",
        "asin": "B08GHGTGQ2",
    }
    # The asin parameter in post_data is currently hard-coded; it comes from
    # "https://www.amazon.it/product-reviews/B08GHGTGQ2?ie=UTF8&pageNumber=1&reviewerType=all_reviews&pageSize=10&sortBy=recent"
    # The asin value may change, in which case it can be taken from the GET request.

    def getPLPage(self):
        while True:
            # Set the payload parameters that control pagination
            self.post_data["pageNumber"] = self.page
            self.post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{self.page}"
            self.post_data["scope"] = f"reviewsAjax{self.page}"
            # Build the paginated request URL
            spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{self.page}'
            res = self.getRes(spiderurl, self.headers, '', self.post_data, 'POST')  # our own request wrapper
            if res:
                res = res.content.decode('utf-8')
                # Split the response string
                contents = res.split('&&&')
                for content in contents:
                    infos = content.split('","')
                    info = infos[-1].replace('"]', '').replace('\\n', '').replace('\\', '')
                    # Keep only the chunks that actually contain a review
                    if 'data-hook="review"' in info:
                        sel = Selector(text=info)
                        data = {}
                        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first()  # username
                        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first()  # rating
                        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first()  # date and location
                        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first()  # review title
                        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first()  # review body
                        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
                        data['image'] = image if image else "not image"  # image
                        print(data)
            if self.page <= 99:
                print('Next Page')
                self.page += 1
            else:
                break

    # Fetch an IP from the proxy API
    def getApiIp(self):
        # Fetch one (and only one) IP ------ Italy
        api_url = 'your proxy API URL'
        try:
            res = requests.get(api_url, timeout=5)
            if res.status_code == 200:
                api_data = res.json()['data'][0]
                proxies = {
                    'http': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                    'https': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                }
                print(proxies)
                return proxies
            else:
                print('Failed to fetch proxy')
        except Exception:
            print('Failed to fetch proxy')

    # Dedicated request method: retry through the proxy up to three times,
    # return None after three failures
    def getRes(self, url, headers, proxies, post_data, method):
        if proxies:
            for i in range(3):
                try:
                    # POST request through the given proxy
                    if method == 'POST':
                        res = requests.post(url, headers=headers, data=post_data, proxies=proxies)
                    # GET request through the given proxy
                    else:
                        res = requests.get(url, headers=headers, proxies=proxies)
                    if res:
                        return res
                except Exception:
                    print(f'Request error, attempt {i + 1}')
            return None
        else:
            for i in range(3):
                proxies = self.getApiIp()
                try:
                    # POST request with a freshly fetched proxy
                    if method == 'POST':
                        res = requests.post(url, headers=headers, data=post_data, proxies=proxies)
                    # GET request with a freshly fetched proxy
                    else:
                        res = requests.get(url, headers=headers, proxies=proxies)
                    if res:
                        return res
                except Exception:
                    print(f'Request error, attempt {i + 1}')
            return None

if __name__ == '__main__':
    getReview().getPLPage()
```
Summary
There were two pitfalls in scraping Amazon reviews: first, the review data is loaded through XHR requests; second, the response needs string processing. Once analyzed, the scrape itself is quite simple. Find the right request format, use a stable IP proxy to get twice the result for half the effort, spot the pattern shared by the data chunks, and the problem solves itself.