因为版权不明,以下所有的那个网站用S代替,那个第三方网站用SDB代替。
在之前的文章中爬取了S的热销商品,也阐明了因为Cloudflare的浏览器验证导致SDB无奈爬取,连selenium也不行。过后我就放弃了。
然而前一段时间,有一个伙计给我讲:【用playwright啊!】
playwright反对多种语法,相比于selenium的Http协定,playwright的Websocket获取浏览器状况会更好一点。
Playwright的应用
装置:须要python3.7+
1.pip install --upgrade pip 2.pip install playwright 3.playwright install
一次装置,Playwright就能够通过开发者工具与你装置的浏览器 (chromiun, firefox and webkit)进行交互,不像selenium下载对应浏览器版本的Driver了。
本次只讲一下最根本的页面获取,其余的性能各位自行查看文档吧:
1.from playwright.sync_api import sync_playwright 2. 3.with sync_playwright() as p: 4. browser = p.webkit.launch() 5. page = browser.new_page() 6. page.goto("http://whatsmyuseragent.org/") 7. page.wait_for_load_state('networkidle') 8. html = page.content() 9. browser.close()
html就是曾经加载好的注释的内容,获取到的货色就能够交给选择器去解决跟筛选了。
SDB实例
SDB,是一个第三方的S的数据库,在线人数,游戏价格等等都能查问的到。
在线人数,在线排名都有展现进去。当初开始获取数据。
1.with sync_playwright() as p: 2. try: 3. browser = p.chromium.launch(headless=False) 4. page = browser.new_page() 5. page.goto('https://steamdb.info/graph/') 6. page.wait_for_load_state('networkidle') 7. html = page.content() 8. browser.close() 9. return html 10. except Exception as e: 11. print(e)
不晓得为什么开启无头就是通过不了。这样的话本页的html内容就获取下来了:
接下来用选择器进行内容解析:
1.sel = Selector(text=content) 2.nodes = sel.css('#table-apps .app') 3.for node in nodes: 4. title = node.css('td:nth-child(3) a::text').extract_first() 5. current = node.css('td:nth-child(4)::text').extract_first() 6. peakToday = node.css('td:nth-child(5)::text').extract_first() 7. allPeak = node.css('td:nth-child(6)::text').extract_first() 8. print(f"游戏:{title},目前在线:{current},今日最高在线:{peakToday},历史最高在线:{allPeak}")
Playwright的代理配置
Playwright配置代理其实很简略,要在浏览器配置那一行加上proxy参数就能够了:
browser = p.chromium.launch(headless=False,proxy=proxy)
自己目前应用的是ipidea的代理,新用户能够白嫖流量哦。SDB国内拜访慢,用上稳固的代理配上好用的playwright,事倍功半!
地址:http://www.ipidea.net/
代码整合
1.from playwright.sync_api import sync_playwright 2.from parsel import Selector 3. 4.def getSteaminfo(): 5. 6. proxy = { 7. 'server': "", 8. 'username': "", 9. 'password': '' 10. } 11. 12. with sync_playwright() as p: 13. try: 14. browser = p.chromium.launch(headless=False,proxy=proxy) 15. page = browser.new_page() 16. page.goto('https://steamdb.info/graph/') 17. page.wait_for_load_state('networkidle') 18. html = page.content() 19. browser.close() 20. return html 21. except Exception as e: 22. print(e) 23. 24.def start(): 25. content = getSteaminfo() 26. sel = Selector(text=content) 27. nodes = sel.css('#table-apps .app') 28. for node in nodes: 29. title = node.css('td:nth-child(3) a::text').extract_first() 30. current = node.css('td:nth-child(4)::text').extract_first() 31. peakToday = node.css('td:nth-child(5)::text').extract_first() 32. allPeak = node.css('td:nth-child(6)::text').extract_first() 33. print(f"游戏:{title},目前在线:{current},今日最高在线:{peakToday},历史最高在线:{allPeak}") 34. 35.if __name__ == '__main__': start()