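Date: April 1, 2021. All version numbers in this post reflect that date.
Problem 1: The page uses GB2312 encoding, and all Chinese text is garbled after scraping
Environment: node@8.12.0, cheerio@0.22.0
The site uses GB2312 encoding. I first fetched the page directly with http; after loading it into cheerio, every Chinese character printed with console.log came out garbled: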
const http = require('http')
const cheerio = require('cheerio')

const baseUrl = '******'

http.get(baseUrl, res => {
  let html = ''
  // Each chunk is a Buffer; += converts it to a string as UTF-8
  res.on('data', data => {
    html += data
  })
  res.on('end', () => {
    downloadHandler(html)
  })
}).on('error', () => {
  console.log('Something went wrong')
})

function downloadHandler(html) {
  const $ = cheerio.load(html) // default parsing
  console.log($.html())
}
Output with cheerio's default parsing (entities decoded):
Output with entity decoding disabled:
function downloadHandler(html) {
  const $ = cheerio.load(html, { decodeEntities: false }) // do not decode entities
  console.log($.html())
}
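Cause: Node does not support GB2312. Its built-in string encodings do not include it, so html += data decodes the GB2312 bytes as UTF-8 and the Chinese characters are lost.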
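The effect can be reproduced without any network request. This is only a minimal sketch: the four bytes below are the GB2312/GBK encoding of "中文", and the variable names are made up for illustration.

```javascript
// The GB2312/GBK byte sequence for the two characters "中文".
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4])

// Node's built-in Buffer encodings do not include gb2312 or gbk.
const gb2312Supported = Buffer.isEncoding('gb2312')
const gbkSupported = Buffer.isEncoding('gbk')

// `html += data` converts the Buffer to a string, which decodes it as UTF-8.
// These bytes are not valid UTF-8, so they turn into U+FFFD replacement characters.
const decodedAsUtf8 = '' + gbkBytes

console.log(gb2312Supported, gbkSupported)
console.log(decodedAsUtf8 === '中文')
console.log(decodedAsUtf8.includes('\uFFFD'))
```

Once the bytes have been turned into replacement characters the original text cannot be recovered, which is why changing cheerio options alone does not help.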
Fix: use superagent instead of http, together with superagent-charset to handle the encoding
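const request = require('superagent')
require('superagent-charset')(request) // adds .charset() to superagent requests
const cheerio = require('cheerio')

const baseUrl = '******'

request.get(baseUrl)
  .buffer(true)
  .charset('gbk') // GBK is a superset of GB2312
  .end((err, res) => {
    if (err) {
      console.log('Something went wrong')
      return
    }
    downloadHandler(res)
  })

function downloadHandler(res) {
  const htmlText = res.text // already decoded into a normal JS string
  const $ = cheerio.load(htmlText, { decodeEntities: false })
}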
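If adding superagent is not an option, a dependency-free variant might also work. This is a sketch under one big assumption: the Node binary must be built with full ICU (the default for official builds since Node 13, but not for node@8), otherwise `new TextDecoder('gbk')` throws. `decodeGbk` is a hypothetical helper name, not part of any library.

```javascript
const { TextDecoder } = require('util')

// Hypothetical helper: join the raw chunks first, then decode once as GBK.
// Decoding chunk by chunk could split a two-byte character across two chunks.
function decodeGbk(chunks) {
  return new TextDecoder('gbk').decode(Buffer.concat(chunks))
}

// "中文" in GBK, deliberately split in the middle of the first character,
// the way a network stream might deliver it.
const chunks = [Buffer.from([0xd6]), Buffer.from([0xd0, 0xce, 0xc4])]
console.log(decodeGbk(chunks))
```

In the http.get handler this would mean pushing each data Buffer into an array instead of doing html += data, and calling decodeGbk on that array in the end handler.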
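Problem 2: Each iteration of a loop fires an asynchronous request; the code after the loop has to wait until every one of those requests has returned
Fix: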
const idList = [1, 2, 3, 4]

getData(idList).then(data => {
  // data holds every response, in the same order as idList
})

function getData(idList) {
  // request is the superagent instance set up in Problem 1.
  // request.get(...) is already thenable, so no extra new Promise wrapper is needed.
  const asyncPool = idList.map(id => {
    return request.get(`http://detail/${id}`)
      .buffer(true)
      .charset('gbk')
  })
  // Resolves once every request has finished; rejects as soon as any one fails
  return Promise.all(asyncPool)
}
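The same pattern can be tried without hitting a real site. In this sketch `fakeFetch` and `getAll` are hypothetical names; `fakeFetch` stands in for the real request and resolves the smaller ids later, so the responses arrive out of order.

```javascript
// Hypothetical stand-in for a real request: the delay is longer for
// smaller ids, so id 4 finishes first and id 1 finishes last.
function fakeFetch(id) {
  return new Promise(resolve => {
    setTimeout(() => resolve(`page ${id}`), (5 - id) * 10)
  })
}

// Start every request inside the loop, then wait for all of them outside it.
function getAll(idList, fetchOne) {
  const pending = idList.map(id => fetchOne(id))
  return Promise.all(pending)
}

const done = getAll([1, 2, 3, 4], fakeFetch)
done.then(pages => {
  // Promise.all keeps the input order, regardless of which request finished first.
  console.log(pages)
})
```

Note that Promise.all starts all requests at once and rejects on the first failure, so for a long id list some form of throttling or per-request error handling may be worth adding.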
Problem 3: Running puppeteer fails with: unexpected token {
Environment: node@8.12.0, npm@6
Cause: the Node version is too old
Fix: upgrade Node to the latest stable release, 14.16.0, and reinstall puppeteer
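A small guard at the top of the script can turn the cryptic syntax error into a readable message. This is only a sketch: `majorVersion` and `checkNode` are hypothetical helpers, and the threshold of 10 is an assumption; the real requirement should be read from the "engines" field in the installed puppeteer's package.json.

```javascript
// Hypothetical helper: extract the major version from a string such as "8.12.0".
function majorVersion(version) {
  return parseInt(version.split('.')[0], 10)
}

// minMajor is an assumed threshold, not puppeteer's documented requirement.
function checkNode(minMajor, version = process.versions.node) {
  return majorVersion(version) >= minMajor
}

if (!checkNode(10)) {
  console.error(`Node ${process.versions.node} is probably too old for puppeteer`)
}
```

The check has to run before puppeteer is required, since the syntax error is thrown while the package is being loaded.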
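Problem 4: Running puppeteer fails with: could not find expected browser
Environment: node@14.16.0, npm@6
After upgrading Node I reinstalled puppeteer. I watched it download a Chromium build whose revision number was something like 865, yet when I ran the code puppeteer complained that it could not find revision 865 of the browser locally (even though it had just been downloaded and the revision numbers matched).
I eventually found a possible answer in the official issues: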
https://github.com/puppeteer/puppeteer/issues/6586
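Cause: an npm version issue. It seems that when npm 6 installs puppeteer, the browser revision that actually gets downloaded does not match the one puppeteer expects
Fix: upgrade npm to 7 and reinstall puppeteer; after that it runs normally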
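To see whether the browser puppeteer expects is really on disk, the path it reports can be checked directly. A rough sketch: `browserExists` is a hypothetical helper, and the commented lines assume puppeteer is installed and that `puppeteer.executablePath()` returns the path of the bundled Chromium.

```javascript
const fs = require('fs')

// Hypothetical helper: report whether a browser binary exists at the given path.
function browserExists(executablePath) {
  return typeof executablePath === 'string' && fs.existsSync(executablePath)
}

// With puppeteer installed, the expected path could be checked like this:
//   const puppeteer = require('puppeteer')
//   console.log(puppeteer.executablePath())
//   console.log(browserExists(puppeteer.executablePath()))

// The running node binary is used here only as a stand-in for a real path.
console.log(browserExists(process.execPath))
```

If the reported path does not exist while a different revision folder does, that would be consistent with the mismatch described in the issue.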
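Problem 5: After upgrading Node, cheerio stops working and fails with: content.forEach() is not a function
Environment: node@14.16.0, npm@7, cheerio@v1.0.0-rc5
After upgrading Node and npm to the latest stable releases, I also reinstalled all of the existing packages
cheerio suddenly started throwing this error on every run, although it had worked before.
Once again the answer was in the official issues: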
https://github.com/cheeriojs/cheerio/issues/1591
Cause: a cheerio version issue
Cheerio.load expects there to be array. Older versions was there condition for checking, if it is not array. Seems like it is lost due to optimizations.
older version wraps it into array
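The kind of guard the quote describes can be sketched like this. It is a hypothetical illustration, not cheerio's actual source; `toArray` and the sample node object are made up.

```javascript
// Hypothetical guard, not cheerio's actual code: wrap a single value in an
// array so that a later forEach call is always valid.
function toArray(content) {
  return Array.isArray(content) ? content : [content]
}

const single = { type: 'tag', name: 'div' }

// Without the guard, single.forEach would throw "forEach is not a function".
toArray(single).forEach(node => console.log(node.name))
```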