Go Daily Library: colly

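Introduction

colly is a powerful crawler framework written in Go. It offers a clean API and strong performance, handles cookies and sessions automatically, and provides a flexible extension mechanism.

We start with colly's basic concepts, then walk through several examples that show its usage and features: fetching GitHub Trending, fetching the Baidu novel hot list, and downloading images from Unsplash.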

Quick Start

The code in this article uses Go Modules.

Create a directory and initialize the module:

$ mkdir colly && cd colly
$ go mod init github.com/darjun/go-daily-lib/colly

Install the colly library:

$ go get -u github.com/gocolly/colly/v2

Usage:

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  // Restrict the crawl to a single domain.
  c := colly.NewCollector(
    colly.AllowedDomains("www.baidu.com"),
  )

  // Follow every link found on a page.
  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    c.Visit(e.Request.AbsoluteURL(link))
  })

  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  c.OnResponse(func(r *colly.Response) {
    fmt.Printf("Response %s: %d bytes\n", r.Request.URL, len(r.Body))
  })

  c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error %s: %v\n", r.Request.URL, err)
  })

  c.Visit("http://www.baidu.com/")
}

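Using colly is fairly simple:

First, call colly.NewCollector() to create a crawler object of type *colly.Collector. Every web page contains many links to other pages, so without a restriction the crawl might never stop. That is why the code above passes the option colly.AllowedDomains("www.baidu.com"), which limits the crawl to pages on the domain www.baidu.com.

Next, we call c.OnHTML() to register an HTML callback that runs for every a element with an href attribute. Here the callback goes on to visit the URL that href points to. In other words, colly parses each fetched page and then follows the links on it to other pages.

c.OnRequest() registers a request callback that runs every time a request is sent. Here it simply prints the request URL.

c.OnResponse() registers a response callback that runs every time a response is received. Here it simply prints the URL and the response size.

c.OnError() registers an error callback that runs when a request fails. Here it simply prints the URL and the error.

Finally, we call c.Visit() to start with the first page.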
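To make the link-following step more concrete, here is a minimal sketch that uses only the standard library. It resolves an href against the page URL and checks the host against an allow list, which is roughly the effect of combining AbsoluteURL() with AllowedDomains(). The function name resolveAllowed is made up for this illustration, and the logic is not colly's actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAllowed (hypothetical helper) resolves href against pageURL and
// reports whether the result is an http(s) URL whose host is in allowed.
func resolveAllowed(pageURL, href string, allowed []string) (string, bool) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false // for example "javascript:;"
	}
	for _, host := range allowed {
		if abs.Hostname() == host {
			return abs.String(), true
		}
	}
	return "", false
}

func main() {
	allowed := []string{"www.baidu.com"}
	hrefs := []string{"/", "javascript:;", "http://news.baidu.com", "/more"}
	for _, href := range hrefs {
		abs, ok := resolveAllowed("http://www.baidu.com/", href, allowed)
		fmt.Printf("%-24q allowed=%-5v %s\n", href, ok, abs)
	}
}
```

Relative links such as "/" resolve to the same host and pass the check, while links to other subdomains or non-HTTP schemes are rejected.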

Run it:

$ go run main.go
Visiting http://www.baidu.com/
Response http://www.baidu.com/: 303317 bytes
Link found: "百度首页" -> /
Link found: "设置" -> javascript:;
Link found: "登录" -> https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
Link found: "新闻" -> http://news.baidu.com
Link found: "hao123" -> https://www.hao123.com
Link found: "地图" -> http://map.baidu.com
Link found: "直播" -> https://live.baidu.com/
Link found: "视频" -> https://haokan.baidu.com/?sfrom=baidu-top
Link found: "贴吧" -> http://tieba.baidu.com
...

After colly fetches a page, it parses the page with goquery. It then looks up the element selector of each registered HTML callback, wraps every matching goquery.Selection in a colly.HTMLElement, and runs the callback with it.

colly.HTMLElement is just a thin wrapper around goquery.Selection:

type HTMLElement struct {
  Name     string
  Text     string
  Request  *Request
  Response *Response
  DOM      *goquery.Selection
  Index    int
}

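It also provides a set of easy-to-use methods:

Attr(k string): returns an attribute of the current element. In the example above we used e.Attr("href") to get the href attribute;
ChildAttr(goquerySelector, attrName string): returns the attrName attribute of the first child element matched by goquerySelector;
ChildAttrs(goquerySelector, attrName string): returns the attrName attribute of all child elements matched by goquerySelector, as a []string;
ChildText(goquerySelector string): concatenates and returns the text content of the child elements matched by goquerySelector;
ChildTexts(goquerySelector string): returns the text content of the child elements matched by goquerySelector, as a []string;
ForEach(goquerySelector string, callback func(int, *HTMLElement)): runs callback for every child element matched by goquerySelector;
Unmarshal(v interface{}): fills a struct instance from an HTMLElement, using goquerySelector-style tags on the struct fields.

These methods are used very often. Below we go through several examples to show colly's features and usage.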

GitHub Trending

I previously wrote an API that fetches GitHub Trending. With colly the same job is more convenient:

package main

import (
  "fmt"
  "strconv"
  "strings"

  "github.com/gocolly/colly/v2"
)

type Repository struct {
  Author  string
  Name    string
  Link    string
  Desc    string
  Lang    string
  Stars   int
  Forks   int
  Add     int
  BuiltBy []string
}

func main() {
  c := colly.NewCollector(
    colly.MaxDepth(1),
  )

  repos := make([]*Repository, 0, 15)
  c.OnHTML(".Box .Box-row", func(e *colly.HTMLElement) {
    repo := &Repository{}

    // author & repository name
    authorRepoName := e.ChildText("h1.h3 > a")
    parts := strings.Split(authorRepoName, "/")
    if len(parts) < 2 {
      return // unexpected markup: skip this row instead of panicking
    }
    repo.Author = strings.TrimSpace(parts[0])
    repo.Name = strings.TrimSpace(parts[1])

    // link
    repo.Link = e.Request.AbsoluteURL(e.ChildAttr("h1.h3 > a", "href"))

    // description
    repo.Desc = e.ChildText("p.pr-4")

    // language
    repo.Lang = strings.TrimSpace(e.ChildText("div.mt-2 > span.mr-3 > span[itemprop]"))

    // star & fork
    starForkStr := e.ChildText("div.mt-2 > a.mr-3")
    starForkStr = strings.Replace(strings.TrimSpace(starForkStr), ",", "", -1)
    parts = strings.Split(starForkStr, "\n")
    repo.Stars, _ = strconv.Atoi(strings.TrimSpace(parts[0]))
    repo.Forks, _ = strconv.Atoi(strings.TrimSpace(parts[len(parts)-1]))

    // add (stars gained today)
    addStr := e.ChildText("div.mt-2 > span.float-sm-right")
    parts = strings.Split(addStr, " ")
    repo.Add, _ = strconv.Atoi(parts[0])

    // built by
    e.ForEach("div.mt-2 > span.mr-3  img[src]", func(index int, img *colly.HTMLElement) {
      repo.BuiltBy = append(repo.BuiltBy, img.Attr("src"))
    })

    repos = append(repos, repo)
  })

  c.Visit("https://github.com/trending")

  fmt.Printf("%d repositories\n", len(repos))
  fmt.Println("first repository:")
  for _, repo := range repos {
    fmt.Println("Author:", repo.Author)
    fmt.Println("Name:", repo.Name)
    break
  }
}

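We use ChildText to get the author, repository name, language, star and fork counts, and today's added stars, and ChildAttr to get the repository link. That link is a relative path, so we call e.Request.AbsoluteURL() to turn it into an absolute URL.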
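The text returned by ChildText usually needs some cleanup before it can be stored. The sketch below isolates two of those steps with the standard library only: splitting an "author / name" string and converting a count such as "12,345" into an int. The helper names splitRepo and parseCount are hypothetical, and the sample strings are assumptions about what the page text might look like.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitRepo splits text such as "Shopify / dawn" into author and name.
// ok is false when the text contains no "/" separator.
func splitRepo(s string) (author, name string, ok bool) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1]), true
}

// parseCount converts a display count such as "12,345" into an int.
func parseCount(s string) (int, error) {
	return strconv.Atoi(strings.ReplaceAll(strings.TrimSpace(s), ",", ""))
}

func main() {
	author, name, ok := splitRepo("Shopify /\n\n      dawn")
	fmt.Println(author, name, ok)

	n, err := parseCount(" 12,345 ")
	fmt.Println(n, err)
}
```

Returning an ok flag or an error, rather than indexing the split result directly, keeps a markup change on the site from crashing the crawler.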

Run it:

$ go run main.go
25 repositories
first repository:
Author: Shopify
Name: dawn

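Baidu Novel Hot List

The page structure breaks down as follows:

Each hot-list entry sits in its own div.category-wrap_iQLoo;
div.index_1Ew5p under the a element is the rank;
the content is in div.content_1YWBm;
a.title_dIF3B inside the content is the title;
the content has two div.intro_1l0wp elements: the first is the author, the second is the genre;
div.desc_3CTjT inside the content is the description.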

From this we define a struct:

type Hot struct {
  Rank   string `selector:"a > div.index_1Ew5p"`
  Name   string `selector:"div.content_1YWBm > a.title_dIF3B"`
  Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
  Type   string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
  Desc   string `selector:"div.desc_3CTjT"`
}

The tags use CSS selector syntax. They are there so that we can call HTMLElement.Unmarshal() directly to fill a Hot object.
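For readers unfamiliar with struct tags, the following sketch shows how a tag such as selector:"..." can be read at run time with the reflect package. It only lists the field-to-selector mapping. It is not how colly implements Unmarshal(), which also has to run each selector against the DOM. The function name selectorsOf is made up for this example.

```go
package main

import (
	"fmt"
	"reflect"
)

type Hot struct {
	Rank   string `selector:"a > div.index_1Ew5p"`
	Name   string `selector:"div.content_1YWBm > a.title_dIF3B"`
	Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
	Type   string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
	Desc   string `selector:"div.desc_3CTjT"`
}

// selectorsOf (hypothetical helper) returns a map from field name to the
// value of its `selector` tag for a struct or pointer to struct.
func selectorsOf(v interface{}) map[string]string {
	out := make(map[string]string)
	t := reflect.TypeOf(v)
	if t == nil {
		return out
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return out
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if sel, ok := f.Tag.Lookup("selector"); ok {
			out[f.Name] = sel
		}
	}
	return out
}

func main() {
	sels := selectorsOf(&Hot{})
	fmt.Println(len(sels), "tagged fields")
	fmt.Println("Rank ->", sels["Rank"])
}
```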

Then create the Collector object:

c := colly.NewCollector()

Register the callbacks:

var hots []*Hot // collected entries

c.OnHTML("div.category-wrap_iQLoo", func(e *colly.HTMLElement) {
  hot := &Hot{}
  err := e.Unmarshal(hot)
  if err != nil {
    fmt.Println("error:", err)
    return
  }
  hots = append(hots, hot)
})

c.OnRequest(func(r *colly.Request) {
  fmt.Println("Requesting:", r.URL)
})

c.OnResponse(func(r *colly.Response) {
  fmt.Println("Response:", len(r.Body))
})

OnHTML runs Unmarshal on each entry to produce a Hot object.

OnRequest and OnResponse simply print debugging information.

Then call c.Visit() to visit the URL:

err := c.Visit("https://top.baidu.com/board?tab=novel")
if err != nil {
  fmt.Println("Visit error:", err)
  return
}

Finally, add some debug output:

fmt.Printf("%d hots\n", len(hots))
for _, hot := range hots {
  fmt.Println("first hot:")
  fmt.Println("Rank:", hot.Rank)
  fmt.Println("Name:", hot.Name)
  fmt.Println("Author:", hot.Author)
  fmt.Println("Type:", hot.Type)
  fmt.Println("Desc:", hot.Desc)
  break
}

Output:

Requesting: https://top.baidu.com/board?tab=novel
Response: 118083
30 hots
first hot:
Rank: 1
Name: 逆天邪神
Author: 作者：火星引力
Type: 类型：玄幻
Desc: 掌天毒之珠，承邪神之血，修逆天之力，一代邪神，君临天下！  查看更多>
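The extracted values still carry page labels ("作者：", "类型：") and the trailing "查看更多>" link text. If only the bare values are wanted, a small post-processing step could look like the sketch below. The helpers stripLabel and cleanDesc are hypothetical and assume the labels always end with a full-width colon, as they do in the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// stripLabel removes a leading "label：" prefix (full-width colon) if present.
func stripLabel(s string) string {
	const sep = "："
	if i := strings.Index(s, sep); i >= 0 {
		return strings.TrimSpace(s[i+len(sep):])
	}
	return strings.TrimSpace(s)
}

// cleanDesc drops the trailing "查看更多>" link text and surrounding spaces.
func cleanDesc(s string) string {
	s = strings.TrimSuffix(strings.TrimSpace(s), "查看更多>")
	return strings.TrimSpace(s)
}

func main() {
	fmt.Println(stripLabel("作者：火星引力"))
	fmt.Println(stripLabel("类型：玄幻"))
	fmt.Println(cleanDesc("一代邪神，君临天下！  查看更多>"))
}
```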

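Unsplash

Most of the background images for my WeChat articles come from Unsplash, a site that offers a large and varied collection of free images. Its one drawback is that it is rather slow to access. Since we are learning about crawlers anyway, this is a good chance to download the images with a program.

The Unsplash home page shows a grid of images, each wrapped in a link.

However, the images on the home page are small versions. Following the link on an image opens its detail page, and the larger image is found in that page's structure.

Three layers of pages are involved (the img URL itself must also be fetched at the end). With a single colly.Collector object the OnHTML callbacks would have to be set up very carefully, which adds a lot of mental overhead. colly supports multiple Collectors, so we write the code that way: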

func main() {
  c1 := colly.NewCollector()
  c2 := c1.Clone()
  c3 := c1.Clone()

  // c1: collect the detail-page links on the home page.
  c1.OnHTML("figure[itemProp] a[itemProp]", func(e *colly.HTMLElement) {
    href := e.Attr("href")
    if href == "" {
      return
    }

    c2.Visit(e.Request.AbsoluteURL(href))
  })

  // c2: find the image URL on each detail page.
  c2.OnHTML("div._1g5Lu > img[src]", func(e *colly.HTMLElement) {
    src := e.Attr("src")
    if src == "" {
      return
    }

    c3.Visit(src)
  })

  c1.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })

  c1.OnError(func(r *colly.Response, err error) {
    fmt.Println("Visiting", r.Request.URL, "failed:", err)
  })
}

We use three Collector objects. The first collects the image links on the home page, the second visits those links, and the third downloads the images. Above we also registered request and error callbacks on the first Collector.

Once the third Collector has downloaded the image data, it saves the data locally:

func main() {
  // ... omitted
  var count uint32
  c3.OnResponse(func(r *colly.Response) {
    fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
    err := r.Save(fileName)
    if err != nil {
      fmt.Printf("saving %s failed:%v\n", fileName, err)
    } else {
      fmt.Printf("saving %s success\n", fileName)
    }
  })

  c3.OnRequest(func(r *colly.Request) {
    fmt.Println("visiting", r.URL)
  })
}

The code above uses atomic.AddUint32() to generate a sequence number for each image.
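The atomic counter matters once callbacks can run concurrently, for example after asynchronous crawling is enabled. The following standalone sketch, with the hypothetical helpers nextFileName and uniqueNames, shows that atomic.AddUint32() hands out a distinct number to every goroutine without a lock around the counter itself.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// nextFileName returns a unique image file name; safe for concurrent use.
func nextFileName(counter *uint32) string {
	return fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(counter, 1))
}

// uniqueNames calls nextFileName from n goroutines and returns how many
// distinct names were produced.
func uniqueNames(n int) int {
	var (
		counter uint32
		mu      sync.Mutex // protects seen, not the counter
		wg      sync.WaitGroup
	)
	seen := make(map[string]bool)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			name := nextFileName(&counter)
			mu.Lock()
			seen[name] = true
			mu.Unlock()
		}()
	}
	wg.Wait()
	return len(seen)
}

func main() {
	fmt.Println(uniqueNames(100)) // prints 100: no two goroutines share a name
}
```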

Run the program, and the downloaded images are saved as images/img1.jpg, images/img2.jpg, and so on.

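Asynchronous Crawling

By default colly crawls pages synchronously: it finishes one page before starting the next. The Unsplash program above works this way, so it takes a long time. colly also supports asynchronous crawling, which is enabled by passing the option colly.Async(true) when constructing the Collector:

c1 := colly.NewCollector(
  colly.Async(true),
)

Because the crawl is now asynchronous, the program must wait at the end for the Collectors to finish. Otherwise main returns early and the program exits before the work is done:

c1.Wait()
c2.Wait()
c3.Wait()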
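Why the final wait is needed can be seen in a plain goroutine version of the same idea. This is a sketch with a sync.WaitGroup and a fake fetch function, not colly's internals: fetchAll starts one goroutine per URL and blocks until all of them have reported back.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchAll runs fetch for every URL in its own goroutine and blocks until
// all of them finish, returning the total number of bytes reported.
func fetchAll(urls []string, fetch func(string) int) int64 {
	var (
		wg    sync.WaitGroup
		total int64
	)
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			atomic.AddInt64(&total, int64(fetch(u)))
		}(u)
	}
	wg.Wait() // without this, the function could return before any fetch ends
	return total
}

func main() {
	// fake stands in for a real HTTP request and just reports a size.
	fake := func(u string) int { return len(u) }
	fmt.Println(fetchAll([]string{"a", "bb", "ccc"}, fake)) // prints 6
}
```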

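Run it again and it is much faster.

Version 2

Scrolling down the Unsplash page, we find that the later images are loaded asynchronously. Scroll the page and watch the requests in the Network tab of Chrome's developer tools.

The requests go to a /photos path with the per_page and page parameters set, and the response is a JSON array. That gives us another approach.

Define a struct for each item, keeping only the fields we need:

type Item struct {
  Id     string
  Width  int
  Height int
  Links  Links
}

type Links struct {
  Download string
}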

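Then parse the JSON in the OnResponse callback and, for each item, call Visit() on the Collector responsible for downloading images (d here) with the item's Download link:

c.OnResponse(func(r *colly.Response) {
  var items []*Item
  if err := json.Unmarshal(r.Body, &items); err != nil {
    fmt.Println("decode error:", err)
    return
  }
  for _, item := range items {
    d.Visit(item.Links.Download)
  }
})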
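The decoding step can be tried in isolation. The sketch below feeds a made-up payload through encoding/json using the same Item and Links types. Field matching is case-insensitive, so lowercase JSON keys fill the exported fields without tags. The sample JSON and the helper name downloadLinks are assumptions for illustration, and the real response contains many more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Links struct {
	Download string
}

type Item struct {
	Id     string
	Width  int
	Height int
	Links  Links
}

// downloadLinks decodes a JSON array of items and returns the non-empty
// download links, or an error if the body is not valid JSON.
func downloadLinks(body []byte) ([]string, error) {
	var items []Item
	if err := json.Unmarshal(body, &items); err != nil {
		return nil, fmt.Errorf("decoding photo list: %w", err)
	}
	links := make([]string, 0, len(items))
	for _, it := range items {
		if it.Links.Download != "" {
			links = append(links, it.Links.Download)
		}
	}
	return links, nil
}

func main() {
	// Hypothetical payload, not a captured response.
	body := []byte(`[{"id":"abc","width":4000,"height":3000,
		"links":{"download":"https://example.com/abc/download"}}]`)
	links, err := downloadLinks(body)
	fmt.Println(links, err)
}
```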

To kick off the crawl, we fetch 3 pages of 12 items each (the same page size the site itself requests):

for page := 1; page <= 3; page++ {
  c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
}
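When more query parameters are involved, building the URL with net/url avoids manual escaping. A small sketch follows, in which pageURL is a hypothetical helper and the endpoint is the one used in the loop above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// pageURL builds the paginated photo-list URL for the given page and size.
func pageURL(page, perPage int) string {
	q := url.Values{}
	q.Set("page", strconv.Itoa(page))
	q.Set("per_page", strconv.Itoa(perPage))
	u := url.URL{
		Scheme:   "https",
		Host:     "unsplash.com",
		Path:     "/napi/photos",
		RawQuery: q.Encode(), // keys are emitted in sorted order
	}
	return u.String()
}

func main() {
	for page := 1; page <= 3; page++ {
		fmt.Println(pageURL(page, 12))
	}
}
```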

Run it and check the downloaded images.

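Rate Limiting

When too many requests are sent concurrently, a site may restrict access. This is where LimitRule comes in. Put simply, a LimitRule limits request speed and concurrency:

type LimitRule struct {
  DomainRegexp string
  DomainGlob   string
  Delay        time.Duration
  RandomDelay  time.Duration
  Parallelism  int
}

The commonly used fields are Delay, RandomDelay, and Parallelism, which set the delay between requests, an additional random delay, and the number of concurrent requests. You must also specify which domains the rule applies to through DomainRegexp or DomainGlob. If neither field is set, Limit() returns an error. Applied to the example above:

err := c.Limit(&colly.LimitRule{
  DomainRegexp: `unsplash\.com`,
  RandomDelay:  500 * time.Millisecond,
  Parallelism:  12,
})
if err != nil {
  log.Fatal(err)
}

For the domain unsplash.com, this sets a random delay of up to 500ms between requests and allows at most 12 concurrent requests.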
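The two ideas behind such a rule, a cap on in-flight work and a random pause before each request, can be sketched with a buffered channel used as a semaphore. This is only an illustration under those assumptions and is not how colly implements LimitRule. The function runLimited and its millisecond parameter are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited runs work(0..n-1) with at most parallelism calls in flight,
// sleeping a random duration below maxDelayMillis before each call.
// It returns the highest number of concurrent calls it observed.
func runLimited(n, parallelism, maxDelayMillis int, work func(int)) int32 {
	if parallelism < 1 {
		parallelism = 1
	}
	sem := make(chan struct{}, parallelism) // counting semaphore
	var (
		wg       sync.WaitGroup
		inFlight int32
		maxSeen  int32
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			if maxDelayMillis > 0 {
				time.Sleep(time.Duration(rand.Intn(maxDelayMillis)) * time.Millisecond)
			}
			cur := atomic.AddInt32(&inFlight, 1)
			for { // record the peak concurrency
				old := atomic.LoadInt32(&maxSeen)
				if cur <= old || atomic.CompareAndSwapInt32(&maxSeen, old, cur) {
					break
				}
			}
			work(i)
			atomic.AddInt32(&inFlight, -1)
		}(i)
	}
	wg.Wait()
	return maxSeen
}

func main() {
	peak := runLimited(50, 12, 5, func(int) { time.Sleep(time.Millisecond) })
	fmt.Println("peak concurrency:", peak) // never above 12
}
```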

Setting Timeouts

Sometimes the network is slow. The http.Client used by colly has a default timeout, and the connection-level timeouts can be overridden by calling the Collector's WithTransport() method:

c.WithTransport(&http.Transport{
  Proxy: http.ProxyFromEnvironment,
  DialContext: (&net.Dialer{
    Timeout:   30 * time.Second,
    KeepAlive: 30 * time.Second,
  }).DialContext,
  MaxIdleConns:          100,
  IdleConnTimeout:       90 * time.Second,
  TLSHandshakeTimeout:   10 * time.Second,
  ExpectContinueTimeout: 1 * time.Second,
})
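It may help to separate the two kinds of timeout involved. In net/http, the fields on http.Transport bound individual phases such as dialing and the TLS handshake, while http.Client.Timeout bounds the whole request. The sketch below builds a plain http.Client with both. It is a standard-library illustration rather than colly code, and newClient is a hypothetical helper. As far as I know, colly also exposes a SetRequestTimeout() method on the Collector for the overall limit, which is worth checking in its documentation.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newClient builds an http.Client whose whole-request timeout is the given
// number of seconds, with separate per-phase limits on the transport.
func newClient(seconds int) *http.Client {
	return &http.Client{
		// Upper bound for the entire request, including reading the body.
		Timeout: time.Duration(seconds) * time.Second,
		Transport: &http.Transport{
			Proxy: http.ProxyFromEnvironment,
			DialContext: (&net.Dialer{
				Timeout:   10 * time.Second, // TCP connect
				KeepAlive: 30 * time.Second,
			}).DialContext,
			TLSHandshakeTimeout: 10 * time.Second,
		},
	}
}

func main() {
	c := newClient(30)
	fmt.Println(c.Timeout) // prints 30s
}
```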

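Extensions

colly ships some extensions in the extensions subpackage. The most commonly used one is the random User-Agent. Sites often use the User-Agent header to tell whether a request comes from a browser, so crawlers usually set this header to pass themselves off as one. Using the extension is simple:

import "github.com/gocolly/colly/v2/extensions"

func main() {
  c := colly.NewCollector()
  extensions.RandomUserAgent(c)
}

The random User-Agent implementation is also simple. It picks one entry at random from a predefined list of User-Agent generators and sets it in the header:

func RandomUserAgent(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", uaGens[rand.Intn(len(uaGens))]())
  })
}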

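Writing your own extension is not hard either. For example, if every request needs a particular header, the extension can be written like this:

func MyHeader(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("My-Header", "dj")
  })
}

Then call MyHeader() with the Collector object:

MyHeader(c)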
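The same pattern can be tried without colly by setting the headers on a plain net/http request. This sketch swaps the OnRequest hook for a helper that builds the request. The helper newRequest and the User-Agent strings in the pool are placeholders invented for this illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

// userAgents is a small illustrative pool; the values are placeholders.
var userAgents = []string{
	"Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/3.0",
}

// randomUserAgent picks one entry from the pool at random.
func randomUserAgent() string {
	return userAgents[rand.Intn(len(userAgents))]
}

// newRequest builds a GET request carrying a random User-Agent and the
// custom My-Header value used in the extension above.
func newRequest(target string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", randomUserAgent())
	req.Header.Set("My-Header", "dj")
	return req, nil
}

func main() {
	req, err := newRequest("http://example.com/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("User-Agent:", req.Header.Get("User-Agent"))
	fmt.Println("My-Header:", req.Header.Get("My-Header"))
}
```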

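Summary

colly is the most popular crawler framework in the Go ecosystem and supports a rich set of features. This article introduced the commonly used ones with examples. For reasons of space, some advanced features such as queues and storage are not covered. Readers interested in crawlers can explore them further.

If you come across a fun or useful Go library, feel free to open an issue on the Go Daily Library GitHub repository.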

References

Go Daily Library on GitHub: https://github.com/darjun/go-daily-lib
Go Daily Library: goquery: https://darjun.github.io/2020/10/11/godailylib/goquery/
Implementing a GitHub Trending API in Go: https://darjun.github.io/2021/06/16/github-trending-api/
colly on GitHub: https://github.com/gocolly/colly

About Me

My blog: https://darjun.github.io

You are welcome to follow my WeChat official account, GoUpUp, so we can learn and improve together.