李乐

Problem introduction

  Our production Golang services occasionally fire 502 alerts. Investigation shows they are mostly caused by one of three things:

  1. http.Server has WriteTimeout configured, request handling times out, and Golang closes the connection;
  2. http.Server has IdleTimeout configured, keep-alive connections are used between the gateway and Golang, and Golang closes the connection;
  3. the Golang service panics.

  The third case is trivial; this article focuses on the deeper causes behind the first two.

  Note: the request path is Client ===> Nginx ===> Golang

WriteTimeout

  The Golang SDK comment reads: "WriteTimeout is the maximum duration before timing out writes of the response". http.Server sets the write deadline once it has finished reading the client request:

func (c *conn) readRequest(ctx context.Context) (w *response, err error) {
    if d := c.server.WriteTimeout; d != 0 {
        defer func() {
            c.rwc.SetWriteDeadline(time.Now().Add(d))
        }()
    }
}

  Clearly, when request handling takes longer than WriteTimeout, a timeout occurs. But why does that timeout turn into a 502?

A small experiment

  Let's first reproduce a request-handling timeout:

package main

import (
    "net/http"
    "time"
)

type GinHandler struct {
}

func (*GinHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    time.Sleep(time.Duration(5) * time.Second)
    w.Write([]byte("hello golang"))
}

func main() {
    server := &http.Server{
        Addr:         "0.0.0.0:8080",
        Handler:      &GinHandler{},
        ReadTimeout:  time.Second * 3,
        WriteTimeout: time.Second * 3,
    }
    server.ListenAndServe()
}

  The request result:

time curl http://127.0.0.1/test -H "Host:test.xueersi.com"
502 Bad Gateway
real    0m5.011s

  Nginx's error log shows the upstream actively closed the connection:

2020/08/12 21:18:07 [error] 30217#0: *8105 upstream prematurely closed connection while reading response header from upstream, client: 127.0.0.1, server: test.xueersi.com, request: "GET /test HTTP/1.1", upstream: "http://127.0.0.1:8080/test", host: "test.xueersi.com"

  Let's confirm with a tcpdump capture. The client request arrives at second 35, and at second 40 the server actively closes the connection.

20:52:35.660325 IP 127.0.0.1.31227 > 127.0.0.1.8080: Flags [S], seq 3442614457, win 43690, length 0
14:05:54.073175 IP 127.0.0.1.8080 > 127.0.0.1.31227: Flags [S.], seq 3655119156, ack 3442614458, win 43690, length 0
20:52:35.660353 IP 127.0.0.1.31227 > 127.0.0.1.8080: Flags [.], ack 1, win 86, length 0
20:52:35.660415 IP 127.0.0.1.31227 > 127.0.0.1.8080: Flags [P.], seq 1:106, ack 1, win 86, length 105: HTTP: GET /test HTTP/1.0
20:52:35.660425 IP 127.0.0.1.8080 > 127.0.0.1.31227: Flags [.], ack 106, win 86, length 0
20:52:40.662303 IP 127.0.0.1.8080 > 127.0.0.1.31227: Flags [F.], seq 1, ack 106, win 86, length 0
20:52:40.662558 IP 127.0.0.1.31227 > 127.0.0.1.8080: Flags [F.], seq 106, ack 2, win 86, length 0
20:52:40.662605 IP 127.0.0.1.8080 > 127.0.0.1.31227: Flags [.], ack 107, win 86, length 0

  Here a puzzle appears: the Golang service only closed the connection after 5 seconds, yet we set WriteTimeout=3 seconds. Why didn't the connection break at the 3-second timeout instead of after 5 seconds?

What WriteTimeout actually does

  net/http.(c *conn).serve is the Golang SDK's main function for parsing and handling an HTTP request, and in it we find where WriteTimeout is applied: c.rwc.SetWriteDeadline. Here rwc is declared as net.Conn, an interface; the concrete object is returned by l.Accept(), where l is a TCPListener. Tracing downward, the object created is a net.TCPConn, which embeds net.conn, and net.conn implements the net.Conn interface (note the capitalization).

type TCPConn struct {
    conn
}

type conn struct {
    fd *netFD
}

func (c *conn) SetReadDeadline(t time.Time) error {
    if err := c.fd.SetReadDeadline(t); err != nil {
        ……
    }
    ……
}

  Tracing further down, the call chain is:

net.(c *conn).SetWriteDeadline()
    net.(fd *netFD).SetWriteDeadline()
        poll.(fd *FD).SetWriteDeadline()
            poll.setDeadlineImpl()
                poll.runtime_pollSetDeadline()

  At this point the trail seems to go cold: what does runtime_pollSetDeadline actually do? A global search for pollSetDeadline locates it in runtime/netpoll.go:

//go:linkname poll_runtime_pollSetDeadline internal/poll.runtime_pollSetDeadline
func poll_runtime_pollSetDeadline(pd *pollDesc, d int64, mode int) {
}

  pollDesc wraps a network descriptor; the fields that matter here are:

type pollDesc struct {
    fd uintptr
    rg uintptr // pdReady, pdWait, G waiting for read or nil
    rt timer   // read deadline timer (set if rt.f != nil)
    rd int64   // read deadline

    wg uintptr // pdReady, pdWait, G waiting for write or nil
    wt timer   // write deadline timer
    wd int64   // write deadline
}

  rt is the read-deadline timer on the current descriptor, and rd is the read deadline set on it. rg can take one of three values: 1) pdReady, meaning the descriptor is readable; 2) pdWait, meaning a goroutine is about to be parked, blocked on this descriptor; 3) G, a pointer to the blocked goroutine.

  poll_runtime_pollSetDeadline mainly adds timers according to the read/write deadlines, and installs the callback functions:

// write-deadline callback
pd.wt.f = netpollWriteDeadline
// argument passed to the callback when the timer fires
pd.wt.arg = pd

  We are nearly at the bottom now: we only have to see what the callback netpollWriteDeadline does (the read and write callbacks are both implemented by netpolldeadlineimpl) to understand what WriteTimeout really amounts to.

func netpolldeadlineimpl(pd *pollDesc, seq uintptr, read, write bool) {
    ……
    if write {
        pd.wd = -1
        atomic.StorepNoWB(unsafe.Pointer(&pd.wt.f), nil)
        wg = netpollunblock(pd, 'w', false)
    }
    if wg != nil {
        netpollgoready(wg, 0)
    }
}

  As you can see, it sets pd.wd = -1, which later marks the descriptor's write deadline as expired. netpollunblock and netpollgoready check whether a goroutine is blocked on this descriptor; if one is, it is marked runnable and put on the run queue; otherwise nothing happens.

  Now everything is clear: WriteTimeout merely installs read/write deadline timers, and when a timer fires it only flags the descriptor as timed out. So even if handling takes 5 seconds, far beyond the configured WriteTimeout, the request is never interrupted; it runs to completion as usual, and the error only appears when the result is written back to the client.

  (The original post illustrates this flow with a diagram, which is not reproduced here.)

  Finally, let's use dlv to print the call chain that installs the deadline (the backtrace happens to catch SetReadDeadline, set in the same readRequest path; the write-deadline chain is symmetric):

0  0x0000000001034c23 in internal/poll.runtime_pollSetDeadline
   at /usr/local/go/src/runtime/netpoll.go:224
1  0x00000000010ee3a0 in internal/poll.setDeadlineImpl
   at /usr/local/go/src/internal/poll/fd_poll_runtime.go:155
2  0x00000000010ee14a in internal/poll.(*FD).SetReadDeadline
   at /usr/local/go/src/internal/poll/fd_poll_runtime.go:132
3  0x00000000011cc868 in net.(*netFD).SetReadDeadline
   at /usr/local/go/src/net/fd_unix.go:276
4  0x00000000011dffca in net.(*conn).SetReadDeadline
   at /usr/local/go/src/net/net.go:251
5  0x000000000131491f in net/http.(*conn).readRequest
   at /usr/local/go/src/net/http/server.go:953
6  0x000000000131bd5a in net/http.(*conn).serve
   at /usr/local/go/src/net/http/server.go:1822

A small question to ponder: when a read/write deadline fires on the underlying descriptor, why only mark it instead of closing the descriptor outright?
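  Whatever the rationale, the "mark, don't close" behavior is easy to observe with a raw net.Conn. Below is a minimal sketch (my own, not from the original analysis), assuming the demo server above is listening on :8080: after the write deadline expires, the connection stays open, and only the next write fails.

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Assumes the demo server from the experiment above is running.
    conn, err := net.Dial("tcp", "127.0.0.1:8080")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Arm a 1-second write deadline, then deliberately let it expire.
    conn.SetWriteDeadline(time.Now().Add(time.Second))
    time.Sleep(2 * time.Second)

    // The connection itself is still open; only the write fails.
    _, err = conn.Write([]byte("GET /test HTTP/1.0\r\n\r\n"))
    fmt.Println(err) // e.g. "write tcp 127.0.0.1:...->127.0.0.1:8080: i/o timeout"
}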

This is where the 502 comes from

  This time let's keep it brief rather than tracing step by step. Setting the deadline above goes through poll.(fd *FD).SetWriteDeadline; FD is a further wrapper around the underlying descriptor's pollDesc, so it must also provide the read/write functions. We set a breakpoint at poll.(fd *FD).Write:

(dlv) b poll.Write
Breakpoint 1 set at 0x10ef40b for internal/poll.(*FD).Write() /usr/local/go/src/internal/poll/fd_unix.go:254
(dlv) bt
0  0x00000000010ef40b in internal/poll.(*FD).Write
   at /usr/local/go/src/internal/poll/fd_unix.go:254
1  0x00000000011cc0ac in net.(*netFD).Write
   at /usr/local/go/src/net/fd_unix.go:220
2  0x00000000011df765 in net.(*conn).Write
   at /usr/local/go/src/net/net.go:196
3  0x000000000132276c in net/http.checkConnErrorWriter.Write
   at /usr/local/go/src/net/http/server.go:3434
4  0x0000000001177e91 in bufio.(*Writer).Flush
   at /usr/local/go/src/bufio/bufio.go:591
5  0x000000000131a329 in net/http.(*response).finishRequest
   at /usr/local/go/src/net/http/server.go:1594
6  0x000000000131c7c5 in net/http.(*conn).serve
   at /usr/local/go/src/net/http/server.go:1900

  As the previous section explained, once request handling finishes, the upper layer still tries to write the response even though the deadline has passed; the write discovers that the pollDesc has timed out and returns an error, whereupon the upper layer closes the connection.

poll.(fd *FD).Write
    poll.(pd *pollDesc).prepareWrite
        poll.(pd *pollDesc).prepare
            poll.runtime_pollReset

  runtime_pollReset is likewise implemented in runtime/netpoll.go; the check itself is done by netpollcheckerr:

//go:linkname poll_runtime_pollReset internal/poll.runtime_pollReset
func poll_runtime_pollReset(pd *pollDesc, mode int) int {
}

func netpollcheckerr(pd *pollDesc, mode int32) int {
    if pd.closing {
        return 1 // ErrFileClosing or ErrNetClosing
    }
    if (mode == 'r' && pd.rd < 0) || (mode == 'w' && pd.wd < 0) {
        return 2 // ErrTimeout
    }
    ……
    return 0
}

  When the low-level write returns an error, http.(c *conn).serve returns immediately, and its deferred cleanup runs on the way out, closing the connection.

How should the timeout be controlled?

  Per the analysis above, controlling request timeouts with WriteTimeout has two problems: 1) when handling times out, Golang closes the connection and Nginx reports a 502; 2) the request is not interrupted when WriteTimeout fires; it has to run to completion, so the client waits a long time for a response, and under a flood of slow requests the Golang service's resource usage climbs sharply.

  So how do we handle timeouts gracefully? On timeout, Golang should end the current request at once and return a default result to the client. This can be built on context.Context, an interface with several implementations: cancelCtx, a cancelable context; timerCtx, a context that cancels on a timer; valueCtx, a context that carries key/value pairs. (A concrete HTTP handler sketch follows after the reference links below.)

  context.Context is defined as:

type Context interface {
    Deadline() (deadline time.Time, ok bool)
    Done() <-chan struct{}
    Err() error
    Value(key interface{}) interface{}
}
  • Deadline: returns the deadline at which work bound to this context should be canceled; returns ok=false if no deadline is set;
  • Done: returns a channel that is closed when work bound to this context is canceled; otherwise returns nil;
  • Err: once the channel returned by Done is closed, returns a non-nil value explaining why the work ended;
  • Value: returns the value stored in the context for the given key, or nil if there is no such key.

  Here is a small example:

package main

import (
    "context"
    "time"
)

func Proc(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        default:
            // do something
        }
    }
}

func main() {
    ctx := context.Background()
    ctx, cancel := context.WithCancel(ctx)
    go Proc(ctx)
    time.Sleep(time.Second)
    cancel()
}

  For more on context.Context, see these two articles:

  • An in-depth look at Golang's context (深入理解Golang之context): https://zhuanlan.zhihu.com/p/...
  • How Toutiao builds 100-billion-scale microservices with Go (今日头条Go建千亿级微服务的实践): https://36kr.com/p/1721518997505
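  Tying this back to the 502 problem: a handler can race the real work against a per-request deadline and answer the client either way, so the connection is never torn down by WriteTimeout. Below is a minimal sketch of that pattern; slowQuery is a hypothetical stand-in for the actual work. The standard library's http.TimeoutHandler packages essentially the same idea (replying 503 with a fixed body on timeout).

package main

import (
    "context"
    "net/http"
    "time"
)

// slowQuery is a hypothetical stand-in for the real request processing.
func slowQuery() string {
    time.Sleep(5 * time.Second)
    return "hello golang"
}

func handler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
    defer cancel()

    // Buffered so the worker goroutine can finish and be collected
    // even after we have already answered the client.
    done := make(chan string, 1)
    go func() { done <- slowQuery() }()

    select {
    case res := <-done:
        w.Write([]byte(res))
    case <-ctx.Done():
        // Timed out: end the request now and return a default result,
        // instead of letting WriteTimeout kill the connection later.
        w.WriteHeader(http.StatusGatewayTimeout)
        w.Write([]byte("default result"))
    }
}

func main() {
    http.HandleFunc("/test", handler)
    http.ListenAndServe("0.0.0.0:8080", nil)
}

  Note that the worker itself keeps running after the timeout; for true cancellation the work must also observe ctx, as in the Proc example above.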

A small extension

  We have now fully dissected the 502 caused by WriteTimeout. To finish, let's extend the analysis to how a goroutine is switched out when a read or write blocks.

for {
    n, err := syscall.Read(fd.Sysfd, p)
    if err != nil {
        n = 0
        if err == syscall.EAGAIN && fd.pd.pollable() {
            if err = fd.pd.waitRead(fd.isFile); err == nil {
                continue
            }
        }
        ……
    }
    ……
}

poll.(fd *FD).Read
    poll.(pd *pollDesc).waitRead
        poll.(pd *pollDesc).wait
            poll.runtime_pollWait

  runtime_pollWait is likewise implemented in runtime/netpoll.go; the goroutine is parked by netpollblock:

//go:linkname poll_runtime_pollWait internal/poll.runtime_pollWait
func poll_runtime_pollWait(pd *pollDesc, mode int) int {
}

func netpollblock(pd *pollDesc, mode int32, waitio bool) bool {
    ……
    if waitio || netpollcheckerr(pd, mode) == 0 {
        gopark(netpollblockcommit, unsafe.Pointer(gpp), waitReasonIOWait, traceEvGoBlockNet, 5)
    }
}

IdleTimeout + keep-alive connections

  The Golang SDK comment reads as follows:

  "IdleTimeout is the maximum amount of time to wait for the next request when keep-alives are enabled. If IdleTimeout is zero, the value of ReadTimeout is used. If both are zero, there is no timeout"。

  Evidently this setting only takes effect when the Nginx gateway and the upstream Golang service use keep-alive connections. Pay special attention to the fallback: if IdleTimeout is 0, the value of ReadTimeout is used instead.

  Note: when forwarding via proxy_pass, Nginx defaults to HTTP/1.0 with "Connection: close". The following configuration is therefore required before keep-alive connections can be used:

proxy_http_version 1.1;
proxy_set_header Connection "";
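  For reference, a complete proxy setup might look like the sketch below (the upstream name is hypothetical). Note the keepalive directive in the upstream block, which is also required before Nginx will pool idle connections at all:

upstream golang_backend {
    server 127.0.0.1:8080;
    keepalive 32;               # cache up to 32 idle connections per worker
}

server {
    listen 80;
    server_name test.xueersi.com;

    location /test {
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass http://golang_backend;
    }
}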

A small experiment

  The 502 in this scenario is probabilistic: it can only happen when the upstream Golang service closes the connection at almost exactly the moment a new request arrives. So here we settle for using tcpdump to watch the Golang server close the connection.

  Note: this scenario has one telltale signature: upstream_response_time is close to 0.
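  To make that signature visible, $upstream_response_time can be logged explicitly; a minimal sketch (the format name and log path are made up):

log_format upstream_time '$remote_addr [$time_local] "$request" $status '
                         'urt=$upstream_response_time rt=$request_time';
access_log /var/log/nginx/access.log upstream_time;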

  Remember to configure keep-alive between the Nginx gateway and the Golang service, and set IdleTimeout=5 seconds. Send a single request; tcpdump shows the following:

15:15:03.155362 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [S], seq 3293818352, win 43690, length 0
01:50:36.801131 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [S.], seq 901857004, ack 3293818353, win 43690, length 0
15:15:03.155385 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [.], ack 1, win 86, length 0
15:15:03.155406 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [P.], seq 1:135, ack 1, win 86, length 134: HTTP: GET /test HTTP/1.1
15:15:03.155411 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [.], ack 135, win 88, length 0
15:15:03.155886 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [P.], seq 1:130, ack 135, win 88, length 129: HTTP: HTTP/1.1 200 OK
15:15:03.155894 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [.], ack 130, win 88, length 0
// after the 5-second IdleTimeout expires, the Golang server actively closes the connection
15:15:08.156130 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [F.], seq 130, ack 135, win 88, length 0
15:15:08.156234 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [F.], seq 135, ack 131, win 88, length 0
15:15:08.156249 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [.], ack 136, win 88, length 0

  Whenever the Golang service closes the connection first, the Nginx gateway may produce a 502. How do we solve this? Simply guarantee that it is always Nginx that closes the connection first.

  Newer versions of the ngx_http_upstream_module added a configuration parameter for exactly this: keepalive_timeout. Note that this is not the directive from ngx_http_core_module; the two share a name but serve different purposes.

Syntax:  keepalive_timeout timeout;
Default: keepalive_timeout 60s;
Context: upstream

This directive appeared in version 1.15.3.
Sets a timeout during which an idle keepalive connection to an upstream server will stay open.

  Evidently, all we need is for the Golang service's IdleTimeout to be larger than this keepalive_timeout.
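  In Go terms that could look like the sketch below (values are illustrative): set IdleTimeout explicitly to something larger than the upstream keepalive_timeout, remembering that leaving it at 0 silently falls back to ReadTimeout, which is usually far too short for this purpose.

package main

import (
    "net/http"
    "time"
)

func main() {
    server := &http.Server{
        Addr: "0.0.0.0:8080",
        Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello golang"))
        }),
        ReadTimeout: 3 * time.Second,
        // Larger than Nginx's upstream keepalive_timeout (60s by default),
        // so that Nginx, not Golang, always closes idle connections first.
        // If left at 0, ReadTimeout (3s) would be used instead.
        IdleTimeout: 65 * time.Second,
    }
    server.ListenAndServe()
}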

Source analysis

  The whole flow lives in http.(c *conn).serve; the logic is as follows:

func (c *conn) serve(ctx context.Context) {
    defer func() {
        // close the connection before returning
        c.close()
    }()

    // processing loop
    for {
        // read the request
        w, err := c.readRequest(ctx)
        // handle the request
        serverHandler{c.server}.ServeHTTP(w, w.req)
        // write the response
        w.finishRequest()

        // if keep-alive is disabled, close the connection
        if !w.conn.server.doKeepAlives() {
            return
        }

        if d := c.server.idleTimeout(); d != 0 {
            // set the read deadline to idleTimeout
            c.rwc.SetReadDeadline(time.Now().Add(d))

            // block reading; once idleTimeout expires, this returns an error
            if _, err := c.bufr.Peek(4); err != nil {
                return
            }
        }

        // a new request has arrived; clear the read deadline
        c.rwc.SetReadDeadline(time.Time{})
    }
}

  As you can see, after the read deadline is set to idleTimeout, the server reads from the connection again, blocking the current goroutine; once the deadline expires, the read returns an error, serve returns, and the connection is closed on the way out.

panic

  A panic terminates the program abnormally, and with the Golang service gone, Nginx inevitably emits a burst of 502s. Accordingly, we usually catch the panic with recover and let the program keep running:

defer func() {
    if err := recover(); err != nil {
        // print the call stack (stack and logger come from the surrounding
        // gin-style recovery middleware)
        stack := stack(3)
        // dump the request details
        httprequest, _ := httputil.DumpRequest(c.Request, false)
        logger.E("[Recovery]", " %s panic recovered:\n%s\n%s\n%s", time.Now().Format("2006/01/02 - 15:04:05"), string(httprequest), err, stack)
        c.AbortWithStatus(500)
    }
}()

  Additionally, if an HTTP handler panics without its own recover, the Golang SDK supplies the recover logic, but it simply closes the connection, and that too makes Nginx return a 502. This kind of 502 is the easiest to diagnose: just search the logs for "http: panic serving".

func (c *conn) serve(ctx context.Context) {
    defer func() {
        if err := recover(); err != nil && err != ErrAbortHandler {
            const size = 64 << 10
            buf := make([]byte, size)
            buf = buf[:runtime.Stack(buf, false)]
            c.server.logf("http: panic serving %v: %v\n%s", c.remoteAddr, err, buf)
        }
        if !c.hijacked() {
            c.close()
            c.setState(c.rwc, StateClosed)
        }
    }()
}
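  To see this first-hand, here is a minimal sketch (my own, not from the original analysis): hitting /panic makes the server log "http: panic serving ..." and close that one connection, while the process keeps serving other requests; Nginx records a 502 for the failed one.

package main

import "net/http"

func main() {
    http.HandleFunc("/panic", func(w http.ResponseWriter, r *http.Request) {
        // No recover of our own here: net/http's deferred recover logs
        // "http: panic serving ..." and closes only this connection.
        panic("boom")
    })
    http.ListenAndServe("0.0.0.0:8080", nil)
}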