关于go:54-Golang实战Go服务502总结

生产环境 Golang 服务有时会产生 502 报警，排查发现大多是以下三种起因造成的：

http.Server 配置了 WriteTimeout，申请解决超时，Golang 服务断开连接；
http.Server 配置了 IdleTimeout，且网关和 Golang 之间应用长连贯，Golang 服务被动断开连接；
Golang 服务解决 HTTP 申请时呈现了 panic。

本篇文章次要介绍这三种 502 产生的具体起因，另外本篇文章提到的网关是基于 Nginx 搭建的。

WriteTimeout 的含意是：从接管到客户端申请开始，到返回响应的最大超时工夫，默认为 0 意味着不超时。正文如下：

type Server struct {
    // WriteTimeout is the maximum duration before timing out
    // writes of the response. It is reset whenever a new
    // request's header is read. Like ReadTimeout, it does not
    // let Handlers make decisions on a per-request basis.
    // A zero or negative value means there will be no timeout.
    WriteTimeout time.Duration
}

正文表明，当接管到新的申请时，超时工夫会重置。那么什么时候设置写超时工夫呢？当然是在 http.Server 读取客户端申请实现时了，参考 readRequest 办法：

func (c *conn) readRequest(ctx context.Context) (w *response, err error) {
    if d := c.server.WriteTimeout; d != 0 {
        //defer，也就是函数返回时设置定时器
        defer func() {c.rwc.SetWriteDeadline(time.Now().Add(d))
        }()}
}

可是，为什么 WriteTimeout 会导致 502 呢？如果你凑巧也遇到了之类问题，能够查看网关拜访日志，你会看到所有申请的响应 request_time 都大于 WriteTimeout，并且谬误日志显示的是 ”upstream prematurely close connection while reading response header from upstream”，也就是在网关等到上游响应时，上游敞开连贯了。剩下的问题就是为什么上游会敞开连贯了。

咱们先模仿以下申请解决超时的景象：

package main

import (
    "net/http"
    "time"
)

type GinHandler struct {
}

func (* GinHandler) ServeHTTP(w http.ResponseWriter, r *http.Request){time.Sleep(time.Duration(5) * time.Second)
    w.Write([]byte("hello golang"))

}


func main() {
    server := &http.Server{
        Addr:"0.0.0.0:8080",
        Handler: &GinHandler{},
        ReadTimeout: time.Second * 3,
        WriteTimeout: time.Second *3,
    }

    server.ListenAndServe()}

申请后果如下：

time  curl http://127.0.0.1/test -H "Host:test.xxx.com"

502 Bad Gateway

real    0m5.011s

查看 Nginx 的谬误日志，能够看到是上游被动敞开连贯造成的：

21:18:07 [error] 30217#0: *8105 upstream prematurely closed connection while reading response header from upstream, client: 127.0.0.1, server: test.xxx.com, request: "GET /test HTTP/1.1", upstream: "http://127.0.0.1:8080/test", host: "test.xxx.com"

通过 tcpdump 再抓包验证下；能够看到，35 秒承受到客户端申请，40 秒时，服务端被动断开了连贯。

20:52:35.660325 IP 127.0.0.1.31227 > 127.0.0.1.8080: Flags [S], seq 3442614457, win 43690, length 0
14:05:54.073175 IP 127.0.0.1.8080 > 127.0.0.1.31227: Flags [S.], seq 3655119156, ack 3442614458, win 43690, length 0
20:52:35.660353 IP 127.0.0.1.31227 > 127.0.0.1.8080: Flags [.], ack 1, win 86, length 0
20:52:35.660415 IP 127.0.0.1.31227 > 127.0.0.1.8080: Flags [P.], seq 1:106, ack 1, win 86, length 105: HTTP: GET /test HTTP/1.0
20:52:35.660425 IP 127.0.0.1.8080 > 127.0.0.1.31227: Flags [.], ack 106, win 86, length 0
20:52:40.662303 IP 127.0.0.1.8080 > 127.0.0.1.31227: Flags [F.], seq 1, ack 106, win 86, length 0
20:52:40.662558 IP 127.0.0.1.31227 > 127.0.0.1.8080: Flags [F.], seq 106, ack 2, win 86, length 0
20:52:40.662605 IP 127.0.0.1.8080 > 127.0.0.1.31227: Flags [.], ack 107, win 86, length 0

看到这里有个纳闷，5 秒后 Golang 服务才断开连接的，咱们不是设置了 WriteTimeout= 3 秒吗？为什么 3 秒超时一直开，而是 5 秒后才断开？

net.http.(c * conn).serve 办法是 Golang sdk 解析解决 HTTP 申请的主函数，咱们能够找到 WriteTimeout 的设置入口，c.rwc.SetWriteDeadline。这里的 rwc 类型为 net.Conn，是一个接口。真正的对象是由 l.Accept() 返回的，而 l 是对象 TCPListener。往下追踪能够发现创立的是 net.TCPConn 对象，而该对象继承了 net.conn，net.conn 又实现了 net.Conn 接口（留神大小写）。

type TCPConn struct {conn}

type conn struct {fd *netFD}

func (c *conn) SetReadDeadline(t time.Time) error {if err := c.fd.SetReadDeadline(t); err != nil {}}

再持续往下跟踪，调用链是这样的：

net.(c *conn).SetWriteDeadline()
    net.(fd *netFD).SetWriteDeadline()
        poll.(fd *FD).SetWriteDeadline()
            poll.setDeadlineImpl()
                poll.runtime_pollSetDeadline()

追到这，你会发现追不上来了，runtime_pollSetDeadline 的实现逻辑是什么呢？咱们全局搜素 runtime_pollSetDeadline，能够找到在 runtime/netpoll.go 文件。

//go:linkname poll_runtime_pollSetDeadline internal/poll.runtime_pollSetDeadline
func poll_runtime_pollSetDeadline(pd *pollDesc, d int64, mode int) {

    // 写超时回调函数
    pd.wt.f = netpollWriteDeadline
    // 定时器到期后，调用回调函数的传入参数
    pd.wt.arg = pd
}

咱们的问题行将上不着天; 下不着地，只须要剖析回调函数 netpollWriteDeadline（对立由 netpolldeadlineimpl 实现）做了什么，即可明确 WriteTimeout 到底是什么。

func netpolldeadlineimpl(pd *pollDesc, seq uintptr, read, write bool) {
    if write {
        pd.wd = -1
        atomic.StorepNoWB(unsafe.Pointer(&pd.wt.f), nil) 
        wg = netpollunblock(pd, 'w', false)
    }
    
    if wg != nil {netpollgoready(wg, 0)
    }
}

能够看到，设置 pd.wd=-1，后续以此判断该描述符上的写定时器是否曾经过期。netpollunblock 与 netpollgoready，用于判断是否有协程因为以后描述符阻塞，如果有将其状态置为可运行，并退出可运行队列；否则什么都不做。

这下所有都明确了，WriteTimeout 只是增加了读写超时定时器，待定时器过期时，也仅仅是设置该描述符读写超时。所以哪怕申请解决工夫是 5 秒远超过 WriteTimeout 设置的超时工夫，该申请仍然不会被中断，会失常执行直到完结，在向客户端返回后果时才会产生异样。

该过程参照下图：

最初咱们通过 dlv 调试，打印一下设置 WriteTimeout 的函数调用链。

0  0x0000000001034c23 in internal/poll.runtime_pollSetDeadline
   at /usr/local/go/src/runtime/netpoll.go:224
1  0x00000000010ee3a0 in internal/poll.setDeadlineImpl
   at /usr/local/go/src/internal/poll/fd_poll_runtime.go:155
2  0x00000000010ee14a in internal/poll.(*FD).SetReadDeadline
   at /usr/local/go/src/internal/poll/fd_poll_runtime.go:132
3  0x00000000011cc868 in net.(*netFD).SetReadDeadline
   at /usr/local/go/src/net/fd_unix.go:276
4  0x00000000011dffca in net.(*conn).SetReadDeadline
   at /usr/local/go/src/net/net.go:251
5  0x000000000131491f in net/http.(*conn).readRequest
   at /usr/local/go/src/net/http/server.go:953
6  0x000000000131bd5a in net/http.(*conn).serve
   at /usr/local/go/src/net/http/server.go:1822

方才设置超时工夫会调用 poll.(fd FD).SetWriteDeadline，FD 构造封装了底层描述符，其必然提供读写处理函数。咱们在 poll.(fd FD).Write 打断点，看一下调用链路：

(dlv) b poll.Write
Breakpoint 1 set at 0x10ef40b for internal/poll.(*FD).Write() /usr/local/go/src/internal/poll/fd_unix.go:254

(dlv) bt
0  0x00000000010ef40b in internal/poll.(*FD).Write
   at /usr/local/go/src/internal/poll/fd_unix.go:254
1  0x00000000011cc0ac in net.(*netFD).Write
   at /usr/local/go/src/net/fd_unix.go:220
2  0x00000000011df765 in net.(*conn).Write
   at /usr/local/go/src/net/net.go:196
3  0x000000000132276c in net/http.checkConnErrorWriter.Write
   at /usr/local/go/src/net/http/server.go:3434
4  0x0000000001177e91 in bufio.(*Writer).Flush
   at /usr/local/go/src/bufio/bufio.go:591
5  0x000000000131a329 in net/http.(*response).finishRequest
   at /usr/local/go/src/net/http/server.go:1594
6  0x000000000131c7c5 in net/http.(*conn).serve
   at /usr/local/go/src/net/http/server.go:1900

与咱们构想的统一，待申请解决实现后，哪怕超时了，下层业务仍然会尝试写响应后果。持续往下追究，看看是什么时候判断描述符超时了：

poll.(fd *FD).Write
    poll.(pd *pollDesc).prepareWrite
        poll.(pd *pollDesc).prepare
            poll.runtime_pollReset

//go:linkname poll_runtime_pollReset internal/poll.runtime_pollReset
func poll_runtime_pollReset(pd *pollDesc, mode int) int {errcode := netpollcheckerr(pd, int32(mode))
    if errcode != pollNoError {return errcode}
}

func netpollcheckerr(pd *pollDesc, mode int32) int {
    if pd.closing {return 1 // ErrFileClosing or ErrNetClosing}
    if (mode == 'r' && pd.rd < 0) || (mode == 'w' && pd.wd < 0) {return 2 // ErrTimeout}
    ……
    return 0
}

函数 runtime_pollReset 的实现逻辑同样是在 runtime/netpoll.go 文件。其通过 netpollcheckerr 实现校验逻辑，判断描述符是否敞开，是否超时等等，这里返回了超时谬误。最终，最顶层 http.(c * conn).serve 办法判断写响应出错，间接返回，返回前执行 defer 做收尾工作，比方敞开连贯，由此引发网关返回 502 谬误。

参照下面的剖析，通过 WriteTimeout 管制申请的超时工夫，存在两个问题：1）申请解决超时后 Golang 会断开连接，网关会呈现 502；2）申请不会因为 WriteTimeout 超时而被中断，须要等到申请真正解决实现，客户端期待响应工夫较长；在遇到大量慢申请时，Golang 服务资源占用会急剧减少。

那么如何优雅的管制超时状况呢？超时后 Golang 间接完结以后申请，并向客户端返回默认后果。能够利用 context.Context 实现，这是一个接口，有多种实现，如 cancelCtx 可勾销的上下文，timerCtx 可定时勾销的上下文，valueCtx 可基于上下文传递变量。

上面举一个小小的例子：

ctx := context.Background()
ctx, cancel := context.WithCancel(ctx)

go Proc()

time.Sleep(time.Second)
cancel()


func Proc(ctx context.Context)  {
    for {
        select {case <- ctx.Done():
            return 
        default:
            //do something
        }
    }
}

对于 context 更多应用与原理，能够参考之前的文章。

Golang sdk 的正文阐明如下：

”IdleTimeout is the maximum amount of time to wait for the next request when keep-alives are enabled. If IdleTimeout is zero, the value of ReadTimeout is used. If both are zero, there is no timeout”。

能够看到只有当网关和上游 Golang 服务之间采纳 keep-alived 长连贯时，该配置才失效。须要特地留神的是，如果 IdleTimeout 等于 0，则默认取 ReadTimeout 的值。

留神：Nginx 在通过 proxy_pass 配置转发时，默认采纳 HTTP 1.0，且 ”Connection:close”。因而须要增加如下配置能力应用长连贯：

proxy_http_version 1.1;
proxy_set_header Connection "";

该场景下的 502 存在概率问题，只有当上游 Golang 服务敞开连贯，与新的申请达到，简直同时产生时，才有可能呈现。因而最终只是通过 tcpdump 抓包察看 Golang 服务端敞开连贯。

这种场景产生的根本原因是，网关转发申请到上游的同时，上游服务敞开了长连贯，此时 TCP 档次会返回 RST 包。所以这种 502 有一个显著的特色，upstream_response_time 靠近于 0，并且查问网关谬误日志能够看到 ”connection reset by peer”。

记得配置网关 Nginx 到 Golang 服务之间采纳长连贯，同时配置 IdleTimeout= 5 秒。只申请一次，tcpdump 抓包察看景象如下：

15:15:03.155362 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [S], seq 3293818352, win 43690, length 0
01:50:36.801131 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [S.], seq 901857004, ack 3293818353, win 43690, length 0
15:15:03.155385 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [.], ack 1, win 86, length 0
15:15:03.155406 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [P.], seq 1:135, ack 1, win 86, length 134: HTTP: GET /test HTTP/1.1
15:15:03.155411 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [.], ack 135, win 88, length 0
15:15:03.155886 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [P.], seq 1:130, ack 135, win 88, length 129: HTTP: HTTP/1.1 200 OK
15:15:03.155894 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [.], ack 130, win 88, length 0

//IdleTimeout 5 秒超时后，Golang 服务被动断开连接
15:15:08.156130 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [F.], seq 130, ack 135, win 88, length 0
15:15:08.156234 IP 127.0.0.1.20775 > 127.0.0.1.8080: Flags [F.], seq 135, ack 131, win 88, length 0
15:15:08.156249 IP 127.0.0.1.8080 > 127.0.0.1.20775: Flags [.], ack 136, win 88, length 0

Golang 服务被动断开连接，网关就有可能产生 502。怎么解决这个问题呢？只有保障每次都是网关被动断开连接即可。

ngx_http_upstream_module 模块在新版本增加了一个配置参数，keepalive_timeout。留神这不是 ngx_http_core_module 模块里的，两个参数名称一样，然而用处不一样。

Syntax:    keepalive_timeout timeout;
Default:    
keepalive_timeout 60s;
Context:    upstream
This directive appeared in version 1.15.3.

Sets a timeout during which an idle keepalive connection to an upstream server will stay open.

显然只有 Golang 服务配置的 IdleTimeout 大于这里的 keepalive_timeout 即可。

整个处理过程在 http.(c * conn).serve 办法，逻辑如下：

func (c *conn) serve(ctx context.Context) {defer func() {
        // 函数返回前敞开连贯
        c.close()}()
    
    // 循环解决
    for {
        // 读申请
        w, err := c.readRequest(ctx)
        // 解决申请
        serverHandler{c.server}.ServeHTTP(w, w.req)
        // 响应
        w.finishRequest()
        
        // 没有开启 keepalive，则敞开连贯
        if !w.conn.server.doKeepAlives() {return}
        
        if d := c.server.idleTimeout(); d != 0 {
            // 设置读超时工夫为 idleTimeout
            c.rwc.SetReadDeadline(time.Now().Add(d))
            
            // 阻塞读，待 idleTimeout 超时后，返回谬误
            if _, err := c.bufr.Peek(4); err != nil {return}
        }
    
        // 接管到新的申请，清空 ReadTimeout
        c.rwc.SetReadDeadline(time.Time{})
    }

}

能够看到在设置读超时工夫为 idleTimeout 后，会再次从连贯读取数据，此时会阻塞以后协程；待超时后，读取操作会返回谬误，此时 serve 办法返回，而在返回前会敞开该连贯。

Golang 服务级的 panic 会导致程序的异样终止，Golang 服务都终止了，网关必然会产生刹时大量 502 了（上游服务挂了，网关会呈现大量谬误日志 ”connection refused by peer”）。所以对应的，通常咱们会通过 recover 捕捉异样，恢复程序的执行。

defer func() {if err := recover(); err != nil {
        // 打印调用栈
        stack := stack(3)
        // 记录申请详情
        httprequest, _ := httputil.DumpRequest(c.Request, false)
        logger.E("[Recovery]", "%s panic recovered:\n%s\n%s\n%s", time.Now().Format("2006/01/02 - 15:04:05"), string(httprequest), err, stack)
        c.AbortWithStatus(500)
    }
}()

另外，对于 HTTP 解决 handler 来说，如果呈现 panic 且没有 recover 时，Golang sdk 提供了 recover 逻辑，只是比拟粗犷，简略的敞开连贯罢了，而这一解决同样会导致 Nginx 返回 502。这种 502 最容易排查了，只须要查找有无 ”http: panic serving” 日志即可。

func (c *conn) serve(ctx context.Context) {defer func() {if err := recover(); err != nil && err != ErrAbortHandler {
            const size = 64 << 10
            buf := make([]byte, size)
            buf = buf[:runtime.Stack(buf, false)]
            c.server.logf("http: panic serving %v: %v\n%s", c.remoteAddr, err, buf)
        }
        if !c.hijacked() {c.close()
            c.setState(c.rwc, StateClosed)
        }
    }()}

这种类型的 502 特色也比拟显著，网关拜访日志 request_time 个别比拟短，比方几十毫秒等，与你申请的响应工夫大体一致，而且网关谬误日志是 ”upstream prematurely closed connection while reading response header from upstream”。

本篇文章次要介绍了 Golang 服务常见的几种 502 谬误，把握 502 产生的实质，以及每种 502 的景象，如响应工夫，谬误日志等等，再遇到这种问题置信你很快就能解决了。

关于go:54-Golang实战Go服务502总结

问题引入

WriteTimeout

一个小小的试验

WriteTimeout 到底做了什么

超时怎么管制？

IdleTimeout + 长连贯

一个小小的试验

源码剖析

panic

总结

Just My Socks（注册教程内含优惠码）

关于go:54-Golang实战Go服务502总结

问题引入

WriteTimeout

一个小小的试验

WriteTimeout 到底做了什么

超时怎么管制？

IdleTimeout + 长连贯

一个小小的试验

源码剖析

panic

总结

Just My Socks（注册教程 内含优惠码）

Just My Socks（注册教程内含优惠码）