A Kind Breakup: A First Look at Fast Recovery from Node Failure in an Istio Mesh

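Starting the test
Finding the root cause
- TCP half-open
- keepalive
- Retransmission timeout
- Zero window timeout
- Timeout settings at the application socket layer
  - TCP_USER_TIMEOUT
  - SO_RCVTIMEO / SO_SNDTIMEO
  - poll timeout
- Root cause summary
Why bother digging this deep
Keepalive checks for idle connections
- As upstream (server side)
- As downstream (client side)
TCP_USER_TIMEOUT
Envoy application-layer health checking
- Health checking and connection pools
- Health checking and endpoint discovery
- Active health checking: Health checking
- Passive health checking: Outlier detection
- Health checking vs. EDS: which one wins?
Envoy application-layer timeouts
- Envoy connection-level timeouts
- Envoy request-level timeouts
  - Request read timeout for downstream (client)
  - Response wait timeout for upstream (server)
- When talking about timeouts, don't forget the effect of retries
Open questions
A brief summary
Main references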


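Recently I needed to run some HA / chaos testing on an environment made up of a k8s cluster, VIP load balancing, and Istio. As shown in the figure below, the goal was to see how an abnormal shutdown or a network partition of worker node B affects external users (clients):

- Impact on the request success rate
- Impact on performance (TPS / response time)

A few notes on the figure above: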

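- TCP/IP-layer load balancing for the externally facing VIP (virtual IP) is done with the Modulo-N algorithm of ECMP (Equal-Cost Multi-Path). In essence it distributes external traffic by the TCP connection's 5-tuple (protocol, srcIP, srcPort, dstIP, dstPort). Note that this load-balancing algorithm is stateless: when the number of targets changes, its result changes too. In other words, it is an unstable algorithm.
- TCP traffic whose dstIP is the VIP, once it reaches a worker node, goes through stateful DNAT performed by ipvs/conntrack rules: the dstIP is mapped and translated to the address of one of the Istio Gateway pods. Note that this load-balancing algorithm is stateful: when the number of targets changes, the result for existing connections does not change. In other words, it is a stable algorithm.
- The Istio Gateway pod also load-balances HTTP/TCP traffic. The difference between the two protocols:
  - For HTTP: multiple requests on one connection from the same downstream (traffic sender) may be load-balanced to different upstreams (traffic targets).
  - For TCP: multiple packets on one connection from the same downstream (traffic sender) are load-balanced to the same upstream (traffic target).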
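To make the "unstable" property of modulo-N selection concrete, here is a minimal sketch. It is not what a real router runs: SHA-256 stands in for the hardware hash, and the node names, IP addresses, and ports are made up for illustration.

```python
import hashlib

def pick_node(flow, nodes):
    """Stateless modulo-N selection: hash the 5-tuple, then take it modulo the node count."""
    key = "|".join(str(field) for field in flow).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return nodes[digest % len(nodes)]

def remapped_fraction(flows, before, after):
    """Fraction of flows whose chosen node differs between two node lists."""
    moved = sum(1 for f in flows if pick_node(f, before) != pick_node(f, after))
    return moved / len(flows)

# Hypothetical flows: (protocol, srcIP, srcPort, dstIP, dstPort)
flows = [("tcp", "198.51.100.7", 40000 + i, "203.0.113.10", 443) for i in range(1000)]

three_nodes = ["node-a", "node-b", "node-c"]
two_nodes = ["node-a", "node-c"]  # node-b has been powered off

fraction = remapped_fraction(flows, three_nodes, two_nodes)
print(f"{fraction:.0%} of flows now land on a different node")
```

The point of the sketch is that flows which were never on node-b can also be moved to a different node, where no conntrack/DNAT state exists for them.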

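The chaos test consisted of abruptly powering off worker node B. From the figure above, one can infer that the connections drawn as red and green lines are all directly affected. From the client's point of view, the impact was:

- The request success rate dropped by only 0.01%
- TPS dropped by half and only recovered after half an hour
- Avg Response Time (average response time) was basically unchanged

Note that no single worker node's resources were the performance bottleneck in this test. So where was the problem?

The client was a JMeter program. A close look at its test report showed that after the worker node shut down, Avg Response Time indeed changed little, but the P99 and MAX response times became abnormally large. Clearly the average hides a lot: the test threads were most likely blocked somewhere, and that is what caused the TPS drop.

After some fiddling, I changed the timeout of the external client (JMeter) to 6s and the problem went away: TPS recovered quickly after the worker node was shut down.

With the external client fixed, I could have called it a day. But being someone who likes to tinker, I wanted to find the root cause, and to know whether this was a genuinely fast recovery or whether some hazard was still hidden.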

Before starting, one concept:

📖 TCP half-open

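According to RFC 793, a TCP connection is called half-open when the host at one end has crashed, or has otherwise removed the socket without notifying the other end. If the remaining end is idle (that is, it sends no data or keepalive), the connection may stay half-open for an unbounded period of time.

After worker node B shuts down, from the external client's point of view (see the figure above), its TCP connections to worker node B can be in one of two states:

- The client kernel needs to send a packet to the peer, either because it is sending (or retransmitting) data or because the idle time has reached the keepalive time. Worker node A receives this packet, and since it does not belong to a valid TCP connection there, one of the following may happen:
  - Node A replies with a TCP RST. The client closes the connection on receiving it. The client thread blocked on the socket returns because the connection was closed, carries on, and closes the socket.
  - The DNAT mapping table has no matching connection, so the packet is simply dropped with no reply. The client thread blocked on the socket stays blocked. This is a TCP half-open connection.
- The client connection has keepalive disabled, or has not been idle long enough to reach the keepalive time, and the kernel has no data to send (or retransmit). The client thread blocks in a socket read. This too is a TCP half-open connection.

As you can see, in most cases the client needs some time to discover that a connection is dead. In the worst case, such as when keepalive is not enabled, it may never discover the TCP half-open connection.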

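From [TCP/IP Illustrated Volume 1]:

A keepalive probe is an empty (or 1-byte) segment whose sequence number is one less than the largest ACK number seen from the peer so far. Because that sequence number has already been received by the peer, receiving the segment again has no side effect, but it elicits an ACK from the peer, which is used to determine whether the peer is alive. Neither the probe segment nor its ACK contains any new data.

If a probe segment is lost, TCP does not retransmit it. [RFC1122] specifies that, because of this, a single unacknowledged keepalive probe should not be taken as sufficient evidence that the peer is dead. Multiple probes at intervals are needed.

Keepalive is enabled when SO_KEEPALIVE is set on the socket.

For TCP connections with keepalive enabled, Linux has the following global defaults: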

https://www.kernel.org/doc/ht…

tcp_keepalive_time – INTEGER
How often TCP sends out keepalive messages when keepalive is enabled. Default: 2 hours.

tcp_keepalive_probes – INTEGER
How many keepalive probes TCP sends out, until it decides that the connection is broken. Default value: 9.

tcp_keepalive_intvl – INTEGER
How frequently the probes are send out. Multiplied by tcp_keepalive_probes it is time to kill not responding connection, after probes started. Default value: 75 sec i.e. connection will be aborted after ~11 minutes of retries.

Linux also provides options that can be set independently on each socket:

https://man7.org/linux/man-pa…

       TCP_KEEPCNT (since Linux 2.4)
              The maximum number of keepalive probes TCP should send
              before dropping the connection.  This option should not be
              used in code intended to be portable.

       TCP_KEEPIDLE (since Linux 2.4)
              The time (in seconds) the connection needs to remain idle
              before TCP starts sending keepalive probes, if the socket
              option SO_KEEPALIVE has been set on this socket.  This
              option should not be used in code intended to be portable.

       TCP_KEEPINTVL (since Linux 2.4)
              The time (in seconds) between individual keepalive probes.
              This option should not be used in code intended to be
              portable.

From this we can compute, with the defaults, the shortest time before keepalive closes an idle connection:

TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT = 2*60*60 + 75*9 = 2 hours + 11 minutes
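The same arithmetic as a small sketch, using the Linux defaults quoted above. The function name is mine, not from any library.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Idle time before the first probe plus the time spent on all unanswered probes."""
    return idle + interval * probes

# Linux defaults: tcp_keepalive_time=7200s, tcp_keepalive_intvl=75s, tcp_keepalive_probes=9
default_seconds = keepalive_detection_seconds(idle=7200, interval=75, probes=9)
hours, remainder = divmod(default_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{default_seconds}s = {hours}h {minutes}m {seconds}s")
```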

https://www.kernel.org/doc/Do…

- tcp_retries2 - INTEGER

This value influences the timeout of an alive TCP connection, when RTO retransmissions remain unacknowledged. Given a value of N, a hypothetical TCP connection following exponential backoff with an initial RTO of TCP_RTO_MIN would retransmit N times before killing the connection at the (N+1)th RTO. The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout. RFC 1122 recommends at least 100 seconds for the timeout, which corresponds to a value of at least 8.

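The setting above controls how many exponentially backed-off retransmissions are attempted before the kernel closes the connection. The default is 15, which works out to about 924s, roughly 15 minutes.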
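The 924.6-second figure can be reproduced with a short sketch. The two constants are assumptions on my part, not taken from the quoted documentation: TCP_RTO_MIN as 200 ms and TCP_RTO_MAX as 120 s, the usual Linux values.

```python
TCP_RTO_MIN_MS = 200       # assumed Linux value of TCP_RTO_MIN
TCP_RTO_MAX_MS = 120_000   # assumed Linux value of TCP_RTO_MAX

def hypothetical_timeout_ms(retries2):
    """Sum N+1 RTOs that start at TCP_RTO_MIN and double each time, capped at TCP_RTO_MAX."""
    total = 0
    rto = TCP_RTO_MIN_MS
    for _ in range(retries2 + 1):  # the connection is killed at the (N+1)th RTO
        total += rto
        rto = min(rto * 2, TCP_RTO_MAX_MS)
    return total

print(hypothetical_timeout_ms(15) / 1000)  # default tcp_retries2
print(hypothetical_timeout_ms(8) / 1000)   # the minimum that RFC 1122 implies
```

As the kernel documentation says, this is a lower bound: the real RTO depends on the measured RTT, so the effective timeout can be longer.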

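When the peer advertises a window size of zero, it means the peer's TCP receive buffer is full and it cannot accept more data. This may be because the peer is short on resources and processes data too slowly, so its TCP receive buffer eventually fills up.

In theory, once the peer has processed the data piled up in its receive window, it sends an ACK to announce that the window has opened. But for various reasons, this ACK is sometimes lost.

So a sender with unsent data needs to probe the window size periodically. The sender takes the first byte of the unsent buffer and sends it as a probe. If, after a certain number of probes, the peer does not respond or keeps advertising a zero window, the connection is closed automatically. On Linux the default is 15 probes, configured by tcp_retries2. The probe retry mechanism is similar to TCP retransmission.

Reference: https://blog.cloudflare.com/w…,Zero%20window,-ESTAB%20is…%20forever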

man tcp

       TCP_USER_TIMEOUT (since Linux 2.6.37)
              This option takes an unsigned int as an argument.  When
              the value is greater than 0, it specifies the maximum
              amount of time in milliseconds that transmitted data may
              remain unacknowledged, or bufferred data may remain
              untransmitted (due to zero window size) before TCP will
              forcibly close the corresponding connection and return
              ETIMEDOUT to the application.  If the option value is
              specified as 0, TCP will use the system default.

              Increasing user timeouts allows a TCP connection to
              survive extended periods without end-to-end connectivity.
              Decreasing user timeouts allows applications to "fail
              fast", if so desired.  Otherwise, failure may take up to
              20 minutes with the current system defaults in a normal
              WAN environment.

              This option can be set during any state of a TCP
              connection, but is effective only during the synchronized
              states of a connection (ESTABLISHED, FIN-WAIT-1, FIN-
              WAIT-2, CLOSE-WAIT, CLOSING, and LAST-ACK).  Moreover,
              when used with the TCP keepalive (SO_KEEPALIVE) option,
              TCP_USER_TIMEOUT will override keepalive to determine when
              to close a connection due to keepalive failure.

              The option has no effect on when TCP retransmits a packet,
              nor when a keepalive probe is sent.

              This option, like many others, will be inherited by the
              socket returned by accept(2), if it was set on the
              listening socket.

              Further details on the user timeout feature can be found
              in RFC 793 and RFC 5482 ("TCP User Timeout Option").

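In other words, it specifies how long sent data may go unacknowledged (no ACK received), or how long the peer's receive window may stay at 0, before the kernel closes the connection and returns an error to the application.

Note that TCP_USER_TIMEOUT affects how the keepalive TCP_KEEPCNT setting behaves: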

https://blog.cloudflare.com/w…

With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:
TCP_USER_TIMEOUT < TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT

https://man7.org/linux/man-pa…

       SO_RCVTIMEO and SO_SNDTIMEO
              Specify the receiving or sending timeouts until reporting
              an error.  The argument is a struct timeval.  If an input
              or output function blocks for this period of time, and
              data has been sent or received, the return value of that
              function will be the amount of data transferred; if no
              data has been transferred and the timeout has been
              reached, then -1 is returned with errno set to EAGAIN or
              EWOULDBLOCK, or EINPROGRESS (for connect(2)) just as if
              the socket was specified to be nonblocking.  If the
              timeout is set to zero (the default), then the operation
              will never timeout.  Timeouts only have effect for system
              calls that perform socket I/O (e.g., read(2), recvmsg(2),
              send(2), sendmsg(2)); timeouts have no effect for
              select(2), poll(2), epoll_wait(2), and so on.

Note that in this case our client is JMeter, which is implemented in Java and uses the socket.setSoTimeout method to set the timeout. According to:

https://stackoverflow.com/que…

and the source code I read, the Linux implementation appears to use the timeout parameter of select/poll, described next, rather than the socket options above.

https://github.com/openjdk/jd…

When JMeter catches a SocketTimeoutException, it actively closes the socket and reconnects, so the dead-socket problem is handled at the application layer.
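The same pattern can be sketched in Python rather than Java. This is not JMeter's code, only an analogue: a read timeout on the socket, and on expiry the socket is closed so the caller can reconnect. A local listener that accepts but never replies stands in for the dead peer, and the 0.2s timeout is arbitrary.

```python
import socket

def read_with_timeout(address, timeout_seconds):
    """Return the received bytes, or None if the peer stayed silent past the timeout."""
    with socket.create_connection(address, timeout=timeout_seconds) as conn:
        try:
            return conn.recv(4096)
        except socket.timeout:
            # The peer may be dead (half-open). Leaving the `with` block closes
            # the socket, so the caller can reconnect instead of blocking forever.
            return None

# A local listener that completes the handshake but never sends anything.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

result = read_with_timeout(listener.getsockname(), timeout_seconds=0.2)
print("timed out" if result is None else result)
listener.close()
```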

https://man7.org/linux/man-pa…

int poll(struct pollfd *fds, nfds_t nfds, int timeout);
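A minimal sketch of waiting with a poll timeout, using Python's wrapper around poll(2). The socket pair is a local stand-in for a connection whose peer has gone silent, and the 200 ms value is arbitrary.

```python
import select
import socket

# A connected pair of local sockets. Nothing is ever written to `peer`,
# so `local` behaves like the readable end of a silent connection.
local, peer = socket.socketpair()

poller = select.poll()
poller.register(local, select.POLLIN)

events = poller.poll(200)  # timeout in milliseconds, as in poll(2)
if not events:
    print("no data within 200 ms; treating the connection as suspect")

local.close()
peer.close()
```

An empty event list only says that nothing arrived in time. Deciding to close and reconnect is still up to the application.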

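Reference: https://blog.cloudflare.com/w…

To make sure timeouts are detected reasonably quickly in every connection state:

- Enable TCP keepalive and configure sensible times. This is needed to keep some data flowing when the connection is idle.
- Set TCP_USER_TIMEOUT to TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT.
- Use read/write timeouts at the application layer, and have the application close the connection when they fire. (This is what happened in this article.)

Why do we still need TCP_USER_TIMEOUT when we already have TCP keepalive? Because if a network partition occurs, a connection that is in the retransmission state does not trigger keepalive probes. I recorded how this works in the figure below: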
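The first two recommendations can be sketched on a plain socket as follows. This assumes Linux. The helper name and the 7s / 5s / 2 values are illustrative choices of mine, and the numeric fallback 18 is the Linux option number for TCP_USER_TIMEOUT.

```python
import socket

def enable_fast_failure_detection(sock, idle=7, interval=5, probes=2):
    """Enable keepalive and align TCP_USER_TIMEOUT with the keepalive budget (Linux only)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    # TCP_USER_TIMEOUT takes milliseconds. 18 is its Linux option number,
    # used as a fallback if this Python build does not export the constant.
    user_timeout_opt = getattr(socket, "TCP_USER_TIMEOUT", 18)
    user_timeout_ms = (idle + interval * probes) * 1000
    sock.setsockopt(socket.IPPROTO_TCP, user_timeout_opt, user_timeout_ms)
    return user_timeout_ms

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
timeout_ms = enable_fast_failure_detection(sock)
print(timeout_ms)
sock.close()
```

The third recommendation, an application-layer read/write timeout, still has to be handled by the application itself.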

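🤔 ❓ At this point some readers will ask: in the end, all you did this time was adjust an application-layer read timeout. Why study and fuss over all the rest?

Let's go back to the original goal in the figure below and check whether every hazard has been dealt with:

Clearly, only the red line from the External Client to k8s worker node B has been dealt with. The other red and green lines have not been investigated. Are these TCP half-open connections closed quickly by TCP keepalive, the TCP retransmit timeout, or the application (Envoy) layer timeout mechanisms? Or do they go undetected for a long time and get closed late, or even leak (connection leak)?

As the following shows, the Istio gateway does not enable keepalive by default: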

$ kubectl exec -it $ISTIO_GATEWAY_POD -- ss -oipn 'sport 15001 or sport 15001 or sport 8080 or sport 8443'                                                         
Netid               State                Recv-Q                Send-Q                               Local Address:Port                               Peer Address:Port                
tcp                 ESTAB                0                     0                                    192.222.46.71:8080                                10.111.10.101:51092                users:(("envoy",pid=45,fd=665))
         sack cubic wscale:11,11 rto:200 rtt:0.064/0.032 mss:8960 pmtu:9000 rcvmss:536 advmss:8960 cwnd:10 segs_in:2 send 11200000000bps lastsnd:31580 lastrcv:31580 lastack:31580 pacing_rate 22400000000bps delivered:1 rcv_space:62720 rcv_ssthresh:56576 minrtt:0.064

In that case, keepalive can be added with an EnvoyFilter:

References:

https://support.f5.com/csp/ar…

https://www.envoyproxy.io/doc…

https://github.com/istio/isti…

https://istio-operation-bible…

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingress-gateway-socket-options
  namespace: istio-system
spec:
  configPatches:
  - applyTo: LISTENER
    match:
      context: GATEWAY
      listener:
        name: 0.0.0.0_8080
        portNumber: 8080
    patch:
      operation: MERGE
      value:
        socket_options:
        - description: enable keep-alive
          int_value: 1
          level: 1 # SOL_SOCKET
          name: 9 # SO_KEEPALIVE
          state: STATE_PREBIND
        - description: idle time before first keep-alive probe is sent
          int_value: 7
          level: 6 # IPPROTO_TCP
          name: 4 # TCP_KEEPIDLE
          state: STATE_PREBIND
        - description: keep-alive interval
          int_value: 5
          level: 6 # IPPROTO_TCP
          name: 5 # TCP_KEEPINTVL
          state: STATE_PREBIND
        - description: keep-alive probes count
          int_value: 2
          level: 6 # IPPROTO_TCP
          name: 6 # TCP_KEEPCNT
          state: STATE_PREBIND
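The bare numbers in socket_options are hard to read, so here is a small sketch that decodes them and works out how long an idle dead peer would take to be detected with this filter. The lookup table hard-codes Linux option numbers. They are platform specific, because Envoy hands the raw integers to setsockopt(2).

```python
# Linux option numbers as defined in the kernel headers, keyed by (level, name).
LINUX_SOCKET_OPTIONS = {
    (1, 9): "SOL_SOCKET / SO_KEEPALIVE",
    (6, 4): "IPPROTO_TCP / TCP_KEEPIDLE",
    (6, 5): "IPPROTO_TCP / TCP_KEEPINTVL",
    (6, 6): "IPPROTO_TCP / TCP_KEEPCNT",
    (6, 18): "IPPROTO_TCP / TCP_USER_TIMEOUT",
}

# The (level, name, int_value) triples from the EnvoyFilter above.
filter_options = [(1, 9, 1), (6, 4, 7), (6, 5, 5), (6, 6, 2)]

decoded = {LINUX_SOCKET_OPTIONS[(level, name)]: value for level, name, value in filter_options}
for option, value in decoded.items():
    print(f"{option} = {value}")

idle = decoded["IPPROTO_TCP / TCP_KEEPIDLE"]
interval = decoded["IPPROTO_TCP / TCP_KEEPINTVL"]
probes = decoded["IPPROTO_TCP / TCP_KEEPCNT"]
print(f"an idle dead peer is detected after about {idle + interval * probes}s")
```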

The istio-proxy sidecar can be configured in a similar way.

Reference: https://istio.io/latest/docs/…

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bookinfo-redis
spec:
  host: myredissrv.prod.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 30ms
        tcpKeepalive:
          time: 60s
          interval: 20s
          probes: 4

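The story should end here, but it doesn't. Let's look back at the two earlier figures:

At this point, the retransmit timer periodically retransmits at the TCP layer. There are two possibilities:

- After worker node B loses power, Calico quickly detects the problem and updates worker node A's routing table, removing the route to worker node B.
- The route is not updated in time.

With the default retransmit timer, it takes about 15 minutes before the connection is closed and the application is notified. How can this be sped up?

The TCP_USER_TIMEOUT mentioned above can be used to speed up the detection of half-open TCP connections that are in the retransmission state: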

https://github.com/istio/isti…

https://github.com/istio/isti…

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: sampleoptions
  namespace: istio-system
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
      cluster:
        name: "outbound|12345||foo.ns.svc.cluster.local"
    patch:
      operation: MERGE
      value:
        upstream_bind_config:
          source_address:
            address: "0.0.0.0"
            port_value: 0
            protocol: TCP
          socket_options:
          - name: 18 #TCP_USER_TIMEOUT
            int_value: 10000
            level: 6

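The above speeds up the detection of a dead upstream (server crash). For a dead downstream, a similar approach might work, configured on the listener.

The story really should end here, but it still doesn't.

Application-layer health checking may also speed up the detection of TCP half-open connections to the upstream cluster, or in other words endpoint outliers. Note that the health checking here is not the k8s liveness/readiness probe. It is pod-to-pod health checking, which includes pod-to-pod connectivity.

Envoy has two kinds of health checking:

- Active health checking: Health checking
- Passive health checking: Outlier detection

See: Health checking interactions

If Envoy is configured for either active or passive health checking, all connection pools for a host are closed when the host transitions from an available state to an unavailable state. If the host recovers and re-enters load balancing, it creates fresh connections, which gives the best chance of working around dead connections (caused by ECMP routing or anything else).

See: On eventually consistent service discovery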

https://www.envoyproxy.io/doc…

https://istio.io/latest/docs/…

https://www.envoyproxy.io/doc…

https://www.envoyproxy.io/doc…

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1 #The maximum number of requests that will be queued while waiting for a ready connection pool connection
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF
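To show what consecutive5xxErrors: 1 means for routing, here is a toy model of passive outlier detection. It is a deliberate simplification and not Envoy's implementation: it ignores interval, baseEjectionTime, and maxEjectionPercent, and the class and host names are hypothetical.

```python
class ConsecutiveErrorTracker:
    """Toy model: eject a host after N consecutive 5xx responses."""

    def __init__(self, consecutive_5xx_errors=1):
        self.threshold = consecutive_5xx_errors
        self.streak = {}      # host -> current run of consecutive 5xx responses
        self.ejected = set()  # hosts currently removed from load balancing

    def record(self, host, status):
        if 500 <= status <= 599:
            self.streak[host] = self.streak.get(host, 0) + 1
            if self.streak[host] >= self.threshold:
                self.ejected.add(host)
        else:
            self.streak[host] = 0  # any non-5xx response resets the streak

    def healthy(self, hosts):
        return [h for h in hosts if h not in self.ejected]

hosts = ["pod-on-node-a", "pod-on-node-b"]  # hypothetical endpoint names
tracker = ConsecutiveErrorTracker(consecutive_5xx_errors=1)
tracker.record("pod-on-node-a", 200)
tracker.record("pod-on-node-b", 503)  # e.g. an error generated after a failed connect
print(tracker.healthy(hosts))
```

With a threshold of 1, a single failure is enough to take the endpoint out of rotation. That is fast, but also aggressive if the failure was transient.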

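After worker node B loses power, the pods running on it eventually (by default after roughly 10 minutes) reach the Terminating state, and k8s notifies istiod to remove the endpoint. So the question is: which is faster, EDS or the health check detecting the failure, and which data does Envoy use when picking a host for load balancing?

The following document discusses this question to some extent: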

On eventually consistent service discovery

Envoy was designed from the beginning with the idea that service discovery does not require full consistency. Instead, Envoy assumes that hosts come and go from the mesh in an eventually consistent way. Our recommended way of deploying a service to service Envoy mesh configuration uses eventually consistent service discovery along with active health checking (Envoy explicitly health checking upstream cluster members) to determine cluster health. This paradigm has a number of benefits:

All health decisions are fully distributed. Thus, network partitions are gracefully handled (whether the application gracefully handles the partition is a different story).

When health checking is configured for an upstream cluster, Envoy uses a 2×2 matrix to determine whether to route to a host:

| Discovery Status | Health Check OK | Health Check Failed |
| --- | --- | --- |
| Discovered | Route (included in load balancing) | Don't Route |
| Absent | Route (included in load balancing) | Don't Route / Delete |

Host discovered / health check OK

Envoy will route to the target host.

Host absent / health check OK:

Envoy will route to the target host. This is very important since the design assumes that the discovery service can fail at any time. If a host continues to pass health check even after becoming absent from the discovery data, Envoy will still route. Although it would be impossible to add new hosts in this scenario, existing hosts will continue to operate normally. When the discovery service is operating normally again the data will eventually re-converge.

Host discovered / health check FAIL

Envoy will not route to the target host. Health check data is assumed to be more accurate than discovery data.

Host absent / health check FAIL

Envoy will not route and will delete the target host. This is the only state in which Envoy will purge host data.
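The four cases quoted above reduce to two boolean rules, sketched here. The function name is mine. The logic only restates the matrix and says nothing about how Envoy implements it internally.

```python
def routing_decision(discovered, health_check_ok):
    """Return (route, delete) for a host, following the 2x2 matrix quoted above."""
    route = health_check_ok  # health check data wins over discovery data
    delete = (not discovered) and (not health_check_ok)  # the only state that purges the host
    return route, delete

for discovered in (True, False):
    for health_check_ok in (True, False):
        route, delete = routing_decision(discovered, health_check_ok)
        print(f"discovered={discovered} health_ok={health_check_ok} -> route={route} delete={delete}")
```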

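One thing I don't fully understand is whether Absent here means the host is absent because the request to the EDS service failed, or because the request succeeded but a previously known endpoint no longer appears in the result.

| Discovery Status | Health Check OK | Health Check Failed |
| --- | --- | --- |
| Discovered | Route (included in load balancing) | Don't Route |
| Absent | <mark>Route (included in load balancing)</mark> | Don't Route / Delete |

Looking back at the earlier figure:

This gives a rough idea of where health check configuration could be used to speed up problem detection.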

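- Connection establishment timeout: connect_timeout. Istio defaults to 10s. This setting affects how quickly outlier detection reacts.
- Idle connection timeout: idle_timeout, default 1 hour
  - Istio destination-rule
- Maximum connection duration: max_connection_duration, default unlimited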

Envoy:

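- Request receive timeout at the Envoy application layer: request_timeout, default unlimited
- Header read timeout: request_headers_timeout, default unlimited
- More at: https://www.envoyproxy.io/doc…

The response wait timeout for upstream is the time measured from when the request has been fully read from downstream until the response has been fully read from upstream. See: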

Envoy Route timeouts

https://istio.io/latest/docs/…

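See: Istio retry

If the powered-off worker node restarts, can its former peers quickly receive a TCP RST and tear down the dead connections?

If NAT/conntrack sits on the connection path of the powered-off worker node, then after the session and port mapping state is lost, is a TCP RST always returned? Or is the packet dropped?

This article is a bit messy. To be honest, some of the settings and mechanisms are interrelated and affect each other, which makes them very hard to untangle completely:

- The various timeouts at the TCP layer
- The various syscall timeouts
- The various timeouts at the application layer
- Health check and Outlier detection
- Retry

I hope that one day someone (perhaps myself) can explain all of this clearly. The goal of this article is to first write down the variables involved and establish a scope, and then work through the mechanisms in detail. I hope it is useful to readers 🥂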

https://blog.cloudflare.com/w…
https://codearcana.com/posts/…
https://www.evanjones.ca/tcp-…
