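Contents
- Starting the test
- Root cause
  - TCP half-open
  - keepalive
  - Retransmission timeout
  - Zero window timeout
- Application socket-layer timeout settings
  - TCP_USER_TIMEOUT
  - SO_RCVTIMEO / SO_SNDTIMEO
  - poll timeout
- Root-cause summary
- What is the point of digging this deep
- Keepalive check for idle connections
  - As upstream (server)
  - As downstream (client)
  - TCP_USER_TIMEOUT
- Envoy application-layer health checking
  - Health checking and connection pools
  - Health checking and endpoint discovery
  - Active health checking: Health checking
  - Passive health checking: Outlier detection
  - Health checking vs. EDS: who wins?
- Envoy application-layer timeouts
  - Envoy connection-level timeouts
  - Envoy request-level timeouts
    - Request read timeout for downstream (client)
    - Response wait timeout for upstream (server)
  - Mind the retry effect when talking about timeouts
- Thoughts
- A short summary
- Main references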
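Recently I needed to run some HA / Chaos Testing against an environment made up of a k8s cluster, a VIP load balancer, and Istio. As shown in the figure below, the goal was to see how an abnormal shutdown or a network partition of worker node B affects external users (Client):
- Impact on request success rate
- Impact on performance (TPS / Response Time)
A few notes on the figure above: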
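- TCP/IP-layer load balancing for the external VIP (virtual IP) is done by ECMP (Equal-Cost Multi-Path) with a Modulo-N algorithm. In essence, it distributes external traffic by the 5-tuple of the TCP connection (protocol, srcIP, srcPort, dstIP, dstPort). Note that this load balancing algorithm is stateless: when the number of targets changes, the result of the algorithm changes too. In other words, it is an unstable algorithm.
- After TCP traffic whose dstIP is the VIP reaches a worker node, ipvs/conntrack rules apply stateful DNAT, and the dstIP is mapped and translated to the address of one of the Istio Gateway PODs. Note that this load balancing is stateful: when the number of targets changes, the result for existing connections does not change. In other words, it is a stable algorithm.
- The Istio Gateway POD also load balances HTTP/TCP traffic. The two protocols differ as follows:
  - For HTTP, multiple requests on one connection from the same downstream (traffic sender) may be balanced to different upstreams (traffic targets).
  - For TCP, multiple packets on one connection from the same downstream (traffic sender) are balanced to the same upstream (traffic target).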
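To make the "stateless, unstable" property of Modulo-N concrete, here is a minimal Python sketch. It is only an illustration: real routers use their own hash functions, and the `pick_node` helper, the SHA-256 hash, the node names, and the documentation-range IP addresses are all assumptions of mine. It shows that when node B disappears, some flows that were never on node B are still remapped to a different node.

```python
import hashlib

def pick_node(five_tuple, nodes):
    """Modulo-N selection: hash the 5-tuple and take it modulo the node count."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()  # stand-in for the router's hash
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

# 1000 hypothetical client flows that differ only in source port.
flows = [("tcp", "198.51.100.7", 40000 + i, "203.0.113.10", 443) for i in range(1000)]

before = {f: pick_node(f, ["node-a", "node-b", "node-c"]) for f in flows}
after = {f: pick_node(f, ["node-a", "node-c"]) for f in flows}  # node-b is gone

# Flows that were NOT on node-b but still landed on a different node afterwards.
moved_survivors = sum(
    1 for f in flows if before[f] != "node-b" and after[f] != before[f]
)
print("flows remapped although their node survived:", moved_survivors)
```

Because the second hop (ipvs/conntrack) is stateful, a remapped flow arrives at a node that has no conntrack entry for it, which is one way dead connections appear.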
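Starting the test
The Chaos Testing method is to forcibly power off worker node B. From the figure above, we can infer that the connections drawn as red and green lines are all directly affected. The impact seen from the client was:
- Request success rate dropped by only 0.01%
- TPS dropped by half and only recovered after about half an hour
- Avg Response Time stayed basically unchanged
Note that the resources of a single worker node are not the performance bottleneck in this test. So where is the problem?
The client is a JMeter program. A close look at its test report showed that, after the worker node was shut down, Avg Response Time hardly changed, but the P99 and MAX Response Time became abnormally large. Clearly Avg Response Time hides a lot: the test threads were very likely blocked somewhere, which caused the TPS drop.
After some fiddling, I changed the timeout of the external client's JMeter to 6s, and the problem was solved. After the worker node shutdown, TPS recovered quickly.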
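Root cause
The external client problem was solved, so I could have called it a day. But being someone who likes to tinker, I wanted to find the cause, and even more to know whether this was a real fast recovery or whether some hidden risk remained.
Before starting, one concept:
TCP half-open
📖 TCP half-open
According to RFC 793, a TCP connection is called half-open when the host at one end has crashed, or has removed the socket without notifying the other end. If the half-open end is idle (that is, it sends no data or keepalive), the connection may stay half-open for an unbounded period of time.
After worker node B shuts down, from the external client's point of view (see the figure above), its TCP connection to worker node B may be in one of two states:
- The client kernel needs to send a packet to the peer, because it is sending (or retransmitting) data, or because the idle time has reached the keepalive time. Worker node A receives this packet, which does not belong to any valid TCP connection, so the possible outcomes are:
  - It replies with a TCP RESET. The client closes the connection on receipt. The client thread blocked on the socket returns because the connection was closed, continues running, and closes the socket.
  - Because the DNAT mapping table has no matching connection, the packet is dropped with no reply. The client thread blocked on the socket stays blocked. This is a TCP half-open.
- The client connection has keepalive disabled, or the idle time has not reached the keepalive time, and the kernel has no data to send (or retransmit). The client thread is blocked in socket read, waiting. This is a TCP half-open.
As we can see, in most cases the client needs a certain amount of time to discover that a connection is dead. In the worst case, for example with keepalive disabled, it may never discover the TCP half-open.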
keepalive
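From [TCP/IP Illustrated Volume 1]:
A keepalive probe is an empty (or 1-byte) segment whose sequence number is 1 less than the largest ACK number seen from the peer so far. Because this sequence number has already been received by the peer, receiving this empty segment has no side effect, but it elicits an ACK from the peer, which is used to determine whether the peer is alive. Neither the probe segment nor its ACK contains any new data.
If a probe segment is lost, TCP does not retransmit it. [RFC1122] states that, because of this fact, a single keepalive probe that receives no ACK should not be taken as sufficient evidence that the peer is dead. Multiple probes at intervals are needed.
If a socket has SO_KEEPALIVE turned on, keepalive is enabled.
For TCP connections with keepalive enabled, Linux has the following global default settings: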
https://www.kernel.org/doc/ht…
- tcp_keepalive_time – INTEGER
  How often TCP sends out keepalive messages when keepalive is enabled. Default: 2 hours.
- tcp_keepalive_probes – INTEGER
  How many keepalive probes TCP sends out, until it decides that the connection is broken. Default value: 9.
- tcp_keepalive_intvl – INTEGER
  How frequently the probes are send out. Multiplied by tcp_keepalive_probes it is time to kill not responding connection, after probes started. Default value: 75 sec i.e. connection will be aborted after ~11 minutes of retries.
Linux also provides options that can be set independently on each socket:
https://man7.org/linux/man-pa…
TCP_KEEPCNT (since Linux 2.4)
The maximum number of keepalive probes TCP should send
before dropping the connection. This option should not be
used in code intended to be portable.
TCP_KEEPIDLE (since Linux 2.4)
The time (in seconds) the connection needs to remain idle
before TCP starts sending keepalive probes, if the socket
option SO_KEEPALIVE has been set on this socket. This
option should not be used in code intended to be portable.
TCP_KEEPINTVL (since Linux 2.4)
The time (in seconds) between individual keepalive probes.
This option should not be used in code intended to be portable.
From this we can compute that, with the defaults, the earliest a connection can be closed by keepalive is:
TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT = 2*60*60 + 75*9 = 7875 seconds, about 2 hours 11 minutes
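The calculation above, together with the per-socket options, can be sketched in Python as follows. This is a minimal sketch that assumes a Linux host. The helper names `keepalive_detection_seconds` and `enable_keepalive` are mine, and the values 7/5/2 are just sample numbers, not a recommendation.

```python
import socket

def keepalive_detection_seconds(idle, interval, probes):
    """Earliest time (seconds) at which keepalive declares an idle peer dead."""
    return idle + interval * probes

def enable_keepalive(sock, idle, interval, probes):
    """Turn on SO_KEEPALIVE and override the Linux defaults for this socket only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEP* are not portable; skip any option this platform's Python lacks.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

# Linux global defaults: 2 hours idle, 75 s between probes, 9 probes.
print(keepalive_detection_seconds(2 * 60 * 60, 75, 9))  # 7875

# Sample per-socket override: dead peer detected after 7 + 5 * 2 = 17 seconds.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock, idle=7, interval=5, probes=2)
sock.close()
```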
Retransmission timeout
https://www.kernel.org/doc/Do…
- tcp_retries2 - INTEGER
  This value influences the timeout of an alive TCP connection, when RTO retransmissions remain unacknowledged. Given a value of N, a hypothetical TCP connection following exponential backoff with an initial RTO of TCP_RTO_MIN would retransmit N times before killing the connection at the (N+1)th RTO. The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout. RFC 1122 recommends at least 100 seconds for the timeout, which corresponds to a value of at least 8.
The setting above controls how many exponential-backoff retransmissions the kernel performs in the retransmission state before it closes the connection. The default is 15, which translates to about 924s, roughly 15 minutes.
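The 924.6-second figure can be reproduced with a few lines of Python. This sketch assumes the usual Linux constants TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s, which the quoted text does not spell out. The function name is mine. It sums the N+1 backed-off RTOs described in the kernel documentation.

```python
def hypothetical_timeout_ms(retries, rto_min_ms=200, rto_max_ms=120_000):
    """Sum of (retries + 1) RTOs that double each time, capped at rto_max_ms."""
    total = 0
    rto = rto_min_ms
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max_ms)
    return total

print(hypothetical_timeout_ms(15) / 1000)  # 924.6 seconds, the default
print(hypothetical_timeout_ms(8) / 1000)   # 102.2 seconds, just over the RFC 1122 minimum
```

Keep in mind that this is only a lower bound: the real RTO is derived from the measured RTT, so the effective timeout is usually longer.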
Zero window timeout
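When the peer advertises a window size of zero, it means the peer's TCP receive buffer is full and it cannot accept more data. This may happen because the peer is short on resources and processes data too slowly, so the TCP receive buffer eventually fills up.
In theory, once the peer has processed the data piled up in its receive window, it sends an ACK to announce that the window has opened. For various reasons, though, this ACK is sometimes lost.
So a sender with unsent data needs to probe the window size periodically. The sender picks the first byte of the unsent buffer and sends it as a probe. When the probes exceed a certain count and the peer either does not respond or keeps replying with a zero window, the connection is closed automatically. The Linux default is 15 probes, and the setting is tcp_retries2. The probe retry mechanism is similar to TCP retransmission.
Reference: https://blog.cloudflare.com/w…,Zero%20window,-ESTAB%20is…%20forever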
Application socket-layer timeout settings
TCP_USER_TIMEOUT
man tcp
TCP_USER_TIMEOUT (since Linux 2.6.37)
This option takes an unsigned int as an argument. When
the value is greater than 0, it specifies the maximum
amount of time in milliseconds that transmitted data may
remain unacknowledged, or bufferred data may remain
untransmitted (due to zero window size) before TCP will
forcibly close the corresponding connection and return
ETIMEDOUT to the application. If the option value is
specified as 0, TCP will use the system default.
Increasing user timeouts allows a TCP connection to
survive extended periods without end-to-end connectivity.
Decreasing user timeouts allows applications to "fail
fast", if so desired. Otherwise, failure may take up to
20 minutes with the current system defaults in a normal
WAN environment.
This option can be set during any state of a TCP
connection, but is effective only during the synchronized
states of a connection (ESTABLISHED, FIN-WAIT-1, FIN-
WAIT-2, CLOSE-WAIT, CLOSING, and LAST-ACK). Moreover,
when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when
to close a connection due to keepalive failure.
The option has no effect on when TCP retransmits a packet,
nor when a keepalive probe is sent.
This option, like many others, will be inherited by the
socket returned by accept(2), if it was set on the
listening socket.
Further details on the user timeout feature can be found
in RFC 793 and RFC 5482 ("TCP User Timeout Option").
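That is, it specifies how long sent data may go unacknowledged (no ACK received), or how long the peer's receive window may stay at 0, before the kernel closes the connection and returns an error to the application.
Note that TCP_USER_TIMEOUT affects how the keepalive TCP_KEEPCNT setting behaves: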
https://blog.cloudflare.com/w…
With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:
TCP_USER_TIMEOUT < TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT
SO_RCVTIMEO / SO_SNDTIMEO
https://man7.org/linux/man-pa…
SO_RCVTIMEO and SO_SNDTIMEO
Specify the receiving or sending timeouts until reporting
an error. The argument is a struct timeval. If an input
or output function blocks for this period of time, and
data has been sent or received, the return value of that
function will be the amount of data transferred; if no
data has been transferred and the timeout has been
reached, then -1 is returned with errno set to EAGAIN or
EWOULDBLOCK, or EINPROGRESS (for connect(2)) just as if
the socket was specified to be nonblocking. If the
timeout is set to zero (the default), then the operation
will never timeout. Timeouts only have effect for system
calls that perform socket I/O (e.g., read(2), recvmsg(2),
send(2), sendmsg(2)); timeouts have no effect for
select(2), poll(2), epoll_wait(2), and so on.
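Note that in this case our client is JMeter, which is implemented in Java and uses the socket.setSoTimeout method to set the timeout. According to:
https://stackoverflow.com/que…
and the source code I read, the Linux implementation appears to use the timeout parameter of select/poll described in the next section, rather than the socket options above.
https://github.com/openjdk/jd…
When JMeter catches a SocketTimeoutException, it actively closes the socket and reconnects, so the dead-socket problem is handled at the application layer.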
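The same "read timeout, then close and reconnect" pattern can be sketched in Python. This is not JMeter's code, only an illustration of the idea. The function name `read_with_timeout` and the choice to return `None` on timeout are assumptions of mine.

```python
import socket

def read_with_timeout(host, port, timeout_s, nbytes=1024):
    """Connect, wait at most timeout_s for data, and always close the socket.

    Returns the received bytes, or None if the read timed out. A caller that
    gets None can treat the connection as dead and open a new one.
    """
    sock = socket.create_connection((host, port), timeout=timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The peer may be half-open; do not keep waiting on this socket.
        return None
    finally:
        sock.close()
```

The important point is the `finally` branch: the application gives up on the silent socket by itself instead of waiting for the kernel to notice, which is why the 6s JMeter timeout brought TPS back so quickly.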
poll timeout
https://man7.org/linux/man-pa…
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
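Here is a minimal Python sketch of waiting on a socket with a poll timeout. It uses `select.poll`, which wraps the poll(2) call shown above and is available on Linux. The helper name `wait_readable` is mine.

```python
import select
import socket

def wait_readable(sock, timeout_ms):
    """Return True if sock becomes readable within timeout_ms, else False."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    events = poller.poll(timeout_ms)  # empty list means the timeout expired
    return bool(events)

# Usage with a local socket pair: nothing to read at first, then one byte arrives.
left, right = socket.socketpair()
print(wait_readable(left, 50))    # False: timed out, no data yet
right.send(b"x")
print(wait_readable(left, 1000))  # True: data is waiting
left.close()
right.close()
```

Unlike SO_RCVTIMEO, the timeout here belongs to the poll call, not to the socket, so the socket itself carries no timeout state.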
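Root-cause summary
Reference: https://blog.cloudflare.com/w…
To make sure timeouts are detected reasonably quickly in every connection state:
- Enable TCP keepalive and configure reasonable times. This is required to keep some data flowing on idle connections.
- Set TCP_USER_TIMEOUT to TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT.
- Use read/write timeouts at the application layer, and have the application actively close the connection after a timeout. (This is the case in this article.)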
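The second item in the list above can be sketched in Python as follows. It is a minimal sketch assuming Linux: the fallback value 18 is the Linux constant for TCP_USER_TIMEOUT, used only when this Python build does not expose the name. The helper names are mine. Note the unit change: keepalive options are in seconds, TCP_USER_TIMEOUT is in milliseconds.

```python
import socket

# 18 is the Linux value; assumed here only as a fallback for older Python builds.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)

def user_timeout_ms(idle_s, interval_s, probes):
    """TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT, converted to milliseconds."""
    return (idle_s + interval_s * probes) * 1000

def apply_user_timeout(sock, timeout_ms):
    """Bound how long unacknowledged data may stay outstanding on this socket."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)

# With sample keepalive values of 7 s idle, 5 s interval, 2 probes:
print(user_timeout_ms(7, 5, 2))  # 17000
```

A caller would then run `apply_user_timeout(sock, user_timeout_ms(7, 5, 2))` on a connected Linux socket, next to the keepalive options.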
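Why do we still need TCP_USER_TIMEOUT when we already have TCP keepalive? Because if a network partition happens, a connection in the retransmission state does not trigger keepalive probes. I recorded the mechanism in the figure below: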
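What is the point of digging this deep
🤔 ❓ At this point some readers may ask: in the end, all you did this time was adjust an application-layer read timeout. Why study and dig into so much else?
Let's go back to the original goal in the figure below and check whether every hidden risk has been resolved:
Clearly, only the red line from External Client to k8s worker node B has been resolved. The other red and green lines have not been investigated. Are those TCP half-open connections closed quickly by tcp keepalive, tcp retransmit timeout, or application (Envoy) layer timeout mechanisms? Or do they go undetected for a long time and get closed late, or even leak (connection leak)?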
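Keepalive check for idle connections
As upstream (server)
As the following shows, the Istio gateway does not enable keepalive by default (the ss -o output has no timer:(keepalive,…) field):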
$ kubectl exec -it $ISTIO_GATEWAY_POD -- ss -oipn 'sport 15001 or sport 15001 or sport 8080 or sport 8443'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp ESTAB 0 0 192.222.46.71:8080 10.111.10.101:51092 users:(("envoy",pid=45,fd=665))
sack cubic wscale:11,11 rto:200 rtt:0.064/0.032 mss:8960 pmtu:9000 rcvmss:536 advmss:8960 cwnd:10 segs_in:2 send 11200000000bps lastsnd:31580 lastrcv:31580 lastack:31580 pacing_rate 22400000000bps delivered:1 rcv_space:62720 rcv_ssthresh:56576 minrtt:0.064
In that case, keepalive can be added with an EnvoyFilter:
References:
https://support.f5.com/csp/ar…
https://www.envoyproxy.io/doc…
https://github.com/istio/isti…
https://istio-operation-bible…
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingress-gateway-socket-options
  namespace: istio-system
spec:
  configPatches:
  - applyTo: LISTENER
    match:
      context: GATEWAY
      listener:
        name: 0.0.0.0_8080
        portNumber: 8080
    patch:
      operation: MERGE
      value:
        socket_options:
        - description: enable keep-alive
          int_value: 1
          level: 1
          name: 9
          state: STATE_PREBIND
        - description: idle time before first keep-alive probe is sent
          int_value: 7
          level: 6
          name: 4
          state: STATE_PREBIND
        - description: keep-alive interval
          int_value: 5
          level: 6
          name: 5
          state: STATE_PREBIND
        - description: keep-alive probes count
          int_value: 2
          level: 6
          name: 6
          state: STATE_PREBIND
The istio-proxy sidecar can be configured in a similar way.
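The numeric level and name values in the EnvoyFilter above are raw Linux socket option constants, which makes the YAML hard to read. The small Python lookup below decodes them. The table holds the Linux values as I know them from the kernel headers. Other operating systems use different numbers, and the helper name `describe` is mine.

```python
# (level, name) pairs as used in Envoy socket_options, with their Linux meaning.
LINUX_SOCKOPT_NAMES = {
    (1, 9): "SOL_SOCKET / SO_KEEPALIVE",
    (6, 4): "IPPROTO_TCP / TCP_KEEPIDLE",
    (6, 5): "IPPROTO_TCP / TCP_KEEPINTVL",
    (6, 6): "IPPROTO_TCP / TCP_KEEPCNT",
    (6, 18): "IPPROTO_TCP / TCP_USER_TIMEOUT",
}

def describe(level, name):
    """Translate a numeric (level, name) pair into its symbolic Linux name."""
    return LINUX_SOCKOPT_NAMES.get((level, name), "unknown")

# The four options from the EnvoyFilter, in order, with their int_value.
for level, name, value in ((1, 9, 1), (6, 4, 7), (6, 5, 5), (6, 6, 2)):
    print(f"level={level} name={name} -> {describe(level, name)} = {value}")
```

Read this way, the filter enables keepalive with a 7 s idle time, a 5 s probe interval, and 2 probes, so an idle dead peer is detected after about 7 + 5 * 2 = 17 seconds.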
As downstream (client)
Reference: https://istio.io/latest/docs/…
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bookinfo-redis
spec:
  host: myredissrv.prod.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 30ms
        tcpKeepalive:
          time: 60s
          interval: 20s
          probes: 4
TCP_USER_TIMEOUT
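The story should end here, but it doesn't. Let's look back at the two earlier figures:
At this point the retransmit timer periodically retransmits at the TCP layer. There are two possibilities:
- After worker node B loses power, Calico quickly detects the problem, updates the routing table on worker node A, and removes the route to worker node B.
- The route is not updated in time.
The default retransmit timer takes about 15 minutes before it closes the connection and notifies the application. How can this be sped up?
The TCP_USER_TIMEOUT mentioned above can speed up detection of a half-open TCP connection in the retransmission state: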
https://github.com/istio/isti…
https://github.com/istio/isti…
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: sampleoptions
  namespace: istio-system
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
      cluster:
        name: "outbound|12345||foo.ns.svc.cluster.local"
    patch:
      operation: MERGE
      value:
        upstream_bind_config:
          source_address:
            address: "0.0.0.0"
            port_value: 0
            protocol: TCP
          socket_options:
          - name: 18 # TCP_USER_TIMEOUT
            int_value: 10000 # milliseconds
            level: 6 # IPPROTO_TCP
The above speeds up detection of a dead upstream (server crash). For a dead downstream, a similar approach may work, configured on the listener.
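Envoy application-layer health checking
The story really should end here, but it still doesn't.
Application-layer health checking may also speed up detection of TCP half-open on the upstream cluster, or in other words endpoint outlier problems. <mark>Note that the health checking here is not the k8s liveness/readiness probe. It is pod-to-pod health checking, including connectivity between pods.</mark>
Envoy has two kinds of health checking:
- Active health checking: Health checking
- Passive health checking: Outlier detection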
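Health checking and connection pools
See: Health checking interactions
If Envoy is configured for either active or passive health checking, the connection pools of every host that transitions from an available state to an unavailable state are closed. If the host recovers and re-enters load balancing, it creates fresh connections, which maximizes the chance of working around dead connections (due to ECMP routing or something else).
Health checking and endpoint discovery
See: On eventually consistent service discovery
Active health checking: Health checking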
https://www.envoyproxy.io/doc…
Passive health checking: Outlier detection
https://istio.io/latest/docs/…
https://www.envoyproxy.io/doc…
https://www.envoyproxy.io/doc…
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1 # The maximum number of requests that will be queued while waiting for a ready connection pool connection
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF
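Health checking vs. EDS: who wins?
After worker node B loses power, the pods running on it eventually (by default after roughly 10 minutes) reach the Terminating state. k8s then notifies istiod to remove the endpoint. So the question is: which is faster, EDS or health check failure detection, and which data does Envoy use as the basis for load balancing decisions?
This document discusses the question to some extent:
On eventually consistent service discovery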
Envoy was designed from the beginning with the idea that service discovery does not require full consistency. Instead, Envoy assumes that hosts come and go from the mesh in an eventually consistent way. Our recommended way of deploying a service to service Envoy mesh configuration uses eventually consistent service discovery along with active health checking (Envoy explicitly health checking upstream cluster members) to determine cluster health. This paradigm has a number of benefits:
- All health decisions are fully distributed. Thus, network partitions are gracefully handled (whether the application gracefully handles the partition is a different story).
- When health checking is configured for an upstream cluster, Envoy uses a 2×2 matrix to determine whether to route to a host:
| Discovery Status | Health Check OK | Health Check Failed |
|---|---|---|
| Discovered | Route | Don’t Route |
| Absent | <mark>Route</mark> | Don’t Route / Delete |
Host discovered / health check OK
Envoy will route to the target host.
<mark>Host absent / health check OK:</mark>
Envoy will route to the target host. This is very important since the design assumes that the discovery service can fail at any time. If a host continues to pass health check even after becoming absent from the discovery data, Envoy will still route. Although it would be impossible to add new hosts in this scenario, existing hosts will continue to operate normally. When the discovery service is operating normally again the data will eventually re-converge.
Host discovered / health check FAIL
Envoy will not route to the target host. Health check data is assumed to be more accurate than discovery data.
Host absent / health check FAIL
Envoy will not route and will delete the target host. This is the only state in which Envoy will purge host data.
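The 2×2 matrix quoted above fits in a tiny function. This is only my restatement of the quoted table as code, not Envoy's implementation, and the function name and return strings are assumptions.

```python
def routing_decision(discovered: bool, health_check_ok: bool) -> str:
    """Restate the discovery-status / health-check matrix as a lookup."""
    if health_check_ok:
        # A passing health check wins even when discovery no longer lists the host.
        return "route"
    if discovered:
        return "don't route"
    # Absent from discovery AND failing health checks: the only purge case.
    return "don't route / delete"

for discovered in (True, False):
    for healthy in (True, False):
        print(discovered, healthy, "->", routing_decision(discovered, healthy))
```

Written this way, it is easy to see that the health check result alone decides whether traffic is routed, while the discovery data only decides whether a failing host is kept or purged.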
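One thing I have not fully understood is whether Absent here means that access to the EDS service failed, or that access succeeded but a previously known endpoint no longer appears in the result.
Looking back at the earlier figure:
We can roughly tell where health check configuration could be considered to speed up problem detection.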
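Envoy application-layer timeouts
Envoy connection-level timeouts
- New connection timeout: connect_timeout, Istio default 10s. This setting affects how quickly outlier detection reacts.
- Idle connection timeout: idle_timeout, default 1 hour
  - Istio destination-rule
- Maximum connection duration: max_connection_duration, default unlimited
Envoy request-level timeouts
Request read timeout for downstream (client)
Envoy:
- Envoy application-layer request receive timeout: request_timeout, default unlimited
- Header read timeout: request_headers_timeout, default unlimited
- More: https://www.envoyproxy.io/doc…
Response wait timeout for upstream (server)
That is, the time measured from when the downstream request has been fully read until the response has been fully read from upstream. See:
Envoy Route timeouts
https://istio.io/latest/docs/…
Mind the retry effect when talking about timeouts
See: Istio retry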
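Thoughts
If the powered-off worker node restarts, will the former peers quickly receive a TCP RST and drop the dead connections?
If NAT/conntrack is involved in the connection path of the powered-off worker node, then after the session and port-mapping state is lost, is a TCP RST always returned? Or are the packets dropped?
A short summary
This article is a bit messy. To be honest, some settings and mechanisms are interrelated and affect each other, so sorting them all out is hard:
- TCP-layer timeouts
- syscall timeouts
- Application-layer timeouts
- Health check and Outlier detection
- Retry
I hope that one day someone (maybe me) can explain all of this clearly. The goal of this article is to write down the variables involved first, fix a scope, and then dig into the mechanisms one by one. I hope it is useful to readers 🥂
Main references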
- https://blog.cloudflare.com/w…
- https://codearcana.com/posts/…
- https://www.evanjones.ca/tcp-…