About Istio: Envoy/Istio connection lifecycle and critical edge cases - inexplicable connection resets


Introduction

Goal of this article: explain what Envoy's connection-control parameters do, and the detailed logic in critical edge cases. The aim is to reduce service request failures caused by connection anomalies and to improve the service success rate.

Recently, while troubleshooting occasional connection resets on an Istio Gateway in production, I studied what Envoy and the kernel do when closing socket connections, including Envoy's connection-management parameters and the finer points of Linux network programming. This article is my notes for future reference.

Original post: https://blog.mygraphql.com/zh/posts/cloud/envoy/connection-life/

About the cover: Silicon Valley (the TV series)

Silicon Valley is an American comedy television series created by Mike Judge, John Altschuler and Dave Krinsky. It premiered on HBO on April 6, 2014 and concluded on December 8, 2019, running for 53 episodes. The series parodies the culture of the tech industry in Silicon Valley, centering on Richard Hendricks, a programmer who founds a startup called Pied Piper, and chronicles his struggle to sustain the company while facing competition from larger entities.

The series was praised for its writing and humor. It received numerous accolades and nominations, including five consecutive Primetime Emmy Award nominations for Outstanding Comedy Series.

I watched the series in 2018. My English was limited back then, and about the only word I could make out was F***. Even so, through the screen I could feel how a group of people full of startup passion used everything they had to take on one challenge after another. To some extent it satisfied a craving the real world could not.

A classic scene in the show involves a toy ball that, when tossed into the air, randomly comes down red or blue; whoever gets blue wins. It is called:

SWITCH PITCH BALL

Based on a patented ‘inside-out’ mechanism, this lightweight ball changes colors when it is flipped in the air. The Switch Pitch entered the market in 2001 and has been in continuous production since then. The toy won the Oppenheim Platinum Award for Design and has been featured numerous times on HBO’s Silicon Valley.

So, does a technical article really need such a long prologue? Well, this article is long and a bit dry, TL;DR as they say.

As everyone knows, every network-heavy application, Envoy included, sometimes ends up playing its own random game of SWITCH PITCH BALL. The randomness can come from an unusually slow peer, an unusually small network MTU, an unusually large HTTP body, an unusually long-lived HTTP keep-alive connection, or even a non-conforming HTTP client implementation.

Envoy connection lifecycle management

Excerpted from my site: https://istio-insider.mygraphql.com/zh_CN/latest/ch2-envoy/connection-life/connection-life.html

Upstream/downstream connection decoupling

The HTTP/1.1 specification is designed this way:
An HTTP proxy is an L7 proxy; its behavior should be decoupled from the lifecycle of the L3/L4 connections.

So Envoy does not forward headers such as Connection: Close or Connection: Keepalive coming from the downstream to the upstream. The lifecycle of the downstream connection does, of course, follow the instructions of its Connection: xyz header, but the lifecycle of the upstream connection is not affected by the downstream connection's lifecycle. In other words, these are two independently managed connection lifecycles.

Github Issue: HTTP filter before and after evaluation of Connection: Close header sent by upstream #15788 explains this:
This doesn’t make sense in the context of Envoy, where downstream and upstream are decoupled and can use different protocols. I’m still not completely understanding the actual problem you are trying to solve?

Configuration parameters related to connection timeouts

Figure: timeline of Envoy connection timeouts

Open with Draw.io

idle_timeout

(Duration) The idle timeout for connections. The idle timeout is defined as the period in which there are no active requests. When the idle timeout is reached the connection will be closed. If the connection is an HTTP/2 downstream connection a drain sequence will occur prior to closing the connection, see drain_timeout. Note that request based timeouts mean that HTTP/2 PINGs will not keep the connection alive. If not specified, this defaults to 1 hour. To disable idle timeouts explicitly set this to 0.

Warning

Disabling this timeout has a highly likelihood of yielding connection leaks due to lost TCP FIN packets, etc.

If the overload action “envoy.overload_actions.reduce_timeouts” is configured, this timeout is scaled for downstream connections according to the value for HTTP_DOWNSTREAM_CONNECTION_IDLE.

max_connection_duration

(Duration) The maximum duration of a connection. The duration is defined as a period since a connection was established. If not set, there is no max duration. When max_connection_duration is reached and if there are no active streams, the connection will be closed. If the connection is a downstream connection and there are any active streams, the drain sequence will kick-in, and the connection will be force-closed after the drain period. See drain_timeout.

Github Issue: http: Allow upper bounding lifetime of downstream connections #8302

Github PR: add max_connection_duration: http conn man: allow to upper-bound downstream connection lifetime. #8591

Github PR: upstream: support max connection duration for upstream HTTP connections #17932

Github Issue: Forward Connection:Close header to downstream#14910
For HTTP/1, Envoy will send a Connection: close header after max_connection_duration if another request comes in. If not, after some period of time, it will just close the connection.

https://github.com/envoyproxy…

Note that max_requests_per_connection isn’t (yet) implemented/supported for downstream connections.

For HTTP/1, Envoy will send a Connection: close header after max_connection_duration (and before drain_timeout) if another request comes in. If not, after some period of time, it will just close the connection.

I don’t know what your downstream LB is going to do, but note that according to the spec, the Connection header is hop-by-hop for HTTP proxies.

max_requests_per_connection

(UInt32Value) Optional maximum requests for both upstream and downstream connections. If not specified, there is no limit. Setting this parameter to 1 will effectively disable keep alive. For HTTP/2 and HTTP/3, due to concurrent stream processing, the limit is approximate.

Github Issue: Forward Connection:Close header to downstream#14910

We are having this same issue when using istio (istio/istio#32516). We are migrating to use istio with envoy sidecars frontend be an AWS ELB. We see that connections from ELB -> envoy stay open even when our application is sending Connection: Close. max_connection_duration works but does not seem to be the best option. Our applications are smart enough to know when they are overloaded from a client and send Connection: Close to shard load.

I tried writing an envoy filter to get around this but the filter gets applied before the stripping. Did anyone discover a way to forward the connection close header?
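
Putting the three parameters above together, the sketch below shows where they live in an HttpConnectionManager configuration (only the relevant fields are shown; the values are illustrative, not recommendations):

# Sketch: where idle_timeout, max_connection_duration and
# max_requests_per_connection sit inside the HttpConnectionManager
# (common_http_protocol_options). Values are made up for illustration.
common_http_protocol_options:
  idle_timeout: 300s              # no active requests for 5 minutes -> close (default: 1h)
  max_connection_duration: 1800s  # upper-bound on connection lifetime (unset = unlimited)
  max_requests_per_connection: 1000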

drain_timeout – for downstream only

Envoy Doc

(Duration) The time that Envoy will wait between sending an HTTP/2 “shutdown notification” (GOAWAY frame with max stream ID) and a final GOAWAY frame. This is used so that Envoy provides a grace period for new streams that race with the final GOAWAY frame. During this grace period, Envoy will continue to accept new streams.

After the grace period, a final GOAWAY frame is sent and Envoy will start refusing new streams. Draining occurs both when:

  • a connection hits the idle timeout

    • That is, once a connection reaches idle_timeout or max_connection_duration, it enters the draining state and the drain_timeout timer starts. For HTTP/1.1, while draining, Envoy adds a Connection: close header to the response of every request that arrives from the downstream.
    • So a connection only enters the draining state (and starts the drain_timeout timer) after idle_timeout or max_connection_duration has fired.
  • or during general server draining.

The default grace period is 5000 milliseconds (5 seconds) if this option is not specified.

https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining

By default, the HTTP connection manager filter will add “Connection: close” to HTTP1 requests (author's note: via the HTTP response), send HTTP2 GOAWAY, and terminate connections on request completion (after the delayed close period).

I used to think draining was triggered only when Envoy itself was about to shut down. It now appears that any planned connection close (a connection reaching idle_timeout or max_connection_duration) goes through the drain process.

delayed_close_timeout – for downstream only

(Duration) The delayed close timeout is for downstream connections managed by the HTTP connection manager. It is defined as a grace period after connection close processing has been locally initiated during which Envoy will wait for the peer to close (i.e., a TCP FIN/RST is received by Envoy from the downstream connection) prior to Envoy closing the socket associated with that connection.

That is, in some scenarios Envoy writes back an HTTP response and wants to close the connection before it has fully read the HTTP request. This is called the server prematurely/early closing the connection. At that point, several things may be true:

  • The downstream is still in the middle of sending the HTTP request (socket write).
  • Or Envoy's kernel still holds data in the socket recv buffer that Envoy's user space has not consumed. Typically a body of Content-Length size is still sitting in the kernel's socket recv buffer, not yet fully loaded into Envoy user space.

In both cases, if Envoy calls close(fd) to close the connection, the downstream may receive an RST from Envoy's kernel. The downstream may then treat the connection as failed without ever reading the HTTP response from the socket, and report an error such as "Peer connection reset" to its caller.

Details: see the race-condition section (connection-life-race) excerpted later in this article.

To mitigate this, Envoy offers a configuration that delays closing the connection: wait for the downstream to finish its socket writes, let the data in the kernel socket recv buffer be loaded into user space, and only then call close(fd).

NOTE: This timeout is enforced even when the socket associated with the downstream connection is pending a flush of the write buffer. However, any progress made writing data to the socket will restart the timer associated with this timeout. This means that the total grace period for a socket in this state will be <total_time_waiting_for_write_buffer_flushes>+<delayed_close_timeout>.

In other words, every successful socket write resets this timer.

Delaying Envoy’s connection close and giving the peer the opportunity to initiate the close sequence mitigates a race condition that exists when downstream clients do not drain/process data in a connection’s receive buffer after a remote close has been detected via a socket write(). In other words, it mitigates the case where the downstream, after a failed socket write, never reads the response left in the socket.

This race leads to such clients failing to process the response code sent by Envoy, which could result in erroneous downstream processing.

If the timeout triggers, Envoy will close the connection’s socket.

The default timeout is 1000 ms if this option is not specified.

Note:

To be useful in avoiding the race condition described above, this timeout must be set to at least <max round trip time expected between clients and Envoy> + <100ms to account for a reasonable “worst” case processing time for a full iteration of Envoy’s event loop>.

Warning:

A value of 0 will completely disable delayed close processing. When disabled, the downstream connection’s socket will be closed immediately after the write flush is completed or will never close if the write flush does not complete.
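
For reference, both of these downstream-only timers are plain fields on the HttpConnectionManager, next to the common_http_protocol_options shown earlier (a sketch with illustrative values):

# Sketch: drain_timeout and delayed_close_timeout are direct fields of the
# HttpConnectionManager. Values are illustrative, not recommendations.
drain_timeout: 5s          # grace period between "shutdown notification" and the final close (default: 5s)
delayed_close_timeout: 1s  # how long to wait for the peer to close first; 0 disables delayed close (default: 1s)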

Note that, in order not to hurt performance, delayed_close_timeout does not take effect in many situations:

Github PR: http: reduce delay-close issues for HTTP/1.1 and below #19863

Skipping delay close for:

  • HTTP/1.0 framed by connection close (as it simply reduces time to end-framing)
  • as well as HTTP/1.1 if the request is fully read (so there’s no FIN-RST race)

Addresses the Envoy-specific parts of #19821
Runtime guard: envoy.reloadable_features.skip_delay_close

It also appears in the Envoy 1.22.0 release notes:

http: avoiding delay-close for:

  • HTTP/1.0 responses framed by connection: close
  • as well as HTTP/1.1 if the request is fully read.

This means for responses to such requests, the FIN will be sent immediately after the response. This behavior can be temporarily reverted by setting envoy.reloadable_features.skip_delay_close to false. If clients are seen to be receiving sporadic partial responses and flipping this flag fixes it, please notify the project immediately.
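
If you do need to revert to the old delay-close behavior, the runtime guard named in the release note can be set through Envoy's layered runtime. The sketch below assumes a plain Envoy bootstrap; only the key envoy.reloadable_features.skip_delay_close comes from the release note, the layer name is arbitrary, and an Istio deployment would have to inject this through its own bootstrap-override mechanism:

# Sketch: flipping the runtime guard in the bootstrap's layered runtime.
layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      envoy.reloadable_features.skip_delay_close: false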

Race conditions after Envoy closes a connection

Excerpted from my site: https://istio-insider.mygraphql.com/zh_CN/latest/ch2-envoy/connection-life/connection-life-race.html

The following relies on some fairly low-level and obscure socket knowledge, such as the critical states and exception logic of closing a socket. If that is unfamiliar, I suggest first reading my article:

"Socket Close/Shutdown 的临界状态与异常逻辑" (the critical states and exception logic of socket close/shutdown) in 《Mark's DevOps 雜碎》.

Envoy's connection state out of sync with downstream/upstream

Most of the situations below are low-probability race conditions. But under heavy traffic, even a tiny probability will eventually be hit. Design for failure is a programmer's duty.

Downstream sends a request on a connection that Envoy is closing

Github Issue: 502 on our ALB when traffic rate drops#13388
Fundamentally, the problem is that ALB is reusing connections that Envoy is closing. This is an inherent race condition with HTTP/1.1.
You need to configure the ALB max connection / idle timeout to be < any envoy timeout.

To have no race conditions, the ALB needs to support max_connection_duration and have that be less than Envoy’s max connection duration. There is no way to fix this with Envoy.

In essence:

  1. Envoy calls close(fd), closing the socket and releasing the fd.

    • At the moment of close(fd):

      • If the kernel's socket recv buffer holds data that user space has not read, the kernel sends an RST to the downstream, because that data has already been ACKed at the TCP level yet the application threw it away.
      • Otherwise, the kernel sends a FIN to the downstream.
    • Because the fd is closed, any further TCP packet the kernel receives on this connection will be discarded and answered with an RST.
  2. Envoy has sent the FIN.
  3. The kernel state of Envoy's socket moves to FIN_WAIT_1 or FIN_WAIT_2.

On the downstream side, there are two possibilities:

  • The socket in the downstream's kernel has already been moved to CLOSE_WAIT by the FIN from Envoy, but the downstream program (user space) has not caught up (i.e., it is unaware of the CLOSE_WAIT state).
  • Or the downstream's kernel has not even received the FIN yet, due to network latency or similar issues.

Either way, the downstream program reuses this socket and sends an HTTP request (assume it is split into multiple IP packets). The outcome is the same: when one of those IP packets reaches Envoy's kernel, the kernel replies with an RST. Having received the RST, the downstream kernel closes the socket too, so from some point on every socket write fails, with an error like "Upstream connection reset". Note that socket write() is asynchronous: it returns without waiting for the peer's ACK.

  • One possibility is that some later write() fails. This is more typical of HTTP keep-alive client libraries, or of bodies much larger than the socket send buffer that get split across many IP packets.
  • Another possibility is that the failure only surfaces at close(), when the ACK is waited for. This is more typical of non-keep-alive HTTP client libraries, or of the last request on a keep-alive connection.

At the HTTP level, this problem can show up in two scenarios:

  • The server prematurely/early closes the connection.

    The downstream writes the HTTP headers and then writes the HTTP body. But before it has finished reading the body, Envoy has already written the response and called close(fd) on the socket. This is what "server prematurely/early closes connection" means. Don't assume Envoy never writes a response and closes the socket before fully reading the request; there are at least a couple of ways it can happen:

    • The headers alone are enough to judge the request invalid, so most of these cases return a 4xx/5xx status code.
    • The HTTP request body exceeds Envoy's limit max_request_bytes.

    At that point, there are two situations:

    • The downstream's socket may be in CLOSE_WAIT, a state in which write() still succeeds. But if that HTTP body reaches Envoy's kernel, then because close(fd) has already been called and the socket's fd is gone, the kernel simply drops the body and returns an RST to the peer (with the fd closed, no process could read the data anyway). The unlucky downstream then sees an error like "Connection reset by peer".
    • When Envoy called close(fd), the kernel found data in the socket recv buffer that user space had not fully consumed. In that case the kernel sends an RST to the downstream, and the unlucky downstream ends up seeing an error like "Connection reset by peer" on its next write(fd) or read(fd).

      See: Github Issue: http: not proxying 413 correctly #2929

      +----------------+      +-----------------+
      |Listner A (8000)|+---->|Listener B (8080)|+----> (dummy backend)
      +----------------+      +-----------------+

      This issue is happening, because Envoy acting as a server (i.e. listener B in @lizan’s example) closes downstream connection with pending (unread) data, which results in TCP RST packet being sent downstream.

      Depending on the timing, downstream (i.e. listener A in @lizan’s example) might be able to receive and proxy complete HTTP response before receiving TCP RST packet (which erases low-level TCP buffers), in which case client will receive response sent by upstream (413 Request Body Too Large in this case, but this issue is not limited to that response code), otherwise client will receive 503 Service Unavailable response generated by listener A (which actually isn’t the most appropriate response code in this case, but that’s a separate issue).

      The common solution for this problem is to half-close downstream connection using ::shutdown(fd_, SHUT_WR) and then read downstream until EOF (to confirm that the other side received complete HTTP response and closed connection) or short timeout.

A practical way to reduce this race condition is to delay closing the socket, and Envoy already has a configuration for that: delayed_close_timeout.

  • The downstream does not realize that Envoy has already closed an HTTP keep-alive connection, and reuses it.

    This is the keep-alive reuse case mentioned above. Envoy has already called close(fd) in the kernel, moving the socket into FIN_WAIT_1/FIN_WAIT_2 and sending a FIN. But the downstream has not received the FIN, or has received it without the application noticing, and meanwhile reuses this keep-alive connection to send an HTTP request. At the TCP level this looks like a half-closed connection, and the side that has not closed is indeed allowed to send data to its peer. But the kernel that has already called close(fd) (the Envoy side) drops any incoming packet and replies with an RST (with the fd closed, no process could read the data anyway). The unlucky downstream then sees an error like "Connection reset by peer".

    • A practical way to reduce this race condition: give the peer in front of Envoy a timeout smaller than Envoy's, so that the peer, not Envoy, initiates the connection close.
Mitigations in Envoy's implementation
Mitigating the server prematurely/early closing the connection

Github Issue: http: not proxying 413 correctly #2929

In the case envoy is proxying large HTTP request, even upstream returns 413, the client of proxy is getting 503.

Github PR: network: delayed conn close #4382, which added the delayed_close_timeout configuration.

Mitigate client read/close race issues on downstream HTTP connections by adding a new connection
close type ‘FlushWriteAndDelay‘. This new close type flushes the write buffer on a connection but
does not immediately close after emptying the buffer (unlike ConnectionCloseType::FlushWrite).

A timer has been added to track delayed closes for both ‘FlushWrite‘ and ‘FlushWriteAndDelay‘. Upon
triggering, the socket will be closed and the connection will be cleaned up.

Delayed close processing can be disabled by setting the newly added HCM ‘delayed_close_timeout
config option to 0.

Risk Level: Medium (changes common case behavior for closing of downstream HTTP connections)
Testing: Unit tests and integration tests added.

But while mitigating the problem, the PR above also hurt performance:

Github Issue: HTTP/1.0 performance issues #19821

I was about to say it’s probably delay-close related.

So HTTP in general can frame the response with one of three ways: content length, chunked encoding, or frame-by-connection-close.

If you don’t haven an explicit content length, HTTP/1.1 will chunk, but HTTP/1.0 can only frame by connection close(FIN).

Meanwhile, there’s another problem which is that if a client is sending data, and the request has not been completely read, a proxy responds with an error and closes the connection, many clients will get a TCP RST (due to uploading after FIN(close(fd))) and not actually read the response. That race is avoided with delayed_close_timeout.

It sounds like Envoy could do better detecting if a request is complete, and if so, framing with immediate close and I can pick that up. In the meantime if there’s any way to have your backend set a content length that should work around the problem, or you can lower delay close in the interim.

So a further fix was needed:

Github PR: http: reduce delay-close issues for HTTP/1.1 and below #19863

Skipping delay close for:

  • HTTP/1.0 framed by connection close (as it simply reduces time to end-framing)
  • as well as HTTP/1.1 if the request is fully read (so there’s no FIN-RST race)

Addresses the Envoy-specific parts of #19821
Runtime guard: envoy.reloadable_features.skip_delay_close

It also appears in the Envoy 1.22.0 release notes. Note again that, in order not to hurt performance, delayed_close_timeout does not take effect in many situations:

http: avoiding delay-close for:

  • HTTP/1.0 responses framed by connection: close
  • as well as HTTP/1.1 if the request is fully read.

This means for responses to such requests, the FIN will be sent immediately after the response. This behavior can be temporarily reverted by setting envoy.reloadable_features.skip_delay_close to false. If clients are seen to be receiving sporadic partial responses and flipping this flag fixes it, please notify the project immediately.

Envoy sends a request on an upstream connection that the upstream has already closed

Github Issue: Envoy (re)uses connection after receiving FIN from upstream #6815
With Envoy serving as HTTP/1.1 proxy, sometimes Envoy tries to reuse a connection even after receiving FIN from upstream. In production I saw this issue even with couple of seconds from FIN to next request, and Envoy never returned FIN ACK (just FIN from upstream to envoy, then PUSH with new HTTP request from Envoy to upstream). Then Envoy returns 503 UC even though upstream is up and operational.

Istio: 503’s with UC’s and TCP Fun Times

A sequence diagram of a classic scenario, from https://medium.com/@phylake/why-idle-timeouts-matter-1b3f7d4469fe

The Reverse Proxy in the diagram can be read as Envoy.

In essence:

  1. The upstream peer calls close(fd) to close the socket. From then on, any data its kernel receives on this TCP connection will be discarded and answered with an RST.
  2. The upstream peer has sent a FIN.
  3. The upstream socket's state moves to FIN_WAIT_1 or FIN_WAIT_2.

On the Envoy side, there are two possibilities:

  • The socket in Envoy's kernel has already been moved to CLOSE_WAIT by the peer's FIN, but the Envoy program (user space) has not caught up.
  • Or Envoy's kernel has not received the FIN yet, due to network latency or similar issues.

Yet the Envoy program reuses this socket and sends (write(fd)) an HTTP request (assume it is split into multiple IP packets).

Two things can then happen:

  • When one of the IP packets reaches the upstream peer, the upstream replies with an RST, so Envoy's subsequent socket writes may fail with an error like "Upstream connection reset".
  • Because socket writes go through a send buffer and are asynchronous, Envoy may only notice that the socket has been closed in the epoll event cycle after the RST arrives, when an EV_CLOSED event fires. Again, the error looks like "Upstream connection reset".

The Envoy community has discussed this problem; it can only be made less likely, not avoided entirely:

Github Issue: HTTP1 conneciton pool attach pending request to half-closed connection #2715
The HTTP1 connection pool attach pending request when a response is complete. Though the upstream server may already closed the connection, this will result the pending request attached to it end up with 503.

Countermeasures at the protocol and configuration level:

HTTP/1.1 has this inherent timing issue. As I already explained, this is solved in practice by

a) setting Connection: Closed when closing a connection immediately and

b) having a reasonable idle timeout.

The feature @ramaraochavali is adding will allow setting the idle timeout to less than upstream idle timeout to help with this case. Beyond that, you should be using router level retries.

At the end of the day, because of this design limitation in HTTP/1.1, the problem cannot be avoided completely. For idempotent operations you still have to rely on a retry mechanism.
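
As an illustration, an Envoy route-level retry policy that retries on connection resets looks roughly like the sketch below (the cluster name and values are made up; the Istio EnvoyFilter version of the same idea appears later in this article):

# Sketch: retry requests whose upstream connection was reset.
route:
  cluster: my_upstream            # hypothetical cluster name
  retry_policy:
    retry_on: reset,connect-failure
    num_retries: 2
    per_try_timeout: 2s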

Mitigations in Envoy's implementation

In terms of implementation, the Envoy community once considered making an upstream connection wait several epoll event cycles before being reused, so that connection-state-update events would be seen first. That approach was not considered good:

Github PR: Delay connection reuse for a poll cycle to catch closed connections.#7159(Not Merged)

So poll cycles are not an elegant way to solve this, when you delay N cycles, EOS may arrive in N+1-th cycle. The number is to be determined by the deployment so if we do this it should be configurable.

As noted in #2715, a retry (at Envoy level or application level) is preferred approach, #2715 (comment). Regardless of POST or GET, the status code 503 has a retry-able semantics defined in RFC 7231.

In the end, the chosen direction was to delay connection re-use (an optional re-use delay):

All well behaving HTTP/1.1 servers indicate they are going to close the connection if they are going to immediately close it (Envoy does this). As I have said over and over again here and in the linked issues, this is well known timing issue with HTTP/1.1.

So to summarize, the options here are to:

Drop this change
Implement it correctly with an optional re-use delay timer.

The final approach was:

Github PR: http: delaying attach pending requests #2871(Merged)

Another approach to #2715, attach pending request in next event after onResponseComplete.

In other words, an upstream connection is limited to carrying only one HTTP request per epoll event cycle; a connection cannot be re-used by multiple HTTP requests within the same epoll event cycle. This reduces the chance that a connection whose kernel state is already CLOSE_WAIT (a FIN has been received) but which Envoy user space has not yet noticed gets re-used to send a request.

https://github.com/envoyproxy/envoy/pull/2871/files

@@ -209,25 +215,48 @@ void ConnPoolImpl::onResponseComplete(ActiveClient& client) {
     host_->cluster().stats().upstream_cx_max_requests_.inc();
     onDownstreamReset(client);
   } else {
-    processIdleClient(client);
     // Upstream connection might be closed right after response is complete. Setting delay=true
     // here to attach pending requests in next dispatcher loop to handle that case.
     // https://github.com/envoyproxy/envoy/issues/2715
+    processIdleClient(client, true);
   }
 }

Some commentary: https://github.com/envoyproxy/envoy/issues/23625#issuecomment-1301108769

There’s an inherent race condition that an upstream can close a connection at any point and Envoy may not yet know, assign it to be used, and find out it is closed. We attempt to avoid that by returning all connections to the pool to give the kernel a chance to inform us of FINs but can’t avoid the race entirely.

As for implementation details, this Github PR itself contained a bug, which was fixed later:
Github Issue: Missed upstream disconnect leading to 503 UC#6190

Github PR: http1: enable reads when final pipeline response received#6578

A side note: back in 2019 Istio maintained its own fork of the Envoy source and fixed this problem there: Istio Github PR: Fix connection reuse by delaying a poll cycle. #73. In the end, though, Istio went back to upstream Envoy, adding only the necessary native C++ Envoy filters.

Mitigation via Istio configuration

Istio Github Issue: Almost every app gets UC errors, 0.012% of all requests in 24h period#13848

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: passthrough-retries
  namespace: myapp
spec:
  workloadSelector:
    labels:
      app: myapp
  configPatches:
  - applyTo: HTTP_ROUTE
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 8080
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
            subFilter:
              name: "envoy.filters.http.router"
    patch:
      operation: MERGE
      value:
        route:
          retry_policy:
            retry_back_off:
              base_interval: 10ms
            retry_on: reset
            num_retries: 2

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: qqq-destination-rule
spec:
  host: qqq.aaa.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 3s
        maxRetries: 3

Critical states and exception logic of connection close on Linux

Excerpted from my site: https://devops-insider.mygraphql.com/zh_CN/latest/kernel/network/socket/socket-close/socket-close.html

If you have stuck with me this far, congratulations: we have reached the good part.

Closing a socket sounds like the simplest thing in the world: isn't it just a call to close(fd)? Let me take it slowly.

A model for socket close

Before analyzing something, I like to first build a model of the parties involved and then refine it as the analysis goes on. That keeps the analysis reasonably complete and lets cause and effect be deduced and checked. Best of all, a model can be reused.
Studying socket close is no exception.

Figure: a model of the pieces involved in socket close

Open with Draw.io

The figure above shows machine A and machine B with a TCP socket established between them. Taking machine A as an example, let's walk through the model:

From the bottom up, the layers are:

  • soft-IRQ / process kernel context, handling IP packet reception
  • the socket object
  • the socket object's send buffer
  • the socket object's recv buffer
  • the process does not access the socket object directly; a VFS layer sits in between, and the process reads and writes the socket via a file descriptor (fd) handle

    • one socket can be referenced by multiple fds
  • the process identifies an fd by an integer id and passes that id as a parameter when calling into the kernel

    • each fd has a read channel and a write channel that can be closed independently

With the static elements of the model covered, here are some of the model's rules. They are enforced in kernel code and will be referenced below:

  • Closing a socket fd's read channel

    • When a socket fd's read channel is closed, if the recv buffer holds already-ACKed data that the application (user space) has not read, an RST is sent to the peer. Details: see "TCP RST and unread socket recv buffer" in my devops-insider notes.
    • After a socket fd's read channel is closed, any further data received from the peer (TCP half-close) is discarded and ruthlessly answered with an RST.

Relevant TCP protocol background

Here I only want to cover the parts related to closing.

Figure: typical TCP close sequence – from [TCP.IP.Illustrated.Volume.1.The.Protocols]

TCP Half-Close

TCP is a full-duplex connection. By design, the protocol supports closing the stream in one direction for an extended period while keeping the stream in the other direction open. There is even a dedicated term for it: Half-Close.

Figure: TCP half-close sequence – from [TCP.IP.Illustrated.Volume.1.The.Protocols]

Put plainly: once one side decides it will not send any more data, it can close its own write stream first and send a FIN to the peer, telling it "I will not be sending anything else".

Closing a socket fd

According to [W. Richard Stevens, Bill Fenner, Andrew M. Rudoff – UNIX Network Programming, Volume 1], there are two functions for closing a socket fd:

  • close(fd)
  • shutdown(fd)

close(fd)

[W. Richard Stevens, Bill Fenner, Andrew M. Rudoff – UNIX Network Programming, Volume 1] – 4.9 close Function

Since the reference count was still greater than 0, this call to close did not
initiate TCP’s four-packet connection termination sequence.

Meaning: a socket keeps a reference count of the fds referring to it. close decrements that count, and only when it reaches 0 does the kernel start the connection termination sequence (described below).
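
A tiny sketch of that reference-count behavior (my own illustration, not from the book; error handling omitted):

/* close() only starts TCP termination when the last fd referring to
 * the socket is closed (e.g. after dup() or fork()). */
#include <unistd.h>

void refcount_demo(int connected_fd) {
    int fd2 = dup(connected_fd);  /* the underlying socket now has two fds */
    close(connected_fd);          /* refcount 2 -> 1: no FIN is sent yet   */
    close(fd2);                   /* refcount 1 -> 0: FIN / termination    */
}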

The default action of close with a TCP socket is to mark the socket as closed and return to the process immediately. The fd is no longer usable by the process: It cannot be used as an argument to read or write. But, TCP will try to send any data that is already queued to be sent to the other end, and after this occurs, the normal TCP connection termination sequence takes place.
we will describe the SO_LINGER socket option, which lets us change this default action with a TCP socket.

Meaning: by default the call returns immediately. The closed fd can no longer be read or written. In the background the kernel starts the connection termination sequence: it keeps sending whatever is in the socket send buffer and, once that is done, finally sends a FIN to the peer.

Let's leave SO_LINGER aside for now.

close(fd) actually closes both the fd's read channel and its write channel. So, according to the model rule above:

after a socket fd's read channel is closed, any further data received from the peer is discarded and ruthlessly answered with an RST.

So if a socket has been closed with close(fd) (the call has returned) and the peer keeps sending data, either because it has not yet received the FIN, or because it did receive the FIN but treats the connection as merely half-closed, the kernel will ruthlessly answer with an RST:

Figure: a closed socket receiving data answers with an RST – from [UNIX Network Programming, Volume 1] – SO_LINGER Socket Option

This is a "flaw" in the design of the TCP protocol. A FIN can only tell the peer "I am closing my outbound stream"; there is no way to tell the peer "I do not want to receive any more data, I am closing my inbound stream". Yet kernel implementations can close the inbound stream, and since that closure cannot be communicated at the TCP level, misunderstandings arise.

shutdown(fd)

As programmers, let's start with the function documentation.

#include <sys/socket.h>
int shutdown(int sockfd, int howto);

[UNIX Network Programming, Volume 1] – 6.6 shutdown Function

The action of the function depends on the value of the howto argument.

  • SHUT_RD
    The read half of the connection is closed—No more data can be
    received on the socket and any data currently in the socket receive buffer is discarded. The process can no longer issue any of the read functions on the socket. Any data received after this call for a TCP socket is acknowledged and then silently discarded.
  • SHUT_WR
    The write half of the connection is closed—In the case of TCP, this is
    called a half-close (Section 18.5 of TCPv1). Any data currently in the socket send buffer will be sent, followed by TCP’s normal connection termination sequence. As we mentioned earlier, this closing of the write half is done regardless of whether or not the socket descriptor’s reference count is currently greater than 0. The process can no longer issue any of the write functions on the socket.
  • SHUT_RDWR

    The read half and the write half of the connection are both closed — This is equivalent to calling shutdown twice: first with SHUT_RD and then with SHUT_WR.

[UNIX Network Programming, Volume 1] – 6.6 shutdown Function

The normal way to terminate a network connection is to call the close function. But,
there are two limitations with close that can be avoided with shutdown:

  1. close decrements the descriptor’s reference count and closes the socket only if
    the count reaches 0. We talked about this in Section 4.8.

    With shutdown, we can initiate TCP’s normal connection termination sequence (the four segments
    beginning with a FIN), regardless of the reference count.

  2. close terminates both directions of data transfer, reading and writing. Since a TCP connection is full-duplex, there are times when we want to tell the other end that we have finished sending, even though that end might have more data to send us (i.e., a half-closed TCP connection).

Meaning: shutdown(fd) can close the fd in a chosen direction of the duplex. There are generally two usage scenarios:

  • close only the outbound (write) direction, achieving a TCP half-close
  • close both the outbound and inbound directions (write & read)

Figure: TCP half-close with shutdown(fd) – from [UNIX Network Programming, Volume 1]
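
The "half-close, then read until EOF" pattern quoted earlier in this article (the suggested fix in the 413 proxying issue) looks roughly like this in C. This is a sketch that assumes a blocking socket; a real implementation would bound the read loop with a timeout:

/* Graceful close: send our FIN first, drain whatever the peer still sends,
 * and only release the fd after seeing the peer's EOF. Because the recv
 * buffer is drained, the kernel has no unread data to answer with an RST. */
#include <unistd.h>
#include <sys/socket.h>

void graceful_close(int fd) {
    char buf[4096];

    shutdown(fd, SHUT_WR);                  /* half-close: we stop sending */
    while (read(fd, buf, sizeof(buf)) > 0)  /* read until EOF (peer's FIN) */
        ;                                   /* discard the drained data    */
    close(fd);                              /* now actually release the fd */
}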

SO_LINGER

[UNIX Network Programming, Volume 1] – SO_LINGER Socket Option

This option specifies how the close function operates for a connection-oriented protocol (e.g., for TCP and SCTP, but not for UDP). By default, close returns immediately, but if there is any data still remaining in the socket send buffer, the system will try to deliver the data to the peer. So the default behavior of close(fd) is:

  • the call returns immediately; in the background the kernel runs the normal close sequence: asynchronously send whatever is in the socket send buffer, then finally send the FIN.
  • if there is data in the socket receive buffer, it is discarded.

SO_LINGER means, as the name suggests, to linger (hang around).

SO_LINGER is a socket option, defined as follows:

struct linger {
    int l_onoff;  /* 0=off, nonzero=on */
    int l_linger; /* linger time, POSIX specifies units as seconds */
};

SO_LINGER affects the behavior of close(fd) as follows:

  1. If l_onoff is 0, the option is turned off. The value of l_linger is ignored and
    the previously discussed TCP default applies: close returns immediately. (This is the default behavior.)
  2. If l_onoff is nonzero

    • l_linger is zero, TCP aborts the connection when it is closed (pp. 1019 – 1020 of TCPv2). That is, TCP discards any data still remaining in the socket send buffer and sends an RST to the peer, not the normal four-packet connection termination sequence. This avoids TCP’s TIME_WAIT state, but in doing so, leaves open the possibility of another incarnation of this connection being created within 2MSL seconds and having old duplicate segments from the just-terminated connection being incorrectly delivered to the new incarnation. (That is, RST is used to reclaim the local port quickly, with side effects, of course.)
    • l_linger is nonzero, then the kernel will linger when the socket is closed (p. 472 of TCPv2). That is, if there is any data still remaining in the socket send buffer, the process is put to sleep until either: (that is, the process calling close(fd) waits until the data is sent and ACKed, or until the timeout)
    • all the data is sent and acknowledged by the peer TCP
    • or the linger time expires.

    If the socket has been set to nonblocking, it will not wait for the close to complete, even if the linger time is nonzero.

    When using this feature of the SO_LINGER option, it is important for the application to check the return value from close, because if the linger time expires before the remaining data is sent and acknowledged, close returns EWOULDBLOCK and any remaining data in the send buffer is discarded. (That is, with lingering enabled, if the ACK does not arrive before the timeout, data remaining in the socket send buffer may be lost.)
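
A sketch of how the two non-default modes above are switched on (error handling omitted):

/* The two non-default SO_LINGER modes described above. */
#include <string.h>
#include <sys/socket.h>

/* l_onoff=1, l_linger=0: abortive close. close() discards both buffers and
 * sends an RST instead of a FIN, skipping TIME_WAIT. */
void set_abortive_close(int fd) {
    struct linger lg;
    memset(&lg, 0, sizeof lg);
    lg.l_onoff  = 1;
    lg.l_linger = 0;
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}

/* l_onoff=1, l_linger=5: lingering close. On a blocking socket, close()
 * waits up to 5 seconds for the send buffer to be sent and ACKed. */
void set_lingering_close(int fd) {
    struct linger lg;
    memset(&lg, 0, sizeof lg);
    lg.l_onoff  = 1;
    lg.l_linger = 5;
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}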

Summary of close/shutdown and SO_LINGER

[UNIX Network Programming, Volume 1] – SO_LINGER Socket Option

  • shutdown, SHUT_RD
    No more receives can be issued on socket; process can still send on socket; socket receive buffer discarded; any further data received is discarded; no effect on socket send buffer.

  • shutdown, SHUT_WR (this is the most common way shutdown is used)
    No more sends can be issued on socket; process can still receive on socket; contents of socket send buffer sent to other end, followed by normal TCP connection termination (FIN); no effect on socket receive buffer.

  • close, l_onoff = 0 (default)
    No more receives or sends can be issued on socket; contents of socket send buffer sent to other end. If descriptor reference count becomes 0: normal TCP connection termination (FIN) sent following data in send buffer; socket receive buffer discarded. (That is, data in the recv buffer not yet read by user space is discarded. Note: on Linux, if already-ACKed data has not been read by user space, an RST is sent to the peer.)

  • close, l_onoff = 1, l_linger = 0
    No more receives or sends can be issued on socket. If descriptor reference count becomes 0: RST sent to other end; connection state set to CLOSED (no TIME_WAIT state); socket send buffer and socket receive buffer discarded.

  • close, l_onoff = 1, l_linger != 0
    No more receives or sends can be issued on socket; contents of socket send buffer sent to other end. If descriptor reference count becomes 0: normal TCP connection termination (FIN) sent following data in send buffer; socket receive buffer discarded; and if the linger time expires before the connection reaches CLOSED, close returns EWOULDBLOCK and any remaining data in the send buffer is discarded. (That is, with lingering enabled, if the ACK does not arrive before the timeout, data remaining in the socket send buffer may be lost.)

Some good further reading:

  • An article from 2016: Resetting a TCP connection and SO_LINGER

Wrapping up

As long as the long-term direction is broadly right, sustained investment and focus may not solve today's problems or pay off right away. But one day they will surprise you. To every engineer out there.
