Istio: half-close connection leaks and outbound connection failures under specific conditions


Four years ago, the feature of Istio that most impressed me was its transparent proxying via application-agnostic traffic interception. It greatly lowered the barrier to entering the Service Mesh era at low cost, and it is one of the main reasons many companies adopted Istio.

As the saying goes:

Your greatest strength is sometimes also the source of your greatest weakness.

By default, Istio uses an iptables REDIRECT rule to DNAT outbound traffic to the sidecar.

If you know NAT reasonably well, you also know that NAT is an imperfect technology with obvious drawbacks (what technology has none?), yet one widely used in practice, including in your home router.

In the Linux kernel, NAT is generally implemented with conntrack plus iptables NAT rules. As one implementation of NAT, it inherits the problems typical of NAT.


Problems

Two problems are described below. Both are triggered only under specific conditions:

  1. TCP Proxy half-closed connection leak for 1 hour in some scenarios
  2. App outbound connections time out because the app selected an ephemeral port that collides with an existing socket on the 15001 (outbound) listener

Problem 1 is one of the triggers of Problem 2.

Since the earlier discussion with the community was mainly in English and has not yet been translated, the description below remains mostly in English.

TCP Proxy half-closed connection leak for 1 hour in some scenarios

For details with diagrams, see the section "TCP Proxy half-closed connection leak for 1 hour in some scenarios" in my book Istio Insider.

The sidecar intercepts and TCP-proxies all outbound TCP connections by default:
(app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service

  1. When upstream-tcp-service wants to disconnect, it sends a FIN.
  2. The sidecar receives the FIN and calls the shutdown(fd, ENVOY_SHUT_WR) syscall on the downstream socket, forwarding the FIN to the app while keeping the connection half-closed. The downstream socket is now in FIN_WAIT2 state.
  3. The conntrack table starts a 60s timer (/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait). After it expires, the DNAT entry is removed.
  4. The app receives the FIN.
  5. In normal scenarios, the app calls close() quickly after receiving the FIN; this closes the socket and replies with a FIN, and then both sockets in the sidecar are closed.
  6. BUT if the app calls close() after 60s, the FIN sent by the app is not delivered to the sidecar, because the conntrack DNAT entry was already removed at the 60s mark.

As a result, two sockets are leaked on the sidecar.
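The forwarding behavior in the steps above can be sketched as a tiny TCP proxy that pumps each direction independently and forwards an EOF with shutdown(SHUT_WR). This is a minimal illustration, not Envoy's actual code; socketpairs stand in for the real app/sidecar/upstream connections:

```python
import socket
import threading

# Minimal sketch (not Envoy's actual code) of the behavior described above:
# a TCP proxy pumps each direction independently and forwards an EOF (FIN)
# with shutdown(SHUT_WR), keeping the connection half-closed, like Envoy's
# shutdown(fd, ENVOY_SHUT_WR).

def pump(src, dst):
    while True:
        data = src.recv(4096)
        if not data:                      # peer sent FIN
            dst.shutdown(socket.SHUT_WR)  # forward the FIN, keep reading
            return
        dst.sendall(data)

def proxy(downstream, upstream):
    threads = [
        threading.Thread(target=pump, args=(downstream, upstream)),
        threading.Thread(target=pump, args=(upstream, downstream)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                          # both directions saw FIN
    downstream.close()
    upstream.close()

# socketpairs stand in for app<->sidecar and sidecar<->upstream connections
app, down = socket.socketpair()
up, server = socket.socketpair()
t = threading.Thread(target=proxy, args=(down, up))
t.start()

app.sendall(b"ping")
forwarded = server.recv(4096)        # b"ping": data is proxied
server.shutdown(socket.SHUT_WR)      # upstream disconnects (step 1)
eof = app.recv(4096)                 # b"": FIN forwarded to app (steps 2-4)
app.sendall(b"late data")            # the reverse direction is still open
late = server.recv(4096)             # b"late data"
app.close()                          # app finally replies with its own FIN
t.join()
```

Note that between `server.shutdown(...)` and `app.close()` the downstream connection sits exactly in the half-closed FIN_WAIT2 state the leak depends on.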

We know that, generally speaking, a FIN_WAIT2 socket has a timer that closes it: /proc/sys/net/ipv4/tcp_fin_timeout.

But that timer only applies to orphaned sockets, i.e. sockets already close()d by the application. For a half-closed FIN_WAIT2 socket (shutdown(fd, ENVOY_SHUT_WR)) still owned by the sidecar process, no such timer exists.

The good news is that the Envoy TCPProxy filter has an idle_timeout setting, which defaults to 1 hour. So the problem above leaves a 1-hour leak window before the sockets are garbage-collected.

App outbound connections time out because the app selected an ephemeral port that collides with an existing socket on the 15001 (outbound) listener

For details with diagrams, see the section "App outbound connecting timed out because App selected a ephemeral port which collisions with the existing socket on 15001(outbound) listener" in my book Istio Insider.

The sidecar intercepts and TCP-proxies all outbound TCP connections by default:
(app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service

But in some scenarios, the app just gets a connect-timed-out error when connecting to the sidecar's 15001 (outbound) listener.

Scenarios:

  1. The sidecar has a half-open connection to the app, e.g.:

    $ ss
    tcp FIN-WAIT-2 0 0 127.0.0.1:15001  172.29.73.7:44410(POD_IP:ephemeral_port)

    This can happen due to, e.g.: TCP Proxy half-closed connection leak for 1 hour in some scenarios #43297

    There is no track entry in the conntrack table, because the nf_conntrack_tcp_timeout_close_wait timer has expired.

  2. The app invokes the connect(sockfd, peer_addr) syscall; the kernel allocates an ephemeral port (44410 in this case), binds the new socket to it, and sends a SYN packet to the peer.
  3. The SYN packet reaches conntrack, which creates a track entry in the conntrack table:

    $ conntrack -L
    tcp  6 108 SYN_SENT src=172.29.73.7 dst=172.21.206.198 sport=44410 dport=7777 src=127.0.0.1 dst=172.29.73.7 sport=15001 dport=44410
  4. The SYN packet is DNATed to 127.0.0.1:15001.
  5. The SYN packet reaches the already-existing FIN-WAIT-2 127.0.0.1:15001 172.29.73.7:44410 socket, so the sidecar replies to the app with a TCP challenge ACK (whose seq-no comes from the old FIN-WAIT-2 connection).
  6. The app replies to the TCP challenge ACK with a RST (whose seq-no is taken from the challenge ACK).
  7. Conntrack receives the RST and checks it. In some kernel versions, conntrack marks the RST invalid because its seq-no is out of window for the track entry created in step 3.
  8. The app retransmits the SYN but never receives the expected SYN/ACK reply. A connect timeout occurs in app user space.
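For reference, the conntrack entry from step 3 consists of two tuples: the original direction (app to service) and the reply direction, and it is the reply tuple that reveals the DNAT to the sidecar's 15001 listener. A small parsing sketch, using the `conntrack -L` line shown above:

```python
import re

# The conntrack entry from step 3: the first src/dst/sport/dport quad is
# the original direction (app -> service); the second is the reply
# direction, which encodes the DNAT: replies come from 127.0.0.1:15001.
line = ("tcp  6 108 SYN_SENT src=172.29.73.7 dst=172.21.206.198 "
        "sport=44410 dport=7777 "
        "src=127.0.0.1 dst=172.29.73.7 sport=15001 dport=44410")

pairs = re.findall(r"(\w+)=([\w.]+)", line)
orig = dict(pairs[:4])    # original-direction tuple
reply = dict(pairs[4:])   # reply-direction tuple (post-DNAT)

print("app connected to :", orig["dst"], orig["dport"])
print("replies come from:", reply["src"], reply["sport"])
```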

Different kernel versions have different packet validation rules in step 7:

RST packet marked as invalid:

    SUSE Linux Enterprise Server 15 SP4, kernel 5.14.21-150400.24.21-default:

    # cat /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst
    0

RST packet passes the check and is NATed:

    Ubuntu 20.04.2, kernel 5.4.0-137-generic:

    # cat /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst
    cat: /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst: No such file or directory

It seems related to the kernel patch [Add tcp_ignore_invalid_rst sysctl to allow to disable out of segment RSTs](https://github.com/torvalds/l…), which was merged after kernel v5.14.
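A quick way to tell which behavior a given kernel can exhibit is to probe for that sysctl; a minimal sketch, using the path from the outputs above:

```python
import os

# Sketch: probe whether this kernel exposes the
# nf_conntrack_tcp_ignore_invalid_rst sysctl added by the patch above.
PATH = "/proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst"

def probe(path=PATH):
    """Return the sysctl value as a string, or None if the kernel lacks it."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()

value = probe()
print("sysctl value:", value)  # None on kernels without the patch
```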

The good news is that the problem is fixed in kernel v6.2-rc7 by "netfilter: conntrack: handle tcp challenge acks during connection reuse":

When a connection is re-used, following can happen:
[connection starts to close, fin sent in either direction]
 > syn   # initator quickly reuses connection
 < ack   # peer sends a challenge ack
 > rst   # rst, sequence number == ack_seq of previous challenge ack
 > syn   # this syn is expected to pass

Problem is that the rst will fail window validation, so it gets
tagged as invalid.

If ruleset drops such packets, we get repeated syn-retransmits until
initator gives up or peer starts responding with syn/ack.

But on some kernel versions and in some scenarios, it remains an issue.
