
About TCP: possibly the most complete guide to ss, the TCP connection health metrics tool

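  • Foreword
  • Why TCP connection health matters
  • How to inspect TCP connection health

    • The container era
    • The once-mysterious ss
    • The even more mysterious undocumented metrics
  • A brief introduction to ss
  • Field descriptions

    • Recv-Q and Send-Q
    • Basic information
    • MTU/MSS related

      • mss
      • advmss
      • pmtu
      • rcvmss
    • Flow control

      • cwnd
      • ssthresh
    • Retransmission (retrans)

      • retrans
      • bytes_retrans
    • Timers
    • Other

      • app_limited
  • Special operations

    • specified network namespace
    • kill socket
    • Watching connection close events
    • Filters
  • How it works

    • Netlink
    • NETLINK_INET_DIAG

      • idiag_ext
      • Netlink in deep
  • References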

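Foreword

I am not a networking expert. After years of troubleshooting network problems in production and test environments, I no longer wanted to just muddle through, so I am writing down what I have learned. My understanding of the TCP stack implementation is limited, so treat this content as reference only.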

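Why TCP connection health matters

TCP connection health covers at least:

  • TCP retransmission statistics, a bellwether of network quality
  • MTU/MSS size and congestion window size, key indicators of bandwidth and throughput
  • Statistics on the send/receive queues and buffers at each layer

I discussed this in "From Locating Performance Problems, to Performance Models, to TCP: Why Still Learn TCP in the Age of Microservices and Cloud Native, Part 1", so I will not repeat it here.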

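How to inspect TCP connection health

Linux exposes two kinds of TCP connection health metrics:

  • Host-wide statistics

    Network health metrics aggregated over the whole host (strictly speaking, over the whole network namespace or container). View them with nstat.

  • Per-connection statistics

    The kernel keeps statistics for every TCP connection. View them with ss.

This article covers only the per-connection statistics. For the host-wide statistics, see the separate article on that topic.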

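The container era

If you have looked into how containers work on Linux, you know that at the kernel level it all comes down to namespaces plus cgroups. The TCP connection health metrics described above are namespace aware too: each network namespace keeps its own statistics. When working with containers, be clear about what is namespace aware and what is not.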

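The once-mysterious ss

Most people have probably used netstat. Because netstat performs poorly when there are many connections, it has gradually been replaced by ss. If you are curious how ss is implemented, jump to the "How it works" section of this article.

Reference: https://www.net7.be/blog/arti…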

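The even more mysterious undocumented metrics

A brief introduction to ss

ss is a tool for viewing detailed per-connection statistics. Example: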

$ ss -taoipnm
State        Recv-Q   Send-Q      Local Address:Port         Peer Address:Port  Process                                                                         
ESTAB 0      0               159.164.167.179:55124           149.139.16.235:9042  users:(("envoy",pid=81281,fd=50))
         ts sack cubic wscale:9,7 rto:204 rtt:0.689/0.065 ato:40 mss:1448 pmtu:9000 rcvmss:610 advmss:8948 cwnd:10 bytes_sent:3639 bytes_retrans:229974096 bytes_acked:3640 bytes_received:18364 segs_out:319 segs_in:163 data_segs_out:159 data_segs_in:159 send 168.1Mbps lastsnd:16960 lastrcv:16960 lastack:16960 pacing_rate 336.2Mbps delivery_rate 72.4Mbps delivered:160 app_limited busy:84ms retrans:0/25813 rcv_rtt:1 rcv_space:62720 rcv_ssthresh:56588 minrtt:0.16

See the manual for details: https://man7.org/linux/man-pa…
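To make the sample output easier to work with, here is a minimal Python sketch that splits one detail line of `ss -i` output into a dictionary. The function name `parse_ss_info` and the handling rules are my own assumptions based on the sample above (flags such as `ts`, `key:value` pairs, and the space-separated rate fields `send`, `pacing_rate`, `delivery_rate`); it is not an official parser and may miss fields that other ss versions print.

```python
def parse_ss_info(line):
    """Split an `ss -i` detail line into a dict (hypothetical helper).

    - "key:value" tokens become {key: value} (value kept as a string)
    - "send 168.1Mbps" style tokens become {"send": "168.1Mbps"}
    - bare tokens such as "ts" or "app_limited" become {token: True}
    """
    rate_keys = ("send", "pacing_rate", "delivery_rate")
    tokens = line.split()
    fields = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if ":" in tok:
            key, value = tok.split(":", 1)
            fields[key] = value
        elif tok in rate_keys and i + 1 < len(tokens):
            fields[tok] = tokens[i + 1]
            i += 1  # skip the value we just consumed
        else:
            fields[tok] = True
        i += 1
    return fields


sample = ("ts sack cubic wscale:9,7 rto:204 rtt:0.689/0.065 mss:1448 "
          "cwnd:10 send 168.1Mbps app_limited retrans:0/25813")
info = parse_ss_info(sample)
print(info["rtt"], info["send"], info["retrans"])
```

Values are left as strings on purpose, because fields such as `rtt` and `retrans` pack two numbers into one token.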

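Field descriptions

⚠️ I am not a networking expert. The descriptions below are the result of my recent study and may contain mistakes. Use them with caution.

Recv-Q and Send-Q

  • When the socket is in the LISTEN state (e.g. ss -lnt)
    Recv-Q: the current size of the accept queue, i.e. the number of TCP connections that have completed the three-way handshake and are waiting for the server to accept() them
    Send-Q: the maximum length of the accept queue
  • When the socket is not in the LISTEN state (e.g. ss -nt)
    Recv-Q: the number of bytes received but not yet read by the application
    Send-Q: the number of bytes sent but not yet acknowledged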

Recv-Q

Established: The count of bytes not copied by the user program connected to this socket.

Listening: Since Kernel 2.6.18 this column contains the current syn backlog.

Send-Q

Established: The count of bytes not acknowledged by the remote host.

Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
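For a listening socket, the two columns can be combined into a simple fill level. The sketch below assumes the reading given in the list above (Recv-Q is the current accept queue length, Send-Q is its maximum); the helper name `accept_queue_usage` and the sample numbers are hypothetical.

```python
def accept_queue_usage(recv_q, send_q):
    """Fraction of a LISTEN socket's accept queue that is in use.

    recv_q: Recv-Q column (connections waiting for accept())
    send_q: Send-Q column (maximum accept queue length)
    """
    if send_q <= 0:
        raise ValueError("Send-Q must be positive for a listening socket")
    return recv_q / send_q


# Hypothetical listener: 32 connections waiting, backlog limit 128.
print(accept_queue_usage(32, 128))  # 0.25
```

A value that stays close to 1.0 would suggest the application is not calling accept() fast enough.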

Basic information

  • ts: whether the connection uses the TCP timestamp option.

    show string “ts” if the timestamp option is set

  • sack: whether SACK is enabled on the connection.

    show string “sack” if the sack option is set

  • cubic: the name of the congestion control algorithm.

    congestion algorithm name

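  • wscale:<snd_wscale>:<rcv_wscale>: the scale factors for the send and receive windows. TCP was specified when network and computing resources were scarce, and the window field in its header is only 16 bits wide, so a window cannot exceed 64 KiB without scaling. Today's high-bandwidth links need a scale factor to get large windows.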

    if window scale option is used, this field shows the send scale factor and receive scale factor.

  • rto: the dynamically computed TCP retransmission timeout, in milliseconds.

    tcp re-transmission timeout value, the unit is millisecond.

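  • rtt:<rtt>/<rttvar>: round trip time, the measured and estimated time for a packet to reach the peer and for the response to come back. rtt is the average and rttvar is the mean deviation.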

    rtt is the average round trip time, rttvar is the mean deviation of rtt, their units are millisecond.

  • ato:<ato>: the delayed ACK timeout.

    ack timeout, unit is millisecond, used for delay ack mode.
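As a worked example for the wscale field, the sketch below computes the largest window that a given scale factor allows: the 16-bit window value shifted left by the scale. The function name is my own, and the 0 to 14 range check follows the window scale option's limit; the values 9 and 7 come from `wscale:9,7` in the sample output.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit window field
MAX_WSCALE = 14              # largest shift the window scale option allows


def max_window_bytes(wscale):
    """Largest window (in bytes) expressible with the given scale factor."""
    if not 0 <= wscale <= MAX_WSCALE:
        raise ValueError("window scale must be between 0 and 14")
    return MAX_UNSCALED_WINDOW << wscale


for scale in (0, 7, 9):
    print(scale, max_window_bytes(scale))
```

With a scale of 7 the ceiling is about 8 MiB, and with 9 about 32 MiB, compared with 64 KiB when no scaling is used.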

Others:

              bytes_acked:<bytes_acked>
                     bytes acked

              bytes_received:<bytes_received>
                     bytes received

              segs_out:<segs_out>
                     segments sent out

              segs_in:<segs_in>
                     segments received

              send <send_bps>bps
                     egress bps

              lastsnd:<lastsnd>
                     how long time since the last packet sent, the unit
                     is millisecond

              lastrcv:<lastrcv>
                     how long time since the last packet received, the
                     unit is millisecond

              lastack:<lastack>
                     how long time since the last ack received, the unit
                     is millisecond

              pacing_rate <pacing_rate>bps/<max_pacing_rate>bps
                     the pacing rate and max pacing rate

              rcv_space:<rcv_space>
                     a helper variable for TCP internal auto tuning
                     socket receive buffer

MTU/MSS related

mss

The MSS currently in effect on the connection, which limits the size of outgoing segments (the kernel calls it the "current effective sending MSS").

https://github.com/CumulusNet…

s.mss         = info->tcpi_snd_mss

https://elixir.bootlin.com/li…

   info->tcpi_snd_mss = tp->mss_cache;

https://elixir.bootlin.com/li…

/*
tp->mss_cache is current effective sending mss, including
all tcp options except for SACKs. It is evaluated,
taking into account current pmtu, but never exceeds
tp->rx_opt.mss_clamp.
...
*/
unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
{
...
   tp->mss_cache = mss_now;

   return mss_now;
}

advmss

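The MSS option carried in the SYN segment that the local host sends when the connection is established. Its purpose is to tell the peer, at connection setup, the largest segment the local host can receive. Advertised MSS by the host when connection started (in SYN packet).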

https://elixir.bootlin.com/li…
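The relationship between MTU and MSS can be sketched with simple arithmetic. The snippet below assumes IPv4 without IP options and a TCP header whose only option is the 12-byte aligned timestamp; the constant and function names are mine. Under these assumptions the numbers in the sample output line up: pmtu:9000 gives 8948 (the advmss shown), and a 1500-byte MTU gives 1448 (the mss shown, which would be consistent with a peer on a standard Ethernet MTU, though that is only a guess).

```python
IPV4_HEADER_BYTES = 20        # IPv4 header without options (assumption)
TCP_HEADER_BYTES = 20         # TCP header without options
TCP_TIMESTAMP_OPT_BYTES = 12  # timestamp option, padded to 12 bytes


def mss_for_mtu(mtu, timestamps=True):
    """Payload bytes per segment for a given MTU under the assumptions above."""
    mss = mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES
    if timestamps:
        mss -= TCP_TIMESTAMP_OPT_BYTES
    return mss


print(mss_for_mtu(9000))  # 8948
print(mss_for_mtu(1500))  # 1448
```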

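pmtu

The path MTU to the peer, as discovered by Path MTU Discovery. Path MTU value.

A few points to note:

  • Linux caches the MTU measured for each peer IP in the route cache, which avoids repeating Path MTU Discovery for the same peer
  • Linux has two different implementations of Path MTU Discovery

    • The traditional ICMP-based RFC 1191

      • But many routers and NAT devices today do not handle ICMP correctly
    • Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821 and RFC 8899)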

https://github.com/shemminger…

        s.pmtu         = info->tcpi_pmtu;

https://elixir.bootlin.com/li…

    info->tcpi_pmtu = icsk->icsk_pmtu_cookie;

https://elixir.bootlin.com/li…

//@icsk_pmtu_cookie       Last pmtu seen by socket
struct inet_connection_sock {
    ...
    __u32              icsk_pmtu_cookie;

https://elixir.bootlin.com/li…

unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
{
 /* And store cached results */
    icsk->icsk_pmtu_cookie = pmtu;

https://elixir.bootlin.com/li…

https://elixir.bootlin.com/li…

https://elixir.bootlin.com/li…

rcvmss

Honestly, I have not figured this one out. Some references:

MSS used for delayed ACK decisions.

https://elixir.bootlin.com/li…

        __u16          rcv_mss;     /* MSS used for delayed ACK decisions       */

https://elixir.bootlin.com/li…

/* Initialize RCV_MSS value.
 * RCV_MSS is an our guess about MSS used by the peer.
 * We haven't any direct information about the MSS.
 * It's better to underestimate the RCV_MSS rather than overestimate.
 * Overestimations make us ACKing less frequently than needed.
 * Underestimations are more easy to detect and fix by tcp_measure_rcv_mss().
 */
void tcp_initialize_rcv_mss(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    unsigned int hint = min_t(unsigned int, tp->advmss, tp->mss_cache);

    hint = min(hint, tp->rcv_wnd / 2);
    hint = min(hint, TCP_MSS_DEFAULT);
    hint = max(hint, TCP_MIN_MSS);

    inet_csk(sk)->icsk_ack.rcv_mss = hint;
}

Flow control

cwnd

cwnd: congestion window size, in units of MSS-sized segments.

https://en.wikipedia.org/wiki…,small%20multiple,-of%20the%20maximum

Congestion window size in bytes = cwnd * mss.
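Applying this formula to the sample output (cwnd:10, mss:1448, rtt:0.689) gives a quick worked example. The second function is an estimate of how fast a full window can be sent per round trip; as far as I can tell, the `send 168.1Mbps` value in the sample is consistent with it, but treat that link as my assumption rather than a documented definition. Function names are hypothetical.

```python
def cwnd_bytes(cwnd, mss):
    """Congestion window in bytes: segments times bytes per segment."""
    return cwnd * mss


def send_rate_bps(cwnd, mss, rtt_ms):
    """Bits per second if one full congestion window is sent every RTT."""
    return cwnd_bytes(cwnd, mss) * 8 / (rtt_ms / 1000.0)


print(cwnd_bytes(10, 1448))                           # 14480
print(round(send_rate_bps(10, 1448, 0.689) / 1e6, 1))  # 168.1 (Mbps)
```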

ssthresh

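When the local TCP stack detects network congestion, it shrinks the congestion window; ssthresh is the slow start threshold that governs this adjustment.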

              ssthresh:<ssthresh>
                     tcp congestion window slow start threshold

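Retransmission (retrans)

retrans

TCP retransmission statistics. The format is:

number of segments retransmitted and not yet acknowledged / total number of segment retransmissions over the whole connection.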

https://unix.stackexchange.co…

(Retransmitted packets out) / (Total retransmits for entire connection)
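The X/Y format is easy to split programmatically. The sketch below parses the value from `retrans:0/25813` in the sample output and adds a simple ratio helper; both function names are mine, and the ratio (total retransmits divided by data segments sent) is just one plausible way to normalize the counter, shown here with made-up numbers.

```python
def parse_retrans(value):
    """Split the value of `retrans:X/Y` into (outstanding, total)."""
    outstanding, total = value.split("/")
    return int(outstanding), int(total)


def retrans_ratio(total_retrans, data_segs_out):
    """Retransmissions per data segment sent (hypothetical health metric)."""
    if data_segs_out <= 0:
        return 0.0
    return total_retrans / data_segs_out


print(parse_retrans("0/25813"))  # (0, 25813)
print(retrans_ratio(5, 1000))    # 0.005
```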

add more TCP_INFO components

retrans:X/Y

     X: number of outstanding retransmit packets

     Y: total number of retransmits for the session

  • s.retrans_total

https://github.com/shemminger…

        s.retrans_total  = info->tcpi_total_retrans;

https://elixir.bootlin.com/li…

struct tcp_info {
    __u32    tcpi_retrans;
    __u32    tcpi_total_retrans;

https://elixir.bootlin.com/li…

    info->tcpi_total_retrans = tp->total_retrans;

https://elixir.bootlin.com/li…

struct tcp_sock {
    u32    total_retrans;    /* Total retransmits for entire connection */

  • s.retrans

Number of segments retransmitted and not yet acknowledged.

https://github.com/shemminger…

        s.retrans     = info->tcpi_retrans;

https://elixir.bootlin.com/li…

    info->tcpi_retrans = tp->retrans_out;

https://elixir.bootlin.com/li…

struct tcp_sock {
    u32    retrans_out;    /* Retransmitted packets out        */

bytes_retrans

Total data bytes retransmitted.

Timers

Newcomers to TCP implementations may find it hard to imagine that, besides input and output events, TCP is also driven by a number of timers. ss can show these timers.

              Show timer information. For TCP protocol, the output
              format is:

              timer:(<timer_name>,<expire_time>,<retrans>)

              <timer_name>
                     the name of the timer, there are five kind of timer
                     names:

                     on : means one of these timers: TCP retrans timer,
                     TCP early retrans timer and tail loss probe timer

                     keepalive: tcp keep alive timer

                     timewait: timewait stage timer

                     persist: zero window probe timer

                     unknown: none of the above timers

              <expire_time>
                     how long time the timer will expire
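The timer field described above has a regular shape, so it can be pulled out with a small regular expression. This is a sketch: the function name is mine and the sample string `timer:(keepalive,26sec,0)` is a made-up example of the documented `timer:(<timer_name>,<expire_time>,<retrans>)` format, not output captured from a real host.

```python
import re

# timer:(<timer_name>,<expire_time>,<retrans>)
TIMER_RE = re.compile(
    r"timer:\((?P<name>\w+),(?P<expire>[^,]+),(?P<retrans>\d+)\)"
)


def parse_timer(text):
    """Return the timer fields found in an ss output line, or None."""
    match = TIMER_RE.search(text)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "expire": match.group("expire"),   # kept as text, e.g. "26sec"
        "retrans": int(match.group("retrans")),
    }


print(parse_timer("ESTAB 0 0 10.0.2.15:22 10.0.2.2:51000 timer:(keepalive,26sec,0)"))
```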

Other

app_limited

https://unix.stackexchange.co…

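limit TCP flows with application-limiting in request or responses. My understanding is that this is a boolean: if ss shows the app_limited flag, the application is not using all of the sending bandwidth TCP has available, i.e. the connection has spare capacity to send more.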

  tcpi_delivery_rate: The most recent goodput, as measured by
    tcp_rate_gen(). If the socket is limited by the sending
    application (e.g., no data to send), it reports the highest
    measurement instead of the most recent. The unit is bytes per
    second (like other rate fields in tcp_info).

  tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
    was measured when the socket's throughput was limited by the
    sending application.

https://github.com/shemminger…

        s.app_limited = info->tcpi_delivery_rate_app_limited;

https://elixir.bootlin.com/li…

/* If a gap is detected between sends, mark the socket application-limited. */
void tcp_rate_check_app_limited(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (/* We have less than one packet to send. */
        tp->write_seq - tp->snd_nxt < tp->mss_cache &&
        /* Nothing in sending host's qdisc queues or NIC tx queue. */
        sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
        /* We are not limited by CWND. */
        tcp_packets_in_flight(tp) < tp->snd_cwnd &&
        /* All lost packets have been retransmitted. */
        tp->lost_out <= tp->retrans_out)
        tp->app_limited =
            (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
}
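To make the four conditions in the kernel excerpt above easier to read, here is a rough Python restatement as a pure function. It is only an illustration: the parameter names are mine, sequence-number wraparound is ignored, and `skb_truesize_1` stands in for the kernel's `SKB_TRUESIZE(1)`, whose real value depends on the kernel build (the 768 used below is a placeholder).

```python
def is_app_limited(*, unsent_bytes, mss, wmem_alloc, skb_truesize_1,
                   packets_in_flight, cwnd, lost_out, retrans_out):
    """True when all four kernel conditions for 'application-limited' hold."""
    return (
        unsent_bytes < mss                # less than one packet left to send
        and wmem_alloc < skb_truesize_1   # nothing queued in qdisc / NIC
        and packets_in_flight < cwnd      # not limited by cwnd
        and lost_out <= retrans_out       # all lost packets retransmitted
    )


# Idle sender: nothing queued, window mostly unused.
print(is_app_limited(unsent_bytes=0, mss=1448, wmem_alloc=0,
                     skb_truesize_1=768, packets_in_flight=2, cwnd=10,
                     lost_out=0, retrans_out=0))  # True
```

If any one condition fails, for example when packets in flight already fill cwnd, the connection is limited by the network rather than by the application.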

Special operations

specified network namespace

Specify the network namespace that ss should use, for example ss -N /proc/322/ns/net

       -N NSNAME, --net=NSNAME
              Switch to the specified network namespace name.

kill socket

Forcibly close a TCP connection.

       -K, --kill
              Attempts to forcibly close sockets. This option displays
              sockets that are successfully closed and silently skips
              sockets that the kernel does not support closing. It
              supports IPv4 and IPv6 sockets only.

Example:

sudo ss -K 'dport 22'

Watching connection close events

ss -ta -E
State                  Recv-Q                 Send-Q                                   Local Address:Port                                     Peer Address:Port                     Process                 

UNCONN                0                     0                                        10.0.2.15:40612                                            172.67.141.218:http               

Filters

For example:

ss -apu state unconnected 'sport = :1812'

How it works

Netlink

https://events.static.linuxfo…

https://man7.org/linux/man-pa…

socket(AF_NETLINK, SOCK_RAW, NETLINK_INET_DIAG);
/**
       NETLINK_SOCK_DIAG (since Linux 3.3)
              Query information about sockets of various protocol
              families from the kernel (see sock_diag(7)).
**/
  • Fetch information about sockets – Used by ss (“another utility to investigate sockets”)
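As a rough illustration of what a sock_diag query looks like on the wire, the sketch below builds (but does not send) a netlink request for established IPv4 TCP sockets. The constants are copied from the Linux UAPI headers as I understand them (NETLINK_SOCK_DIAG is 4 and NETLINK_INET_DIAG is an alias for it, SOCK_DIAG_BY_FAMILY is 20); the function name and the choice to zero the whole socket id are my own simplifications.

```python
import struct

NETLINK_SOCK_DIAG = 4        # same value as NETLINK_INET_DIAG
SOCK_DIAG_BY_FAMILY = 20     # nlmsg_type for sock_diag requests
NLM_F_REQUEST = 0x001
NLM_F_DUMP = 0x300           # NLM_F_ROOT | NLM_F_MATCH
AF_INET = 2
IPPROTO_TCP = 6
TCP_ESTABLISHED = 1          # state number; the request takes a bitmask
NLMSG_HDRLEN = 16
SOCKID_LEN = 48              # sizeof(struct inet_diag_sockid)


def build_inet_diag_request(ext=0, seq=1):
    """Pack nlmsghdr + inet_diag_req_v2 asking for ESTABLISHED TCP sockets."""
    # sdiag_family, sdiag_protocol, idiag_ext, pad, idiag_states
    req = struct.pack("=BBBxI", AF_INET, IPPROTO_TCP, ext,
                      1 << TCP_ESTABLISHED)
    req += bytes(SOCKID_LEN)  # zeroed socket id: match any socket
    # nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
    header = struct.pack("=IHHII", NLMSG_HDRLEN + len(req),
                         SOCK_DIAG_BY_FAMILY,
                         NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + req


message = build_inet_diag_request()
print(len(message))  # 72
```

On Linux, such a message could then be sent through `socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_SOCK_DIAG)` and the replies read back with `recv`; parsing the replies is left out here to keep the sketch short.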

NETLINK_INET_DIAG

https://man7.org/linux/man-pa…

idiag_ext

Here you can see where ss gets its data, which effectively serves as another form of documentation.

https://man7.org/linux/man-pa…,idiag_ext,-This%20is%20a

    The fields of struct inet_diag_req_v2 are as follows:

       idiag_ext
          This is a set of flags defining what kind of extended
          information to report.  Each requested kind of information
          is reported back as a netlink attribute as described
          below:

          INET_DIAG_TOS
                 The payload associated with this attribute is a
                 __u8 value which is the TOS of the socket.

          INET_DIAG_TCLASS
                 The payload associated with this attribute is a
                 __u8 value which is the TClass of the socket.  IPv6
                 sockets only.  For LISTEN and CLOSE sockets, this
                 is followed by INET_DIAG_SKV6ONLY attribute with
                 associated __u8 payload value meaning whether the
                 socket is IPv6-only or not.

          INET_DIAG_MEMINFO
                 The payload associated with this attribute is
                 represented in the following structure:

                     struct inet_diag_meminfo {
                         __u32 idiag_rmem;
                         __u32 idiag_wmem;
                         __u32 idiag_fmem;
                         __u32 idiag_tmem;
                     };

                 The fields of this structure are as follows:

                 idiag_rmem
                        The amount of data in the receive queue.

                 idiag_wmem
                        The amount of data that is queued by TCP but
                        not yet sent.

                 idiag_fmem
                        The amount of memory scheduled for future
                        use (TCP only).

                 idiag_tmem
                        The amount of data in send queue.

          INET_DIAG_SKMEMINFO
                 The payload associated with this attribute is an
                 array of __u32 values described below in the
                 subsection "Socket memory information".

          INET_DIAG_INFO
                 The payload associated with this attribute is
                 specific to the address family.  For TCP sockets,
                 it is an object of type struct tcp_info.

          INET_DIAG_CONG
                 The payload associated with this attribute is a
                 string that describes the congestion control
                 algorithm used.  For TCP sockets only.

       idiag_timer
              For TCP sockets, this field describes the type of timer
              that is currently active for the socket.  It is set to one
              of the following constants:

                   0      no timer is active
                   1      a retransmit timer
                   2      a keep-alive timer
                   3      a TIME_WAIT timer
                   4      a zero window probe timer

              For non-TCP sockets, this field is set to 0.

       idiag_retrans
              For idiag_timer values 1, 2, and 4, this field contains
              the number of retransmits.  For other idiag_timer values,
              this field is set to 0.

       idiag_expires
              For TCP sockets that have an active timer, this field
              describes its expiration time in milliseconds.  For other
              sockets, this field is set to 0.

       idiag_rqueue
              For listening sockets: the number of pending connections.

              For other sockets: the amount of data in the incoming
              queue.

       idiag_wqueue
              For listening sockets: the backlog length.

              For other sockets: the amount of memory available for
              sending.

       idiag_uid
              This is the socket owner UID.

       idiag_inode
              This is the socket inode number.
              
   Socket memory information
       The payload associated with UNIX_DIAG_MEMINFO and
       INET_DIAG_SKMEMINFO netlink attributes is an array of the
       following __u32 values:

       SK_MEMINFO_RMEM_ALLOC
              The amount of data in receive queue.

       SK_MEMINFO_RCVBUF
              The receive socket buffer as set by SO_RCVBUF.

       SK_MEMINFO_WMEM_ALLOC
              The amount of data in send queue.

       SK_MEMINFO_SNDBUF
              The send socket buffer as set by SO_SNDBUF.

       SK_MEMINFO_FWD_ALLOC
              The amount of memory scheduled for future use (TCP only).

       SK_MEMINFO_WMEM_QUEUED
              The amount of data queued by TCP, but not yet sent.

       SK_MEMINFO_OPTMEM
              The amount of memory allocated for the socket's service
              needs (e.g., socket filter).

       SK_MEMINFO_BACKLOG
              The amount of packets in the backlog (not yet processed).
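These socket memory values are what `ss -m` prints in its `skmem:(...)` group. The sketch below parses that group into named fields. The prefix-to-name mapping follows the ss manual's description of the skmem format as I remember it, so treat it as an assumption; the function name and the sample string are illustrative.

```python
import re

# Prefix used in skmem:(...) mapped to a descriptive name (assumed mapping).
SKMEM_FIELDS = {
    "r": "rmem_alloc", "rb": "rcv_buf", "t": "wmem_alloc",
    "tb": "snd_buf", "f": "fwd_alloc", "w": "wmem_queued",
    "o": "opt_mem", "bl": "back_log", "d": "sock_drop",
}


def parse_skmem(text):
    """Return the skmem counters from an `ss -m` line, or None if absent."""
    group = re.search(r"skmem:\(([^)]*)\)", text)
    if group is None:
        return None
    result = {}
    for item in group.group(1).split(","):
        part = re.fullmatch(r"([a-z]+)(\d+)", item.strip())
        if part and part.group(1) in SKMEM_FIELDS:
            result[SKMEM_FIELDS[part.group(1)]] = int(part.group(2))
    return result


mem = parse_skmem("skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d0)")
print(mem["rcv_buf"], mem["snd_buf"])  # 131072 87040
```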

Note INET_DIAG_INFO above:

For TCP sockets, it is an object of type struct tcp_info
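Since idiag_ext is a set of flags, a request that wants struct tcp_info back has to set the matching bit. The sketch below shows how such a mask could be built, assuming the attribute numbers from the kernel's inet_diag header (INET_DIAG_MEMINFO = 1 through INET_DIAG_SKMEMINFO = 7) and the convention that attribute N is requested with bit N - 1. The helper name is mine.

```python
# Attribute numbers as defined in the kernel's inet_diag UAPI header.
INET_DIAG_MEMINFO = 1
INET_DIAG_INFO = 2
INET_DIAG_VEGASINFO = 3
INET_DIAG_CONG = 4
INET_DIAG_TOS = 5
INET_DIAG_TCLASS = 6
INET_DIAG_SKMEMINFO = 7


def idiag_ext_mask(*attributes):
    """Combine attribute numbers into an idiag_ext bitmask (bit = attr - 1)."""
    mask = 0
    for attr in attributes:
        mask |= 1 << (attr - 1)
    return mask


# Ask for tcp_info plus the congestion control algorithm name.
print(idiag_ext_mask(INET_DIAG_INFO, INET_DIAG_CONG))  # 10
```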

Netlink in deep

https://wiki.linuxfoundation….

https://medium.com/thg-tech-b…

References

https://djangocas.dev/blog/hu…

https://man7.org/linux/man-pa…
