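Goal
In a k8s environment, restart the application process (pid:1) without stopping or restarting the container, or even load and run a new version of the application. This article uses gdb as the tool and calls the close and exec functions of the libc that ships in the application container to achieve this goal.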
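Background
K8s has clearly moved from its rise into maturity. Now that the wave has passed, it is time to consider which of the early boasts turned out to be true and which remain undelivered. There is no denying that k8s has changed how operations work, and that is largely progress. For development, however, and especially for troubleshooting and debugging, things have clearly become harder.
When troubleshooting or debugging an application, we often want to restart it repeatedly in the same environment, to see whether a configuration change or a program update really fixes the problem. Achieving that usually requires one of:
- Modify the application code, run the CI pipeline to rebuild the docker image, and upload it. (Slow and laborious.)
- Find some way to restart the container. (The environment is destroyed, so the problem may no longer reproduce.)
If the container's startup script was designed to support restarts, there is of course no problem. In most cases, though, this is not directly supported, and very often the application's main process is itself the container's main process, pid:1.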
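I have looked into three ways to replace the main process:
- Use gdb to call libc's execl. This is the method described in this article.
- Use gdb to invoke the execve syscall. This method is more complicated but also more general; I am still looking into it.
- Suspend the main process's parent with kill -STOP, or attach gdb to that parent, so that it does not receive SIGCHLD.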
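Warning
This article uses gdb attach together with unconventional ways of closing file descriptors (close(fd)) and replacing the process's executable (exec), so the unknown risks are potentially large. Please do not use it in production. I have not fully verified the reliability or the side effects of this method, including whether close and exec cleanly dispose of everything the previous process left behind. Use it at your own risk.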
Approach
- Attach gdb to the process.
- Call close on its fds, especially the socket-related ones, and above all those listening on a TCP port.
- Call exec to run the same ELF file or a different one.
Steps
Set up the experimental target environment
Environment:
node: 192.168.122.55
- Ubuntu 22.04.2 LTS
- kernel: 5.4.0-152-generic
- hostname: worknode5
- gdb 9.2
- docker image repo: 192.168.122.1:5000
I use tomcat as the example:
docker pull tomcat:9.0.76-jre17
docker tag tomcat:9.0.76-jre17 localhost:5000/tomcat:9.0.76-jre17
docker push localhost:5000/tomcat:9.0.76-jre17
Run the pod:
kubectl delete StatefulSet tomcat-worknode5
kubectl apply -f - <<"EOF"
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tomcat-worknode5
  labels:
    app: tomcat-worknode5
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tomcat-worknode5
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tomcat-worknode5
        app: tomcat-worknode5
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Always
      imagePullSecrets:
      - name: docker-registry-key
      containers:
      - name: main-app
        image: 192.168.122.1:5000/tomcat:9.0.76-jre17
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          protocol: TCP
          name: http
      affinity: # make sure running at worknode5(192.168.122.55)
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "kubernetes.io/hostname"
                operator: In
                values:
                - "worknode5"
EOF
Inspect the tomcat process
ssh labile@192.168.122.55

# get the pid of tomcat
export POD="tomcat-worknode5-0"
PIDS=$(pgrep java)
while IFS= read -r TPID; do
    HN=$(sudo nsenter -u -t $TPID hostname)
    if [[ "$HN" = "$POD" ]]; then # space between = is important
        sudo nsenter -u -t $TPID hostname
        export POD_PID=$TPID
    fi
done <<< "$PIDS"
echo $POD_PID
export PID=$POD_PID

# Process status:
➜ ~ ps -f $PID | cat
UID PID PPID C STIME TTY STAT TIME CMD
root 38123 37929 2 07:06 ? Ssl 0:04 /opt/java/openjdk/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start

# Process tree:
➜ ~ ps -ef --forest | grep -B2 $PID
root 37929 1 0 07:06 ? 00:00:00 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 9a43e5d86f29e64eb67ccf2ef19d442e4a06af69def716e181fa52694ec9f43b -address /run/containerd/containerd.sock
65535 37963 37929 0 07:06 ? 00:00:00 \_ /pause
root 38123 37929 1 07:06 ? 00:00:05 \_ /opt/java/openjdk/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apac....local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start

# The socket fds are visible
➜ ~ sudo ls -l /proc/$PID/fd
total 0
lrwx------ 1 root root 64 Jun 18 07:06 0 -> /dev/null
l-wx------ 1 root root 64 Jun 18 07:06 1 -> 'pipe:[181879]'
l-wx------ 1 root root 64 Jun 18 07:07 10 -> /usr/local/tomcat/logs/host-manager.2023-06-18.log
...
lr-x------ 1 root root 64 Jun 18 07:07 19 -> /usr/local/tomcat/lib/tomcat-i18n-es.jar
l-wx------ 1 root root 64 Jun 18 07:06 2 -> 'pipe:[181880]'
lr-x------ 1 root root 64 Jun 18 07:07 20 -> /usr/local/tomcat/lib/tomcat-i18n-cs.jar
...
lrwx------ 1 root root 64 Jun 18 07:07 43 -> 'socket:[182049]'
lrwx------ 1 root root 64 Jun 18 07:07 44 -> 'socket:[182552]'
l-wx------ 1 root root 64 Jun 18 07:07 45 -> /usr/local/tomcat/logs/localhost_access_log.2023-06-18.txt
lrwx------ 1 root root 64 Jun 18 07:07 46 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jun 18 07:07 47 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 18 07:07 49 -> 'socket:[182059]'
...
l-wx------ 1 root root 64 Jun 18 07:07 9 -> /usr/local/tomcat/logs/manager.2023-06-18.log

➜ ~ sudo strings /proc/$PID/environ
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_SERVICE_PORT=443
HOSTNAME=tomcat-worknode5-0
LANGUAGE=en_US:en
JAVA_HOME=/opt/java/openjdk
GPG_KEYS=48F8E69F6390C9F25CFEDCD268248959359E722B A9C5DF4D22E99998D9875A5110C01C5A2F6059E7 DCFD35E0BF8CA7344752DE8B6FB21E8933C60243
HTTPBIN_PORT_80_TCP_PORT=80
PWD=/usr/local/tomcat
TOMCAT_SHA512=028163cbe15367f0ab60e086b0ebc8d774e62d126d82ae9152f863d4680e280b11c9503e3b51ee7089ca9bea1bfa5b535b244a727a3021e5fa72dd7e9569af9a
TOMCAT_MAJOR=9
HOME=/root
LANG=en_US.UTF-8
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
HTTPBIN_SERVICE_PORT=80
TOMCAT_NATIVE_LIBDIR=/usr/local/tomcat/native-jni-lib
CATALINA_HOME=/usr/local/tomcat
SHLVL=0
KUBERNETES_PORT_443_TCP_PROTO=tcp
JDK_JAVA_OPTIONS= --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED
FORTIO_SERVER_SERVICE_PORT_HTTP=8080
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
LD_LIBRARY_PATH=/usr/local/tomcat/native-jni-lib
KUBERNETES_SERVICE_HOST=10.96.0.1
LC_ALL=en_US.UTF-8
KUBERNETES_PORT=tcp://10.96.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
PATH=/usr/local/tomcat/bin:/opt/java/openjdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
TOMCAT_VERSION=9.0.76
JAVA_VERSION=jdk-17.0.7+7

➜ ~ sudo nsenter -n -t $PID ss -tpna | grep $PID
# The mapping between fds and listen ports is visible
LISTEN 0 1 [::ffff:127.0.0.1]:8005 *:* users:(("java",pid=38123,fd=49))
LISTEN 0 100 *:8080 *:* users:(("java",pid=38123,fd=43))
The output above is already fairly detailed. A few additional notes:
- Process status: ps shows STAT as Ssl. See man ps for the meaning:
D uninterruptible sleep (usually IO)
I Idle kernel thread
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped by job control signal
t stopped by debugger during the tracing
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct ("zombie") process, terminated but not reaped by its parent

For BSD formats and when the stat keyword is used, additional characters may be displayed:

< high-priority (not nice to other users)
N low-priority (nice to other users)
L has pages locked into memory (for real-time and custom IO)
s is a session leader
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+ is in the foreground process group
- socket listen:
➜ ~ sudo nsenter -n -t $PID ss -tpna | grep $PID
LISTEN 0 1 [::ffff:127.0.0.1]:8005 *:* users:(("java",pid=38123,fd=49))
LISTEN 0 100 *:8080 *:* users:(("java",pid=38123,fd=43))
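The fd numbers of the listening sockets matter most in the later close step, so it can help to pull them out of the ss output mechanically. The following is only a minimal sketch: the function name listen_fds is hypothetical, and it assumes the column layout of the ss -tpna output shown above (state, two queue columns, local address, peer address, users).

```bash
#!/usr/bin/env bash
# Read `ss -tpna` output on stdin and print "port fd" for each LISTEN socket.
# Assumption: columns are laid out as in the article's sample output.
listen_fds() {
    local state recvq sendq local_addr peer users
    while read -r state recvq sendq local_addr peer users; do
        # Skip the header line and every non-listening socket.
        [[ $state == LISTEN ]] || continue
        # The users column looks like users:(("java",pid=38123,fd=49)).
        [[ $users =~ fd=([0-9]+) ]] || continue
        # The port is whatever follows the last ':' of the local address.
        echo "${local_addr##*:} ${BASH_REMATCH[1]}"
    done
}
```

A possible use would be `sudo nsenter -n -t $PID ss -tpna | listen_fds`, which prints one "port fd" pair per line.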
gdb attach
➜ ~ sudo gdb -p $PID
...
# The next line shows that the java process uses libc
0x00007ffff7dea197 in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6

➜ ~ ps -f $PID
# STAT changes to `tsl`
UID PID PPID C STIME TTY STAT TIME CMD
root 38123 37929 0 07:06 ? tsl 0:06 /opt/java/openjdk/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Dj
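Besides the t in the STAT column, the kernel also records which process is tracing the target in the TracerPid field of /proc/<pid>/status. Below is a small sketch for reading it; the function name tracer_pid is hypothetical, and it takes the status file path as an argument so it can be tried on any file with the same layout.

```bash
#!/usr/bin/env bash
# Print the TracerPid value from a /proc/<pid>/status style file.
# 0 means no tracer is attached; any other value is the tracer's pid.
tracer_pid() {
    awk '$1 == "TracerPid:" { print $2 }' "$1"
}
```

With gdb attached, `tracer_pid /proc/$PID/status` should print the pid of the gdb process, and 0 again after detach.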
close fd
# fds 0 to 2 are stdin/stdout/stderr and are left open; everything else, starting from 3, is closed.
# The earlier ls -l /proc/$PID/fd shows that the largest fd number is 49
(gdb) set $max=49
(gdb) set $current=3
(gdb) while ($current <= $max)
 > p (int) close($current++)
 >end
# In my environment I had to run the loop above several times before all fds were really closed. I do not know why.

➜ ~ sudo ls -l /proc/$PID/fd
total 0
# The socket fds have been closed
lrwx------ 1 root root 64 Jun 18 07:06 0 -> /dev/null
l-wx------ 1 root root 64 Jun 18 07:06 1 -> 'pipe:[181879]'
l-wx------ 1 root root 64 Jun 18 07:06 2 -> 'pipe:[181880]'
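Since the close loop may need more than one pass, it is convenient to list whatever fds above 2 are still open after each pass rather than reading the ls output by eye. This is a rough sketch only; the function name fds_above_stdio is hypothetical, and it takes the fd directory as an argument (normally /proc/$PID/fd, which needs root for a root-owned process).

```bash
#!/usr/bin/env bash
# List, in numeric order, the fd numbers greater than 2 found in an fd directory.
# $1: directory whose entries are named after fd numbers, e.g. /proc/$PID/fd
fds_above_stdio() {
    local dir=$1 name
    for name in "$dir"/*; do
        name=${name##*/}                      # keep only the entry name
        [[ $name =~ ^[0-9]+$ ]] || continue   # ignore anything non-numeric
        if (( name > 2 )); then
            echo "$name"
        fi
    done | sort -n
}
```

An empty result from `sudo bash -c "$(declare -f fds_above_stdio); fds_above_stdio /proc/$PID/fd"` would suggest the loop has finished its job; the last line of the output, if any, is the value to use for $max.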
Call libc's execl to replace the process
Function reference: man exec
# The arguments come from the earlier ps -f $PID output. Note: the first argument of the call is the path to the executable (ELF); the second is the process name (argv[0]), not an ordinary command-line argument; and the list must end with a null pointer.
# To make verification easy, I added one argument: -DgdbRestarted=true
p (int) execl("/opt/java/openjdk/bin/java", "java", "-DgdbRestarted=true", "-Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties", "-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager", "-Djdk.tls.ephemeralDHKeySize=2048", "-Djava.protocol.handler.pkgs=org.apache.catalina.webresources", "-Dorg.apache.catalina.security.SecurityListener.UMASK=0027", "-Dignore.endorsed.dirs=", "-classpath", "/usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar", "-Dcatalina.base=/usr/local/tomcat", "-Dcatalina.home=/usr/local/tomcat", "-Djava.io.tmpdir=/usr/local/tomcat/temp", "org.apache.catalina.startup.Bootstrap", "start", (char *) 0)
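Typing such a long argument list by hand is error-prone, so the expression could instead be generated from the kernel's record of the original arguments in /proc/$PID/cmdline, where they are separated by NUL bytes. The sketch below is an illustration under stated assumptions, not a tested tool: the function name build_execl_call is hypothetical, argv[0] is copied as recorded (a full path here, where the command above passes "java"), arguments containing double quotes or backslashes are not escaped, and the list is terminated with (char *) 0 as man exec asks for a null pointer.

```bash
#!/usr/bin/env bash
# Print a gdb expression that calls execl with the arguments found in a
# NUL-separated cmdline file.
# $1: path of the executable to exec
# $2: cmdline file, e.g. /proc/$PID/cmdline (capture it before attaching)
build_execl_call() {
    local exe=$1 cmdline_file=$2 arg out
    out="\"$exe\""
    # read -d '' splits on NUL bytes, matching the /proc cmdline format.
    while IFS= read -r -d '' arg; do
        out+=", \"$arg\""   # assumption: arg holds no '"' or '\' characters
    done < "$cmdline_file"
    printf 'p (int) execl(%s, (char *) 0)\n' "$out"
}
```

For example, `build_execl_call /opt/java/openjdk/bin/java /proc/$PID/cmdline` prints one line that can be pasted at the (gdb) prompt, after adding or changing arguments such as -DgdbRestarted=true by hand.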
Check
➜ ~ ps -f $PID | cat
UID PID PPID C STIME TTY STAT TIME CMD
root 38123 37929 0 07:06 ? ts 0:07 java -DgdbRestarted=true -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start
gdb detach
(gdb) detach
Checks after replacing the process
➜ ~ ps -f $PID | cat
UID PID PPID C STIME TTY STAT TIME CMD
root 38123 37929 0 07:06 ? Ssl 0:10 java -DgdbRestarted=true -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs= -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start

➜ ~ sudo ls -l /proc/$PID/fd
total 0
lrwx------ 1 root root 64 Jun 18 07:46 0 -> /dev/null
l-wx------ 1 root root 64 Jun 18 07:46 1 -> 'pipe:[181879]'
l-wx------ 1 root root 64 Jun 18 07:50 10 -> /usr/local/tomcat/logs/host-manager.2023-06-18.log
lr-x------ 1 root root 64 Jun 18 07:50 11 -> /usr/local/tomcat/lib/jaspic-api.jar
...
lr-x------ 1 root root 64 Jun 18 07:50 17 -> /usr/local/tomcat/lib/el-api.jar
lr-x------ 1 root root 64 Jun 18 07:50 18 -> /usr/local/tomcat/lib/ecj-4.20.jar
lr-x------ 1 root root 64 Jun 18 07:50 19 -> /usr/local/tomcat/lib/tomcat-i18n-es.jar
l-wx------ 1 root root 64 Jun 18 07:46 2 -> 'pipe:[181880]'
lr-x------ 1 root root 64 Jun 18 07:50 20 -> /usr/local/tomcat/lib/tomcat-i18n-cs.jar
lr-x------ 1 root root 64 Jun 18 07:50 21 -> /usr/local/tomcat/lib/annotations-api.jar
lr-x------ 1 root root 64 Jun 18 07:50 22 -> /usr/local/tomcat/lib/servlet-api.jar
...
lr-x------ 1 root root 64 Jun 18 07:50 42 -> /usr/local/tomcat/lib/tomcat-i18n-ru.jar
lrwx------ 1 root root 64 Jun 18 07:50 43 -> 'socket:[282010]'
lrwx------ 1 root root 64 Jun 18 07:50 44 -> 'socket:[281414]'
l-wx------ 1 root root 64 Jun 18 07:50 45 -> /usr/local/tomcat/logs/localhost_access_log.2023-06-18.txt
lrwx------ 1 root root 64 Jun 18 07:50 46 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jun 18 07:50 47 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 18 07:50 49 -> 'socket:[282022]'
lr-x------ 1 root root 64 Jun 18 07:50 5 -> /usr/local/tomcat/bin/bootstrap.jar
lr-x------ 1 root root 64 Jun 18 07:50 6 -> /usr/local/tomcat/bin/commons-daemon.jar
l-wx------ 1 root root 64 Jun 18 07:50 7 -> /usr/local/tomcat/logs/catalina.2023-06-18.log
l-wx------ 1 root root 64 Jun 18 07:50 8 -> /usr/local/tomcat/logs/localhost.2023-06-18.log
l-wx------ 1 root root 64 Jun 18 07:50 9 -> /usr/local/tomcat/logs/manager.2023-06-18.log

➜ ~ sudo nsenter -n -t $PID ss -tpna | grep $PID
LISTEN 0 1 [::ffff:127.0.0.1]:8005 *:* users:(("java",pid=38123,fd=49))
LISTEN 0 100 *:8080 *:* users:(("java",pid=38123,fd=43))
kubectl logs tomcat-worknode5-0 | less
NOTE: Picked up JDK_JAVA_OPTIONS: --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED
18-Jun-2023 07:06:57.361 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version name: Apache Tomcat/9.0.76
18-Jun-2023 07:06:57.388 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server built: Jun 5 2023 07:17:04 UTC
18-Jun-2023 07:06:57.389 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version number: 9.0.76.0
18-Jun-2023 07:06:57.389 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name: Linux
18-Jun-2023 07:06:57.390 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Version: 5.4.0-152-generic
18-Jun-2023 07:06:57.390 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Architecture: amd64
18-Jun-2023 07:06:57.390 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Java Home: /opt/java/openjdk
18-Jun-2023 07:06:57.390 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Version: 17.0.7+7
...
18-Jun-2023 07:06:59.137 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["http-nio-8080"]
18-Jun-2023 07:06:59.354 INFO [main] org.apache.catalina.startup.Catalina.load Server initialization in [2996] milliseconds
18-Jun-2023 07:06:59.688 INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service [Catalina]
18-Jun-2023 07:06:59.689 INFO [main] org.apache.catalina.core.StandardEngine.startInternal Starting Servlet engine: [Apache Tomcat/9.0.76]
18-Jun-2023 07:06:59.771 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
18-Jun-2023 07:06:59.845 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [490] milliseconds
------------- After replacement: --------------
NOTE: Picked up JDK_JAVA_OPTIONS: --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED
18-Jun-2023 07:49:57.887 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version name: Apache Tomcat/9.0.76
18-Jun-2023 07:49:57.894 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server built: Jun 5 2023 07:17:04 UTC
18-Jun-2023 07:49:57.894 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version number: 9.0.76.0
18-Jun-2023 07:49:57.895 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name: Linux
18-Jun-2023 07:49:57.898 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Version: 5.4.0-152-generic
18-Jun-2023 07:49:57.899 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Architecture: amd64
18-Jun-2023 07:49:57.899 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Java Home: /opt/java/openjdk
18-Jun-2023 07:49:57.900 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Version: 17.0.7+7
...
18-Jun-2023 07:49:58.777 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["http-nio-8080"]
18-Jun-2023 07:49:58.853 INFO [main] org.apache.catalina.startup.Catalina.load Server initialization in [1340] milliseconds
18-Jun-2023 07:49:58.944 INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service [Catalina]
18-Jun-2023 07:49:58.944 INFO [main] org.apache.catalina.core.StandardEngine.startInternal Starting Servlet engine: [Apache Tomcat/9.0.76]
18-Jun-2023 07:49:58.976 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
18-Jun-2023 07:49:59.009 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [155] milliseconds
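The marker argument -DgdbRestarted=true makes it possible to script the "did the replacement really happen" check instead of scanning the ps output. A minimal sketch follows; the function name has_arg is hypothetical, and it works on any NUL-separated cmdline file so that it does not depend on a live process.

```bash
#!/usr/bin/env bash
# Succeed (exit status 0) if the NUL-separated cmdline file $1 contains
# an argument exactly equal to $2.
has_arg() {
    # Turn NUL separators into newlines, then match whole lines literally.
    tr '\0' '\n' < "$1" | grep -Fxq -- "$2"
}
```

For instance, `has_arg /proc/$PID/cmdline -DgdbRestarted=true && echo replaced` would print "replaced" only for the process started by the execl call in this article.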
Some limitations
Because this method goes through libc, the target process must itself use libc. Most executables do:
➜ ~ sudo ldd /proc/$PID/root/opt/java/openjdk/bin/java
    linux-vdso.so.1 (0x00007ffff7fce000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7f9e000)
    libjli.so => /proc/38123/root/opt/java/openjdk/bin/../lib/libjli.so (0x00007ffff7f8b000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7f68000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffff7f62000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7d70000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fcf000)
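But there are exceptions, including at least:
- The libc in Alpine Linux docker images is unusual (Alpine uses musl rather than glibc), and my gdb could not recognize it.
- Go programs in which libc is statically linked.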
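Before attaching, it may be worth checking which of these cases a target binary falls into. The sketch below classifies ldd output read from stdin; the function name classify_libc and the category names are hypothetical, and the patterns are assumptions based on typical ldd output (libc.so.6 for glibc, libc.musl or ld-musl for musl, and glibc ldd's "not a dynamic executable" message for static binaries).

```bash
#!/usr/bin/env bash
# Read ldd output on stdin and print one of: glibc, musl, static, unknown.
classify_libc() {
    local out
    out=$(cat)
    case $out in
        *libc.musl*|*ld-musl*)         echo musl ;;
        *libc.so.6*)                   echo glibc ;;
        *"not a dynamic executable"*)  echo static ;;
        *)                             echo unknown ;;
    esac
}
```

One way to use it would be `sudo ldd /proc/$PID/root/opt/java/openjdk/bin/java 2>&1 | classify_libc`; only the glibc case matches the setup tested in this article.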