关于云计算:Kubernetes-集群中-Ingress-故障的根因诊断

作者:scwang18,次要负责技术架构,在容器云方向颇有钻研。

前言

KubeSphere 是青云开源的基于 Kubernetes 的云原生分布式操作系统,提供了比拟炫酷的 Kubernetes 集群治理界面,咱们团队用 KubeSphere 来作为开发平台。

本文记录了一次 KubeSphere 环境下的网络故障的解决过程。

景象

开发同学反馈本人搭建的 Harbor 仓库总是出问题,偶然会报 net/http: TLS handshake timeout , 通过 curl 的形式拜访 harbor.xxxx.cn ,也会随机频繁挂起。然而 ping 的反馈一切正常。

起因剖析

接到谬误报障后,通过了多轮剖析,才最终定位到起因,应该是装置 KubeSphere 时,应用了最新版的 Kubernetes 1.23.1 。

尽管应用 ./kk version --show-supported-k8s 能够看到 KubeSphere 3.2.1 能够反对 Kubernetes 1.23.1 ,但实际上只是试验性反对,有坑的。

剖析过程如下:

  1. 呈现 Harbor registry 拜访问题,下意识认为是 Harbor 部署有问题,然而在查看 Harbor core 的日志的时候,没有看到异样时有相应错误信息,甚至 info 级别的日志信息都没有。
  2. 又把指标放在 Harbor portal, 查看拜访日志,一样没有发现异常信息。
  3. 依据拜访链,持续追究 kubesphere-router-kubesphere-system , 即 KubeSphere 版的 nginx ingress controller ,同样没有发现异常日志。
  4. 尝试在集群内其余 Pod 里拜访 Harbor 的集群内 Service 地址,发现不会呈现拜访超时问题。初步判断是 KubeSphere 自带的 Ingress 的问题。
  5. 把 kubeSphere 自带的 Ingress Controller 敞开,装置 Kubernetes 官网举荐的 ingress-nginx-controller 版本, 故障仍旧,而且 Ingress 日志里也没有发现异常信息。
  6. 综合下面的剖析,问题应该呈现在客户端到 Ingress Controller 之间,我的 Ingress Controller 是通过 NodePort 形式裸露到集群里面。因而,测试其余通过 NodePort 裸露到集群外的 service,发现是一样的故障,至此,能够齐全排除 Harbor 部署问题了,根本确定是客户端到 Ingress Controller 的问题。
  7. 内部客户端通过 NodePort 拜访 Ingress Controller 时,会通过 kube-proxy 组件,剖析 kube-proxy 的日志,发现告警信息
can’t set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1

这个告警信息是因为我的 centos 7.6 的内核版本过低, 以后是 3.10.0-1160.21.1.el7.x86_64 ,与 Kubernetes 新版的 ipvs 存在兼容性问题。

能够通过降级操作系统的 kernel 版本能够解决。

  1. 降级完 kernel 后,Calico 启动不了,报以下错误信息

    ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.

起因是装置 KubeSphere 时默认装置的 Calico 版本是 v3.20.0 , 这个版本不反对最新版的 Linux Kernel ,降级后的内核版本是 5.18.1-1.el7.elrepo.x86_64,calico 须要降级到 v3.23.0 以上版本。

  1. 降级完 Calico 版本后,Calico 持续报错

    user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"

还有另外一个错误信息,都是因为 clusterrole 的资源权限有余,能够通过批改 clusterrole 来解决问题。

  1. 至此,该莫名其妙的网络问题解决了。

解决过程

依据下面的剖析,次要解决方案如下:

降级操作系统内核

  1. 应用阿里云的 yum 源
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
  1. 启用 elrepo 仓库
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  1. 装置最新版本内核
yum --enablerepo=elrepo-kernel install kernel-ml
  1. 查看零碎上的所有可用内核
awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
  1. 设置新的内核为 grub2 的默认版本

查看第4步返回的零碎可用内核列表,不出意外第1个应该是最新装置的内核。

grub2-set-default 0
  1. 生成 grub 配置文件并重启
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot now
  1. 验证
uname -r

降级 Calico

Kubernetes 上的 Calico 个别是应用 Daemonset 形式部署,我的集群里,Calico 的 Daemonset 名字是 calico-node。

间接输入为 yaml 文件,批改文件里的所有 image 版本号为最新版本 v3.23.1 。从新创立 Daemonset。

  1. 输入 yaml
kubectl -n kube-system get ds  calico-node -o yaml>calico-node.yaml
  1. calico-node.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

ClusterRole

还须要批改 ClusterRole ,否则 Calico 会始终报权限错。

  1. 输入 yaml
kubectl get clusterrole calico-node -o yaml >calico-node-clusterrole.yaml
  1. calico-node-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get

总结

这次奇怪的网络故障,最终起因还是因为 KubeSphere 的版本与 Kubernetes 的版本不匹配。所以工作环境要稳字为先,不要冒进应用最新的版本。否则会耽误很多工夫来解决莫名其妙的问题。

本文由博客一文多发平台 OpenWrite 公布!

评论

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注

这个站点使用 Akismet 来减少垃圾评论。了解你的评论数据如何被处理