关于云计算:Kubernetes-集群中-Ingress-故障的根因诊断

2次阅读

共计 9116 个字符,预计需要花费 23 分钟才能阅读完成。

作者:scwang18,次要负责技术架构,在容器云方向颇有钻研。

前言

KubeSphere 是青云开源的基于 Kubernetes 的云原生分布式操作系统,提供了比拟炫酷的 Kubernetes 集群治理界面,咱们团队用 KubeSphere 来作为开发平台。

本文记录了一次 KubeSphere 环境下的网络故障的解决过程。

景象

开发同学反馈本人搭建的 Harbor 仓库总是出问题,偶然会报 net/http: TLS handshake timeout,通过 curl 的形式拜访 harbor.xxxx.cn,也会随机频繁挂起。然而 ping 的反馈一切正常。

起因剖析

接到谬误报障后,通过了多轮剖析,才最终定位到起因,应该是装置 KubeSphere 时,应用了最新版的 Kubernetes 1.23.1。

尽管应用 ./kk version --show-supported-k8s 能够看到 KubeSphere 3.2.1 能够反对 Kubernetes 1.23.1,但实际上只是试验性反对,有坑的。

剖析过程如下:

  1. 呈现 Harbor registry 拜访问题,下意识认为是 Harbor 部署有问题,然而在查看 Harbor core 的日志的时候,没有看到异样时有相应错误信息,甚至 info 级别的日志信息都没有。
  2. 又把指标放在 Harbor portal,查看拜访日志,一样没有发现异常信息。
  3. 依据拜访链,持续追究 kubesphere-router-kubesphere-system,即 KubeSphere 版的 nginx ingress controller,同样没有发现异常日志。
  4. 尝试在集群内其余 Pod 里拜访 Harbor 的集群内 Service 地址,发现不会呈现拜访超时问题。初步判断是 KubeSphere 自带的 Ingress 的问题。
  5. 把 kubeSphere 自带的 Ingress Controller 敞开,装置 Kubernetes 官网举荐的 ingress-nginx-controller 版本,故障仍旧,而且 Ingress 日志里也没有发现异常信息。
  6. 综合下面的剖析,问题应该呈现在客户端到 Ingress Controller 之间,我的 Ingress Controller 是通过 NodePort 形式裸露到集群里面。因而,测试其余通过 NodePort 裸露到集群外的 service,发现是一样的故障,至此,能够齐全排除 Harbor 部署问题了,根本确定是客户端到 Ingress Controller 的问题。
  7. 内部客户端通过 NodePort 拜访 Ingress Controller 时,会通过 kube-proxy 组件,剖析 kube-proxy 的日志,发现告警信息
can’t set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1

这个告警信息是因为我的 centos 7.6 的内核版本过低,以后是 3.10.0-1160.21.1.el7.x86_64,与 Kubernetes 新版的 ipvs 存在兼容性问题。

能够通过降级操作系统的 kernel 版本能够解决。

  1. 降级完 kernel 后,Calico 启动不了,报以下错误信息

    ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.

起因是装置 KubeSphere 时默认装置的 Calico 版本是 v3.20.0 , 这个版本不反对最新版的 Linux Kernel,降级后的内核版本是 5.18.1-1.el7.elrepo.x86_64,calico 须要降级到 v3.23.0 以上版本。

  1. 降级完 Calico 版本后,Calico 持续报错

    user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"

还有另外一个错误信息,都是因为 clusterrole 的资源权限有余,能够通过批改 clusterrole 来解决问题。

  1. 至此,该莫名其妙的网络问题解决了。

解决过程

依据下面的剖析,次要解决方案如下:

降级操作系统内核

  1. 应用阿里云的 yum 源
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
  1. 启用 elrepo 仓库
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  1. 装置最新版本内核
yum --enablerepo=elrepo-kernel install kernel-ml
  1. 查看零碎上的所有可用内核
awk -F\''$1=="menuentry "{print i++" : "$2}' /etc/grub2.cfg
  1. 设置新的内核为 grub2 的默认版本

查看第 4 步返回的零碎可用内核列表,不出意外第 1 个应该是最新装置的内核。

grub2-set-default 0
  1. 生成 grub 配置文件并重启
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot now
  1. 验证
uname -r

降级 Calico

Kubernetes 上的 Calico 个别是应用 Daemonset 形式部署,我的集群里,Calico 的 Daemonset 名字是 calico-node。

间接输入为 yaml 文件,批改文件里的所有 image 版本号为最新版本 v3.23.1。从新创立 Daemonset。

  1. 输入 yaml
kubectl -n kube-system get ds  calico-node -o yaml>calico-node.yaml
  1. calico-node.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

ClusterRole

还须要批改 ClusterRole,否则 Calico 会始终报权限错。

  1. 输入 yaml
kubectl get clusterrole calico-node -o yaml >calico-node-clusterrole.yaml
  1. calico-node-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get

总结

这次奇怪的网络故障,最终起因还是因为 KubeSphere 的版本与 Kubernetes 的版本不匹配。所以工作环境要稳字为先,不要冒进应用最新的版本。否则会耽误很多工夫来解决莫名其妙的问题。

本文由博客一文多发平台 OpenWrite 公布!

正文完
 0