Author: scwang18, mainly responsible for technical architecture, with a particular focus on container cloud.

Preface

KubeSphere is a cloud-native, Kubernetes-based distributed operating system open-sourced by QingCloud. It provides a rather slick management UI for Kubernetes clusters, and our team uses KubeSphere as its development platform.

This article documents how we resolved a network failure in a KubeSphere environment.

Symptoms

A developer reported that the Harbor registry he had set up kept misbehaving: it would occasionally return net/http: TLS handshake timeout, and accessing harbor.xxxx.cn with curl also hung frequently and at random. ping, however, looked perfectly normal.
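To put a number on how often the requests hang, a simple curl probe loop is enough. This is only a sketch of how one might measure it (harbor.xxxx.cn stands in for the real domain, and the 10-second cap is an arbitrary choice):

# Hit the registry 50 times; any request that runs into the 10s limit counts as a hang.
for i in $(seq 1 50); do
  curl -k -s -o /dev/null -m 10 -w "%{http_code} %{time_total}s\n" https://harbor.xxxx.cn/
done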

Root cause analysis

After the issue was reported, it took several rounds of analysis to pin down the cause: when installing KubeSphere we had used the latest Kubernetes release, 1.23.1.

Although ./kk version --show-supported-k8s shows that KubeSphere 3.2.1 supports Kubernetes 1.23.1, that support is in fact only experimental, and there are pitfalls.

The analysis went as follows:

  1. Since the failures were on access to the Harbor registry, our first instinct was that the Harbor deployment itself was broken. However, the Harbor core logs showed no error messages at the times the failures occurred, not even info-level entries.
  2. We then turned to the Harbor portal and checked its access logs; again, nothing abnormal.
  3. Following the request chain, we dug into kubesphere-router-kubesphere-system, KubeSphere's build of the nginx ingress controller. Still no abnormal logs.
  4. We tried accessing Harbor's in-cluster Service address from other Pods in the cluster and saw no timeouts at all. Preliminary conclusion: the problem lay with KubeSphere's bundled Ingress.
  5. We disabled KubeSphere's bundled Ingress Controller and installed the ingress-nginx-controller version recommended by the Kubernetes project. The fault persisted, and the Ingress logs again showed nothing abnormal.
  6. Putting the above together, the problem had to lie between the client and the Ingress Controller. My Ingress Controller is exposed outside the cluster via NodePort, so we tested other Services exposed via NodePort and hit exactly the same failure. At this point the Harbor deployment could be ruled out completely; the problem was clearly between the client and the Ingress Controller.
  7. When an external client reaches the Ingress Controller through a NodePort, the traffic passes through kube-proxy. Analyzing the kube-proxy logs turned up this warning:
can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1

This warning appears because the kernel on my CentOS 7.6 nodes is too old (currently 3.10.0-1160.21.1.el7.x86_64) and has compatibility problems with the IPVS mode used by newer Kubernetes releases.

It can be fixed by upgrading the operating system kernel.
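Before reaching for a kernel upgrade, this diagnosis is easy to confirm on your own nodes. A minimal check, assuming kube-proxy runs as the usual kube-system DaemonSet labelled k8s-app=kube-proxy (as in kubeadm/KubeKey clusters) and is in IPVS mode:

# Kernel on the node: anything older than 4.1 triggers the warning
uname -r

# kube-proxy should be logging the conn_reuse_mode warning
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=500 | grep -i conn_reuse_mode

# Confirm kube-proxy really is running in IPVS mode
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'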

  8. After the kernel upgrade, Calico failed to start and reported the following error:

    ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.

The cause is that the Calico version KubeSphere installs by default is v3.20.0, which does not support the latest Linux kernels. The upgraded kernel is 5.18.1-1.el7.elrepo.x86_64, so Calico needs to be upgraded to v3.23.0 or later.
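Before upgrading, it is worth confirming which Calico images the cluster is actually running. A quick check, assuming the DaemonSet is named calico-node in kube-system as in my cluster:

# Print the images of the main container and the init containers
kubectl -n kube-system get ds calico-node \
  -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}{.spec.template.spec.initContainers[*].image}{"\n"}'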

  9. After the Calico upgrade, Calico kept reporting errors:

    user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"

There was one more error of the same kind. Both come from insufficient resource permissions in the ClusterRole and can be fixed by amending the ClusterRole.
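You do not have to wait for Calico to log the error to know whether a permission is still missing; kubectl can ask the API server directly on behalf of the service account. For example, for the caliconodestatuses resource from the message above:

# "no" means the ClusterRole still needs to be extended
kubectl auth can-i list caliconodestatuses.crd.projectcalico.org \
  --as=system:serviceaccount:kube-system:calico-node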

  10. With that, the baffling network problem was solved.

Resolution

Based on the analysis above, the main steps of the fix are as follows:

Upgrade the operating system kernel

  1. Switch to Alibaba Cloud's yum mirror
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
  2. Enable the elrepo repository
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  3. Install the latest mainline kernel
yum --enablerepo=elrepo-kernel install kernel-ml
  4. List all kernels available on the system
awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
  5. Set the new kernel as the grub2 default

In the list of available kernels returned by step 4, the first entry (index 0) should, barring surprises, be the newly installed kernel.

grub2-set-default 0
  6. Generate the grub configuration and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot now
  7. Verify the running kernel
uname -r
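Besides uname -r, it is worth checking that the IPVS modules load on the new kernel and that kube-proxy has stopped printing the warning once its pods restart. A rough sketch, under the same kube-proxy label assumption as before:

# IPVS kernel modules should be loadable on the new kernel
lsmod | grep ip_vs

# The conn_reuse_mode warning should no longer appear
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -i conn_reuse_mode || echo "warning gone"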

Upgrade Calico

On Kubernetes, Calico is normally deployed as a DaemonSet; in my cluster the Calico DaemonSet is named calico-node.

Export it straight to a YAML file, change every image tag in the file to the latest version, v3.23.1, and recreate the DaemonSet.

  1. Export the YAML
kubectl -n kube-system get ds calico-node -o yaml > calico-node.yaml
  2. calico-node.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
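The manifest above already carries the new tags. If you prefer not to edit the export by hand, a sed one-liner can do the tag bump, and kubectl apply plus a rollout check stands in for deleting and recreating the DaemonSet; this sketch assumes the exported file still references the default v3.20.0 images:

# Bump every image tag in the exported manifest, then apply and watch the rollout
sed -i 's#:v3.20.0#:v3.23.1#g' calico-node.yaml
kubectl apply -f calico-node.yaml
kubectl -n kube-system rollout status ds/calico-node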

ClusterRole

The ClusterRole also needs to be amended; otherwise Calico will keep reporting permission errors.

  1. Export the YAML
kubectl get clusterrole calico-node -o yaml > calico-node-clusterrole.yaml
  2. calico-node-clusterrole.yaml:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get
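Once the crd.projectcalico.org rule above includes the missing resources (caliconodestatuses, ipreservations, and so on), apply the file and check the calico-node logs. RBAC changes take effect immediately; the restart below is only there to give a clean log to read:

kubectl apply -f calico-node-clusterrole.yaml
kubectl -n kube-system rollout restart ds/calico-node
kubectl -n kube-system logs -l k8s-app=calico-node -c calico-node --tail=50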

Summary

In the end, this strange network failure came down to a mismatch between the KubeSphere version and the Kubernetes version. In a working environment, stability should come first: don't rush to adopt the newest releases, or you will lose a lot of time chasing inexplicable problems.
