Using GPU Operator and KubeSphere to Simplify Deep Learning Training and GPU Monitoring

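This article introduces the GPU Operator step by step, from principles to practice: its concepts, installation and deployment, deploying a deep learning test application, and connecting GPU monitoring to a custom monitoring dashboard in KubeSphere.

As is well known, Kubernetes gives access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand adapters, and other devices through its device plugin framework. However, configuring and managing nodes with these resources requires setting up several software components, such as drivers, container runtimes, and other dependent libraries, which is difficult and error-prone.

The NVIDIA GPU Operator, open-sourced by NVIDIA, uses the Kubernetes Operator pattern to automate the integration and management of the NVIDIA components that GPUs require, which addresses the pain points described above. These components include the NVIDIA driver (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA container runtime, automatic node labeling, DCGM-based monitoring, and more.

The GPU Operator not only integrates devices and components in one place, it also makes managing GPU nodes as convenient as managing CPU nodes, with no need to provide a special operating system for GPU nodes. Notably, it containerizes each GPU component to provide GPU capability, which makes it well suited to rapidly scaling and managing GPU nodes at scale. For scenarios where a special operating system has already been built with the GPU components, however, it is a less suitable fit.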

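As mentioned above, the GPU Operator makes managing GPU nodes as convenient as managing CPU nodes. How does it achieve this?

Let's look at the GPU Operator runtime architecture diagram:

As the diagram describes, the GPU Operator relies on the NVIDIA container runtime, which takes the runC spec as input and injects a script named nvidia-container-toolkit into runC's prestart hook. That script calls the libnvidia-container CLI with an appropriate set of flags, so that the container has GPU capability once it is running.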
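To make the hook mechanism more concrete, here is a minimal Python sketch of the idea: take an OCI runtime spec (as a dictionary), add a prestart hook, and hand the result on. The hook path and arguments are assumptions for illustration, the function name `inject_prestart_hook` is hypothetical, and the real runtime is not implemented this way in Python.

```python
import copy

# Assumed hook location and arguments, for illustration only.
HOOK = {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"],
}


def inject_prestart_hook(oci_spec, hook=HOOK):
    """Return a copy of an OCI runtime spec with `hook` added to hooks.prestart."""
    spec = copy.deepcopy(oci_spec)
    prestart = spec.setdefault("hooks", {}).setdefault("prestart", [])
    if hook not in prestart:  # stay idempotent if the hook is already present
        prestart.append(copy.deepcopy(hook))
    return spec


# A tiny, hypothetical spec; a real one has many more fields.
original = {"ociVersion": "1.0.2", "process": {"args": ["python", "train.py"]}}
patched = inject_prestart_hook(original)
print(patched["hooks"]["prestart"][0]["path"])
```

The input spec is left untouched, which mirrors the idea that the wrapper only decorates the spec before runC consumes it.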

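Before installing the GPU Operator, prepare the environment as follows:

Nodes do not need NVIDIA components (driver, container runtime, device plugin) installed in advance;
Every node must have Docker, CRI-O, or containerd configured. For Docker, see the official Docker documentation;
On Ubuntu 18.04 LTS with an HWE kernel (e.g. kernel 5.x), the nouveau driver must be blacklisted and the initramfs updated: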

$ sudo vim /etc/modprobe.d/blacklist.conf # append the blacklist entries at the end of the file
blacklist nouveau
options nouveau modeset=0
$ sudo update-initramfs -u
$ sudo reboot
$ lsmod | grep nouveau # verify that nouveau is disabled (no output expected)
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c  # the CPU used for testing in this article is Broadwell
16 Intel Core Processor (Broadwell)

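Node Feature Discovery (NFD) must be configured on every node. It is installed by default; if it is already configured, set the Helm chart variable nfd.enabled to false before installing;
With Kubernetes 1.13 or 1.14, KubeletPodResources must be enabled.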

OS Name / Version	Identifier	amd64 / x86_64	ppc64le	arm64 / aarch64
Amazon Linux 1	amzn1	X
Amazon Linux 2	amzn2	X
Amazon Linux 2017.09	amzn2017.09	X
Amazon Linux 2018.03	amzn2018.03	X
OpenSUSE Leap 15.0	sles15.0	X
OpenSUSE Leap 15.1	sles15.1	X
Debian Linux 9	debian9	X
Debian Linux 10	debian10	X
CentOS 7	centos7	X	X
CentOS 8	centos8	X	X	X
RHEL 7.4	rhel7.4	X	X
RHEL 7.5	rhel7.5	X	X
RHEL 7.6	rhel7.6	X	X
RHEL 7.7	rhel7.7	X	X
RHEL 8.0	rhel8.0	X	X	X
RHEL 8.1	rhel8.1	X	X	X
RHEL 8.2	rhel8.2	X	X	X
Ubuntu 16.04	ubuntu16.04	X	X
Ubuntu 18.04	ubuntu18.04	X	X	X
Ubuntu 20.04	ubuntu20.04	X	X	X

Container Runtime / Version	amd64 / x86_64	ppc64le	arm64 / aarch64
Docker 18.09	X	X	X
Docker 19.03	X	X	X
RHEL/CentOS 8 podman	X
CentOS 8 Docker	X
RHEL/CentOS 7 Docker	X

See the official Docker documentation.

Configure the stable repository and the GPG key:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

After updating the package index, install nvidia-docker2 and add the runtime configuration:

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
-----
What would you like to do about it ?  Your options are:
Y or I  : install the package maintainer's version
N or O  : keep your currently-installed version
D     : show the differences between the versions
Z     : start a shell to examine the situation
-----
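# On a first installation, you can answer N to the interactive prompt above
# Answering Y overwrites some of your default configuration
# After answering N, add the following configuration to /etc/docker/daemon.json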
{
  "runtimes": {
      "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "runtimeArgs": []}
  }
}
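If /etc/docker/daemon.json already contains other settings, the runtime entry has to be merged in rather than pasted over them. The following Python sketch shows one way to do that merge on the file's text. The helper name `add_nvidia_runtime` and the sample existing configuration (including the mirror URL) are hypothetical; only the `runtimes.nvidia` entry comes from the configuration above.

```python
import json

# The runtime entry shown in the configuration above.
NVIDIA_RUNTIME = {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}


def add_nvidia_runtime(daemon_json_text):
    """Merge the nvidia runtime into daemon.json text, keeping all other keys."""
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config.setdefault("runtimes", {})["nvidia"] = dict(NVIDIA_RUNTIME)
    return json.dumps(config, indent=2)


# Hypothetical existing configuration; yours will differ.
existing = '{"log-driver": "json-file", "registry-mirrors": ["https://mirror.example.com"]}'
print(add_nvidia_runtime(existing))
```

Writing the result back to the file and restarting Docker are left out on purpose, since both need root privileges.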

Restart Docker:

$ sudo systemctl restart docker

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
   && chmod 700 get_helm.sh \
   && ./get_helm.sh

Add the Helm repository:

$ helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update

$ kubectl create ns gpu-operator-resources
$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources --wait

To pin a specific driver version, use something like the following:

$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
--set driver.version="450.80.02"

# With CRI-O as the container runtime:
$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
   --set operator.defaultRuntime=crio

# With containerd as the container runtime:
$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
   --set operator.defaultRuntime=containerd
   
Furthermore, when setting containerd as the defaultRuntime the following options are also available:
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: true

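Because the images are large, the first installation may time out, so check whether your images are still being pulled. An offline installation can work around this; see the offline installation guide listed in the references.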

$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources -f values.yaml

$ kubectl get pods -n gpu-operator-resources
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4gk78                                   1/1     Running     0          35s
gpu-operator-858fc55fdb-jv488                                 1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-master-7f9ccc4c7b-2sg6r   1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-worker-cbkhn              1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-worker-m8jcm              1/1     Running     0          2m52s
nvidia-container-toolkit-daemonset-tfwqt                      1/1     Running     0          2m42s
nvidia-dcgm-exporter-mqns5                                    1/1     Running     0          38s
nvidia-device-plugin-daemonset-7npbs                          1/1     Running     0          53s
nvidia-device-plugin-validation                               0/1     Completed   0          49s
nvidia-driver-daemonset-hgv6s                                 1/1     Running     0          2m47s

$ kubectl describe node worker-gpu-001
---
Allocatable:
  cpu:                15600m
  ephemeral-storage:  82435528Ki
  hugepages-2Mi:      0
  memory:             63649242267
  nvidia.com/gpu:     1  #check here
  pods:               110
---
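The same check can be scripted. As a sketch, the function below reads the JSON form of a node object, such as the output of `kubectl get node worker-gpu-001 -o json`, and returns the allocatable GPU count. The function name and the trimmed sample document are illustrative assumptions; a real node object has many more fields.

```python
import json


def allocatable_gpus(node_json_text):
    """Return the allocatable nvidia.com/gpu count from a node's JSON, or 0 if absent."""
    node = json.loads(node_json_text)
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get("nvidia.com/gpu", 0))


# Trimmed, hypothetical stand-in for real kubectl output.
sample = json.dumps({
    "metadata": {"name": "worker-gpu-001"},
    "status": {"allocatable": {"cpu": "15600m", "nvidia.com/gpu": "1", "pods": "110"}},
})
print(allocatable_gpus(sample))  # prints 1
```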

$ cat cuda-load-generator.yaml
apiVersion: v1
kind: Pod
metadata:
   name: dcgmproftester
spec:
   restartPolicy: OnFailure
   containers:
   - name: dcgmproftester11
     image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
     args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
     resources:
        limits:
           nvidia.com/gpu: 1
     securityContext:
        capabilities:
           add: ["SYS_ADMIN"]

$ curl -LO https://nvidia.github.io/gpu-operator/notebook-example.yml
$ cat notebook-example.yml
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30001
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888

$ kubectl apply -f cuda-load-generator.yaml 
pod/dcgmproftester created
$ kubectl apply -f notebook-example.yml       
service/tf-notebook created
pod/tf-notebook created

Check that the GPU is now allocated:

$ kubectl describe node worker-gpu-001
---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1087m (6%)   1680m (10%)
  memory             1440Mi (2%)  1510Mi (2%)
  ephemeral-storage  0 (0%)       0 (0%)
  nvidia.com/gpu     1            1 #check this
Events:              <none>

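When GPU jobs are submitted to the platform, GPU resources move from allocatable to allocated. The node here has a single GPU, so jobs run in submission order and the second job starts only after the first has finished: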

$ kubectl get pods --watch
NAME             READY   STATUS    RESTARTS   AGE
dcgmproftester   1/1     Running   0          76s
tf-notebook      0/1     Pending   0          58s
------
NAME             READY   STATUS      RESTARTS   AGE
dcgmproftester   0/1     Completed   0          4m22s
tf-notebook      1/1     Running     0          4m4s

Get the application's port:

$ kubectl get svc # get the nodeport of the svc, 30001
gpu-operator-1611672791-node-feature-discovery   ClusterIP   10.233.10.222   <none>        8080/TCP       12h
kubernetes                                       ClusterIP   10.233.0.1      <none>        443/TCP        12h
tf-notebook                                      NodePort    10.233.53.116   <none>        80:30001/TCP   7m52s

Check the logs to get the login token:

$ kubectl logs tf-notebook 
[I 21:50:23.188 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 21:50:23.390 NotebookApp] Serving notebooks from local directory: /tf
[I 21:50:23.391 NotebookApp] The Jupyter Notebook is running at:
[I 21:50:23.391 NotebookApp] http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
[I 21:50:23.391 NotebookApp]  or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
[I 21:50:23.391 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 21:50:23.394 NotebookApp]
   To access the notebook, open this file in a browser:
      file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
   Or copy and paste one of these URLs:
      http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
   or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9

Once inside the Jupyter Notebook environment, open a terminal to run a deep learning job:

In the terminal, pull the TensorFlow test code and run it:
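The test code itself is not reproduced in this article. As a stand-in, a quick check like the following confirms that TensorFlow inside the notebook can see the GPU before a longer training run is started. It is a minimal sketch that assumes TensorFlow 2.x is installed, as in the tensorflow/tensorflow:latest-gpu-jupyter image; the helper name `visible_gpus` is hypothetical.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
print(f"{len(gpus)} GPU(s) visible:", gpus)
```

An empty list inside the notebook pod usually means the container did not receive a GPU, which is worth checking before blaming the training code.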

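Meanwhile, open a second terminal and run nvidia-smi to watch GPU usage:

The GPU Operator ships the nvidia-dcgm-exporter exporter. Adding it to Prometheus's scrape targets, that is, defining a ServiceMonitor for it, is all that is needed to collect GPU monitoring data: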

$ kubectl get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ff4ng                1/1     Running     2          15h
nvidia-container-toolkit-daemonset-2vxjz   1/1     Running     0          15h
nvidia-dcgm-exporter-pqwfv                 1/1     Running     0          5h27m #here
nvidia-device-plugin-daemonset-42n74       1/1     Running     0          5h27m
nvidia-device-plugin-validation            0/1     Completed   0          5h27m
nvidia-driver-daemonset-dvd9r              1/1     Running     3          15h

You can start a busybox pod to inspect the metrics the exporter exposes:

$ kubectl get svc -n gpu-operator-resources
NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
gpu-operator-node-feature-discovery   ClusterIP   10.233.54.111   <none>        8080/TCP   56m
nvidia-dcgm-exporter                  ClusterIP   10.233.53.196   <none>        9400/TCP   54m
$ kubectl exec -it busybox-sleep -- sh
$ wget http://nvidia-dcgm-exporter.gpu-operator-resources:9400/metrics
$ cat metrics
----
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-eeff7856-475a-2eb7-6408-48d023d9dd28",device="nvidia0",container="tf-notebook",namespace="default",pod="tf-notebook"} 405
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-eeff7856-475a-2eb7-6408-48d023d9dd28",device="nvidia0",container="tf-notebook",namespace="default",pod="tf-notebook"} 715
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-eeff7856-475a-2eb7-6408-48d023d9dd28",device="nvidia0",container="tf-notebook",namespace="default",pod="tf-notebook"} 30
----
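Each line in this output is a Prometheus sample: a metric name, a set of labels in braces, and a value. The Python sketch below parses lines of that shape, which can be handy for a quick sanity check without Prometheus. It handles only labeled samples like the ones shown and is not a full parser for the exposition format; the sample text uses shortened label sets for brevity.

```python
import re

# metric_name{label="value",...} number
LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) for labeled sample lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"]))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


sample = '''DCGM_FI_DEV_SM_CLOCK{gpu="0",pod="tf-notebook"} 405
DCGM_FI_DEV_GPU_TEMP{gpu="0",pod="tf-notebook"} 30'''
samples = parse_metrics(sample)
for name, labels, value in samples:
    print(name, labels["pod"], value)
```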

Inspect the Service and Endpoints exposed by nvidia-dcgm-exporter:

$ kubectl describe svc nvidia-dcgm-exporter -n gpu-operator-resources
Name:                     nvidia-dcgm-exporter
Namespace:                gpu-operator-resources
Labels:                   app=nvidia-dcgm-exporter
Annotations:              prometheus.io/scrape: true
Selector:                 app=nvidia-dcgm-exporter
Type:                     NodePort
IP:                       10.233.28.200
Port:                     gpu-metrics  9400/TCP
TargetPort:               9400/TCP
NodePort:                 gpu-metrics  31129/TCP
Endpoints:                10.233.84.54:9400
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Define the ServiceMonitor manifest:

$ cat custom/gpu-servicemonitor.yaml 
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator-resources 
  labels:
     app: nvidia-dcgm-exporter
spec:
  jobLabel: nvidia-gpu
  endpoints:
  - port: gpu-metrics
    interval: 15s
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
$ kubectl apply -f custom/gpu-servicemonitor.yaml
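For the ServiceMonitor to pick up the exporter, three things have to line up: the namespace, the label selector, and the named port. The sketch below models that matching in Python using the values from the manifest and the Service description above. It is a simplified illustration (it ignores matchExpressions and other selector features), and the function names are hypothetical.

```python
def selector_matches(match_labels, labels):
    """True if every key/value in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def monitored_ports(service_monitor, service):
    """Return the Service port numbers this ServiceMonitor would scrape."""
    spec = service_monitor["spec"]
    if service["namespace"] not in spec["namespaceSelector"]["matchNames"]:
        return []
    if not selector_matches(spec["selector"]["matchLabels"], service["labels"]):
        return []
    wanted = {endpoint["port"] for endpoint in spec["endpoints"]}
    return [port["port"] for port in service["ports"] if port["name"] in wanted]


service_monitor = {"spec": {
    "endpoints": [{"port": "gpu-metrics", "interval": "15s"}],
    "selector": {"matchLabels": {"app": "nvidia-dcgm-exporter"}},
    "namespaceSelector": {"matchNames": ["gpu-operator-resources"]},
}}
service = {
    "namespace": "gpu-operator-resources",
    "labels": {"app": "nvidia-dcgm-exporter"},
    "ports": [{"name": "gpu-metrics", "port": 9400}],
}
print(monitored_ports(service_monitor, service))  # prints [9400]
```

If the metrics do not show up in Prometheus, a mismatch in one of these three fields is a common first thing to check.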

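After submitting the ServiceMonitor to the KubeSphere platform, expose prometheus-k8s as a NodePort so that you can verify in the Prometheus UI that the metrics are being collected:

`KubeSphere 3.0`

If you are running KubeSphere 3.0, a few simple steps are enough to set up observability monitoring.

First, log in to the KubeSphere console and create a workspace named ks-monitoring-demo (choose any name you like);

Next, assign gpu-operator-resources, the namespace that contains the ServiceMonitor, to an existing workspace so that it is brought under monitoring.

Finally, enter the target workspace, find gpu-operator-resources among its managed projects, click it, and open the custom monitoring page to add custom monitoring.

In later versions, you can choose to add cluster-level monitoring.

Download the dashboard and set the namespace: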

$ curl -LO https://raw.githubusercontent.com/kubesphere/monitoring-dashboard/master/contrib/gallery/nvidia-gpu-dcgm-exporter-dashboard.yaml
$ cat nvidia-gpu-dcgm-exporter-dashboard.yaml
----
apiVersion: monitoring.kubesphere.io/v1alpha1
kind: Dashboard
metadata:
  name: nvidia-dcgm-exporter-dashboard-rev1
  namespace: gpu-operator-resources  # check here
spec:
-----

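You can apply it directly from the command line, or import it in edit mode from the custom monitoring panel:

After a successful import:

After running the deep learning test job in the Jupyter notebook created above, you can clearly see the GPU metrics change: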

$ helm list -n gpu-operator-resources
NAME            NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator    gpu-operator-resources  1               2021-02-20 11:50:56.162559286 +0800 CST deployed        gpu-operator-1.5.2      1.5.2     
$ helm uninstall gpu-operator -n gpu-operator-resources

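On a cluster where the GPU Operator and AI applications were already deployed and running normally, the GPU may become unusable after the GPU host is restarted. The most likely cause is that the applications were loaded before the plugins had finished loading. In that case, make sure the plugins are running normally first, then redeploy the applications.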
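The "plugins first, applications second" rule can be turned into a small pre-flight check. The sketch below takes pod names and statuses, as listed by `kubectl get pods -n gpu-operator-resources`, and reports whether the GPU components are all Running. Which components count as required is an assumption made here for illustration, and the function name is hypothetical.

```python
# Assumed set of components that must be up before GPU workloads are redeployed.
REQUIRED_PREFIXES = (
    "nvidia-driver-daemonset",
    "nvidia-container-toolkit-daemonset",
    "nvidia-device-plugin-daemonset",
)


def plugins_ready(pods):
    """pods: list of (name, status). True if every required component is Running."""
    for prefix in REQUIRED_PREFIXES:
        statuses = [status for name, status in pods if name.startswith(prefix)]
        if not statuses or any(status != "Running" for status in statuses):
            return False
    return True


after_reboot = [
    ("nvidia-driver-daemonset-dvd9r", "Running"),
    ("nvidia-container-toolkit-daemonset-2vxjz", "Running"),
    ("nvidia-device-plugin-daemonset-42n74", "ContainerCreating"),
]
print(plugins_ready(after_reboot))  # prints False: wait before redeploying
```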

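Q: I previously monitored GPUs with a combination of https://github.com/NVIDIA/k8s… and https://github.com/NVIDIA/gpu…. How does that approach compare with the GPU Operator approach, and which is better?

A: In my view, the GPU Operator is simpler to use. Its built-in GPU injection means you do not need to build a dedicated OS, it supports node discovery and is pluggable, and it automates the integration and management of the NVIDIA components that GPUs require, which saves a good deal of effort.

Q: Is there a detailed tutorial on KubeSphere custom monitoring?

A: See the official KubeSphere documentation on custom monitoring.

GitHub: https://github.com/NVIDIA/gpu…
GitLab: https://gitlab.com/nvidia/kub…

GPU Operator quick start: https://docs.nvidia.com/datac…
GPU Operator offline installation guide: https://docs.nvidia.com/datac…
KubeSphere custom monitoring documentation: https://kubesphere.com.cn/doc…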

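Introduction to GPU Operator
GPU Operator architecture
Installing GPU Operator
Prerequisites
Supported Linux distributions
Supported container runtimes
Installing Docker
Installing NVIDIA Docker
Installing Helm
Installing the NVIDIA GPU Operator
Docker as runtime
CRI-O as runtime
containerd as runtime
Installing with values.yaml
Considering offline installation
Application deployment
Checking the status of the deployed operator services
Checking pod status
Checking that node resources are allocatable
Deploying the two examples from the official documentation
Example 1
Example 2
Running a deep learning training job with Jupyter Notebook
Deploying the applications
Running the deep learning job
Monitoring GPUs with KubeSphere custom monitoring
Deploying the ServiceMonitor
Checking that GPU metrics are collected (optional)
Creating a KubeSphere custom GPU monitoring dashboard
KubeSphere 3.0
Later versions
Creating custom monitoring
Uninstalling
GPU unavailable after a restart
GPU Operator FAQ
GPU Operator unusable after a restart
NVIDIA k8s-device-plugin versus GPU Operator?
Is there a detailed tutorial on KubeSphere custom monitoring?
References
Official code repositories
Official documentation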