This article walks through GPU-Operator from principle to practice: an introduction to the concept, installation and deployment, deploying a deep learning training test application, and hooking GPU monitoring into a KubeSphere custom monitoring dashboard.

Introduction to GPU-Operator

As is well known, the Kubernetes platform provides access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand adapters, and other devices through its device plugin framework. However, configuring and managing nodes that carry these hardware resources requires several software components, such as drivers, container runtimes, and other dependent libraries, which is difficult and error-prone.

NVIDIA GPU Operator, open-sourced by NVIDIA, uses the Operator pattern of the Kubernetes platform to automate the integration and management of the NVIDIA components needed for GPUs, effectively addressing the GPU integration pain points described above. These components include the NVIDIA driver (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and more.

NVIDIA GPU Operator not only bundles the devices and components into one integrated package, it also makes managing GPU nodes as convenient as managing CPU nodes, with no need to provide a special operating system for GPU nodes. Notably, it containerizes every GPU component to deliver GPU capability, which makes it well suited for quickly scaling and managing large numbers of GPU nodes. Of course, for scenarios where a purpose-built operating system for the GPU stack is already in place, it is not such a good fit.

GPU-Operator Architecture

As mentioned above, NVIDIA GPU Operator makes managing GPU nodes as convenient as managing CPU nodes. How does it achieve this?

Let's take a look at the runtime architecture diagram of GPU-Operator:

As the diagram describes, GPU-Operator relies on an NVIDIA container runtime that wraps runC: it injects a script called nvidia-container-toolkit into runC's preStart hook, and that script invokes the libnvidia-container CLI with the appropriate set of flags so that the container has GPU capability once it starts.
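
To make that mechanism more concrete, the snippet below is a rough sketch of how such a prestart hook appears in a container's OCI runtime spec (config.json). The hook path and arguments here are illustrative assumptions; in practice the NVIDIA container runtime generates this entry for you, and the hook then calls the libnvidia-container CLI to mount the driver libraries and device nodes into the container before its entrypoint runs.

{
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"]
      }
    ]
  }
}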

GPU-Operator Installation Notes

Prerequisites

Before installing GPU Operator, prepare the installation environment as follows:

  • None of the nodes should have the NVIDIA components (driver, container runtime, device plugin) pre-installed;
  • All nodes must have Docker, cri-o, or containerd configured. For Docker, you can refer to the instructions here;
  • On Ubuntu 18.04 LTS with an HWE kernel (e.g. kernel 5.x), you need to blacklist the nouveau driver and update the initramfs:
$ sudo vim /etc/modprobe.d/blacklist.conf   # append the blacklist entries at the end:
blacklist nouveau
options nouveau modeset=0
$ sudo update-initramfs -u
$ reboot
$ lsmod | grep nouveau   # verify that nouveau is disabled
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c   # the processor used for this article is codenamed Broadwell
16  Intel Core Processor (Broadwell)
  • Node Feature Discovery (NFD) must be configured on every node. It is installed by default; if NFD is already deployed in the cluster, set the Helm chart variable nfd.enabled to false before installing (see the example after this list);
  • If you are using Kubernetes 1.13 or 1.14, you need to enable the KubeletPodResources feature gate (also shown below);
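
For reference, here is a minimal sketch of the two settings mentioned above. The helm command mirrors the install commands used later in this article; the kubelet flag location depends on how your cluster was deployed, so treat it as an illustration:

# Skip the bundled NFD if Node Feature Discovery is already running in the cluster
$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
   --set nfd.enabled=false

# On Kubernetes 1.13/1.14, enable the feature gate on every kubelet, for example:
#   KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true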

Supported Linux Versions

| OS Name / Version    | Identifier   | amd64 / x86_64 | ppc64le | arm64 / aarch64 |
| -------------------- | ------------ | -------------- | ------- | --------------- |
| Amazon Linux 1       | amzn1        | X              |         |                 |
| Amazon Linux 2       | amzn2        | X              |         |                 |
| Amazon Linux 2017.09 | amzn2017.09  | X              |         |                 |
| Amazon Linux 2018.03 | amzn2018.03  | X              |         |                 |
| Open Suse Leap 15.0  | sles15.0     | X              |         |                 |
| Open Suse Leap 15.1  | sles15.1     | X              |         |                 |
| Debian Linux 9       | debian9      | X              |         |                 |
| Debian Linux 10      | debian10     | X              |         |                 |
| Centos 7             | centos7      | X              | X       |                 |
| Centos 8             | centos8      | X              | X       | X               |
| RHEL 7.4             | rhel7.4      | X              | X       |                 |
| RHEL 7.5             | rhel7.5      | X              | X       |                 |
| RHEL 7.6             | rhel7.6      | X              | X       |                 |
| RHEL 7.7             | rhel7.7      | X              | X       |                 |
| RHEL 8.0             | rhel8.0      | X              | X       | X               |
| RHEL 8.1             | rhel8.1      | X              | X       | X               |
| RHEL 8.2             | rhel8.2      | X              | X       | X               |
| Ubuntu 16.04         | ubuntu16.04  | X              | X       |                 |
| Ubuntu 18.04         | ubuntu18.04  | X              | X       | X               |
| Ubuntu 20.04         | ubuntu20.04  | X              | X       | X               |

Supported Container Runtimes

| OS Name / Version     | amd64 / x86_64 | ppc64le | arm64 / aarch64 |
| --------------------- | -------------- | ------- | --------------- |
| Docker 18.09          | X              | X       | X               |
| Docker 19.03          | X              | X       | X               |
| RHEL/CentOS 8 podman  | X              |         |                 |
| CentOS 8 Docker       | X              |         |                 |
| RHEL/CentOS 7 Docker  | X              |         |                 |

Install the Docker Environment

Refer to the official Docker documentation.

Install NVIDIA Docker

Configure the stable repository and GPG key:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

Update the package index, install nvidia-docker2, and add the runtime configuration:

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
-----
What would you like to do about it ?  Your options are:
Y or I  : install the package maintainer's version
N or O  : keep your currently-installed version
D       : show the differences between the versions
Z       : start a shell to examine the situation
-----
# On a first install, choose N when you hit the interactive prompt above.
# Choosing Y would overwrite some of your existing default configuration.
# After choosing N, add the following configuration to /etc/docker/daemon.json:
{
  "runtimes": {
      "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "runtimeArgs": []
      }
  }
}

Restart Docker:

$ sudo systemctl restart docker

Install Helm

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
   && chmod 700 get_helm.sh \
   && ./get_helm.sh

Add the Helm Repository

$ helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update

Install the NVIDIA GPU Operator

docker as runtime

$ kubectl create ns gpu-operator-resources
$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources --wait

If you need to pin a specific driver version, use the following:

$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
   --set driver.version="450.80.02"

crio as runtime

$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
   --set operator.defaultRuntime=crio

containerd as runtime

$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
   --set operator.defaultRuntime=containerd

Furthermore, when setting containerd as the defaultRuntime the following options are also available:

toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: true

Because the images involved are fairly large, the first installation may appear to time out; check whether your images are still being pulled! To avoid this kind of problem you can consider an offline installation, following the offline installation link in the references.

Install with values.yaml

$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources -f values.yaml
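
As a starting point, a minimal values.yaml might look like the sketch below. It only uses options that appear elsewhere in this article (the default runtime and a pinned driver version); the full set of supported values is documented in the chart itself.

# values.yaml (illustrative)
operator:
  defaultRuntime: docker      # or crio / containerd
driver:
  version: "450.80.02"        # pin a specific driver version if required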

Consider an offline installation (see the offline installation guide in the references).

Application Deployment

Check the Status of the Deployed Operator Services

Check pod status:

$ kubectl get pods -n gpu-operator-resources
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4gk78                                   1/1     Running     0          35s
gpu-operator-858fc55fdb-jv488                                 1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-master-7f9ccc4c7b-2sg6r   1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-worker-cbkhn              1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-worker-m8jcm              1/1     Running     0          2m52s
nvidia-container-toolkit-daemonset-tfwqt                      1/1     Running     0          2m42s
nvidia-dcgm-exporter-mqns5                                    1/1     Running     0          38s
nvidia-device-plugin-daemonset-7npbs                          1/1     Running     0          53s
nvidia-device-plugin-validation                               0/1     Completed   0          49s
nvidia-driver-daemonset-hgv6s                                 1/1     Running     0          2m47s

Check whether the node's GPU resource is allocatable:

$ kubectl describe node worker-gpu-001
---
Allocatable:
  cpu:                15600m
  ephemeral-storage:  82435528Ki
  hugepages-2Mi:      0
  memory:             63649242267
  nvidia.com/gpu:     1   # check here
  pods:               110
---

Deploy the Two Examples from the Official Documentation

Example 1

$ cat << EOF > cuda-load-generator.yaml
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
EOF

Example 2

$ curl -LO https://nvidia.github.io/gpu-operator/notebook-example.yml
$ cat notebook-example.yml
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30001
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888

Run a Deep Learning Training Task from the Jupyter Notebook Application

Deploy the applications:

$ kubectl apply -f cuda-load-generator.yaml
pod/dcgmproftester created
$ kubectl apply -f notebook-example.yml
service/tf-notebook created
pod/tf-notebook created

Verify that the GPU is now in the allocated state:

$ kubectl describe node worker-gpu-001
---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1087m (6%)   1680m (10%)
  memory             1440Mi (2%)  1510Mi (2%)
  ephemeral-storage  0 (0%)       0 (0%)
  nvidia.com/gpu     1            1   # check this
Events:              <none>

When GPU workloads are submitted to the platform, the GPU resource changes from allocatable to allocated. Following the order in which the workloads were submitted, the second workload starts running only after the first one has finished:

$ kubectl get pods --watch
NAME             READY   STATUS    RESTARTS   AGE
dcgmproftester   1/1     Running   0          76s
tf-notebook      0/1     Pending   0          58s
------
NAME             READY   STATUS      RESTARTS   AGE
dcgmproftester   0/1     Completed   0          4m22s
tf-notebook      1/1     Running     0          4m4s

Get the application's port information:

$ kubectl get svc   # get the NodePort of the tf-notebook svc, 30001
gpu-operator-1611672791-node-feature-discovery   ClusterIP   10.233.10.222   <none>        8080/TCP       12h
kubernetes                                       ClusterIP   10.233.0.1      <none>        443/TCP        12h
tf-notebook                                      NodePort    10.233.53.116   <none>        80:30001/TCP   7m52s

Check the logs to obtain the login token:

$ kubectl logs tf-notebook
[I 21:50:23.188 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 21:50:23.390 NotebookApp] Serving notebooks from local directory: /tf
[I 21:50:23.391 NotebookApp] The Jupyter Notebook is running at:
[I 21:50:23.391 NotebookApp] http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
[I 21:50:23.391 NotebookApp]  or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
[I 21:50:23.391 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 21:50:23.394 NotebookApp]
   To access the notebook, open this file in a browser:
      file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
   Or copy and paste one of these URLs:
      http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
   or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9

Run a Deep Learning Task

Once inside the Jupyter Notebook environment, open a terminal to run a deep learning task:

In the terminal, pull the TensorFlow test code and run it:
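
The original article's screenshots are not reproduced here; as a stand-in, the following command (run in the Jupyter terminal) launches a small Keras training job that is enough to put visible load on the GPU. It assumes the container can reach the internet to download the MNIST dataset:

$ python - <<'EOF'
import tensorflow as tf

# Confirm that the GPU injected by gpu-operator is visible inside the container
print("GPUs visible:", tf.config.list_physical_devices("GPU"))

# A small Keras training job, just enough to generate GPU load
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=256)
EOF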

At the same time, open another terminal and run nvidia-smi to watch GPU utilization:
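
For example:

# Refresh the utilization, memory usage, and process list every second
$ nvidia-smi -l 1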

Monitoring GPUs with KubeSphere Custom Monitoring

Deploy the ServiceMonitor

gpu-operator already provides the nvidia-dcgm-exporter exporter for us. All we need to do is add it to Prometheus's scrape targets, i.e. register it through a ServiceMonitor, and we can collect GPU monitoring data:

$ kubectl get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ff4ng                1/1     Running     2          15h
nvidia-container-toolkit-daemonset-2vxjz   1/1     Running     0          15h
nvidia-dcgm-exporter-pqwfv                 1/1     Running     0          5h27m   # here
nvidia-device-plugin-daemonset-42n74       1/1     Running     0          5h27m
nvidia-device-plugin-validation            0/1     Completed   0          5h27m
nvidia-driver-daemonset-dvd9r              1/1     Running     3          15h

You can spin up a busybox pod to inspect the metrics exposed by this exporter:

$ kubectl get svc -n gpu-operator-resources
NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
gpu-operator-node-feature-discovery   ClusterIP   10.233.54.111   <none>        8080/TCP   56m
nvidia-dcgm-exporter                  ClusterIP   10.233.53.196   <none>        9400/TCP   54m
$ kubectl exec -it busybox-sleep -- sh
$ wget http://nvidia-dcgm-exporter.gpu-operator-resources:9400/metrics
$ cat metrics
----
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-eeff7856-475a-2eb7-6408-48d023d9dd28",device="nvidia0",container="tf-notebook",namespace="default",pod="tf-notebook"} 405
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-eeff7856-475a-2eb7-6408-48d023d9dd28",device="nvidia0",container="tf-notebook",namespace="default",pod="tf-notebook"} 715
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-eeff7856-475a-2eb7-6408-48d023d9dd28",device="nvidia0",container="tf-notebook",namespace="default",pod="tf-notebook"} 30
----
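
If you do not already have such a pod in the cluster, one way to create the busybox-sleep pod used above (the name and sleep duration are arbitrary) is:

$ kubectl run busybox-sleep --image=busybox --restart=Never -- sleep 86400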

Check the Service and Endpoints exposed by nvidia-dcgm-exporter:

$ kubectl describe svc nvidia-dcgm-exporter -n gpu-operator-resources
Name:                     nvidia-dcgm-exporter
Namespace:                gpu-operator-resources
Labels:                   app=nvidia-dcgm-exporter
Annotations:              prometheus.io/scrape: true
Selector:                 app=nvidia-dcgm-exporter
Type:                     NodePort
IP:                       10.233.28.200
Port:                     gpu-metrics  9400/TCP
TargetPort:               9400/TCP
NodePort:                 gpu-metrics  31129/TCP
Endpoints:                10.233.84.54:9400
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Create the ServiceMonitor manifest:

$ cat custom/gpu-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator-resources
  labels:
    app: nvidia-dcgm-exporter
spec:
  jobLabel: nvidia-gpu
  endpoints:
  - port: gpu-metrics
    interval: 15s
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
$ kubectl apply -f custom/gpu-servicemonitor.yaml

Verify That GPU Metrics Are Being Collected (Optional)

After the ServiceMonitor has been submitted to the KubeSphere platform, you can expose prometheus-k8s through a NodePort and verify on the Prometheus UI that the relevant metrics are being collected:
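
As an illustration, the commands below assume the default KubeSphere 3.0 monitoring stack (a prometheus-k8s service in the kubesphere-monitoring-system namespace); adjust them to your environment:

# Temporarily expose the Prometheus UI through a NodePort
$ kubectl -n kubesphere-monitoring-system patch svc prometheus-k8s -p '{"spec":{"type":"NodePort"}}'
$ kubectl -n kubesphere-monitoring-system get svc prometheus-k8s   # note the assigned NodePort

# In the Prometheus UI, a query such as the following should return samples
# once the ServiceMonitor is being scraped:
#   DCGM_FI_DEV_GPU_TEMP{namespace="default", pod="tf-notebook"}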

Create a KubeSphere GPU Custom Monitoring Dashboard

KubeSphere 3.0

If the deployed KubeSphere version is KubeSphere 3.0, only a few simple steps are needed to complete the observability setup.

First, after logging in to the KubeSphere console, create a workspace named ks-monitoring-demo; the name can be anything you like.

Next, assign gpu-operator-resources, the namespace where the ServiceMonitor lives, to an existing workspace so that it is brought under monitoring.

Finally, enter the target workspace, find the gpu-operator-resources project among the managed projects, click into it, open the custom monitoring page, and add a custom dashboard.

Later Versions

In later versions you can choose to add cluster-level monitoring instead.

Create the Custom Dashboard

Download the dashboard definition and configure its namespace:

$ curl -LO https://raw.githubusercontent.com/kubesphere/monitoring-dashboard/master/contrib/gallery/nvidia-gpu-dcgm-exporter-dashboard.yaml
$ cat nvidia-gpu-dcgm-exporter-dashboard.yaml
----
apiVersion: monitoring.kubesphere.io/v1alpha1
kind: Dashboard
metadata:
  name: nvidia-dcgm-exporter-dashboard-rev1
  namespace: gpu-operator-resources   # check here
spec:
-----

You can apply it directly from the command line, or import it in edit mode in the custom monitoring panel:
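
For the command-line route, applying the downloaded file as-is is enough:

$ kubectl apply -f nvidia-gpu-dcgm-exporter-dashboard.yaml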

After a successful import:

After running the deep learning test task in the Jupyter notebook created above, you can clearly observe the corresponding GPU metrics change:

Uninstall

$ helm list -n gpu-operator-resources
NAME            NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator    gpu-operator-resources  1               2021-02-20 11:50:56.162559286 +0800 CST deployed        gpu-operator-1.5.2      1.5.2
$ helm uninstall gpu-operator -n gpu-operator-resources

GPU-Operator FAQ

GPU-Operator cannot be used after a reboot

A: On a cluster where gpu-operator and AI applications were deployed and running normally, the GPU may become unusable after the GPU host is rebooted. This is most likely because the application was loaded before the plugin was ready. In that case, simply make sure the plugin is running properly first, and then redeploy the application.
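
A sketch of that recovery sequence, reusing the tf-notebook example from earlier in this article:

# 1. Wait until the GPU components are back and the node's GPU is allocatable again
$ kubectl get pods -n gpu-operator-resources
$ kubectl describe node worker-gpu-001 | grep nvidia.com/gpu

# 2. Then redeploy the application
$ kubectl delete -f notebook-example.yml
$ kubectl apply -f notebook-example.yml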

How does the NVIDIA k8s-device-plugin approach compare with GPU-Operator?

Q: I previously monitored GPUs with a combination of https://github.com/NVIDIA/k8s... and https://github.com/NVIDIA/gpu... How does that approach compare with the GPU-Operator approach? Which is better?

A: Personally, I find GPU-Operator simpler and easier to use. It comes with built-in GPU injection, so there is no need to build a dedicated OS, it supports node discovery and is pluggable, and it automates the integration and management of the NVIDIA components needed for GPUs. On the whole it saves a lot of effort.

Is there a detailed tutorial for KubeSphere custom monitoring?

A: You can refer to the official KubeSphere documentation on custom monitoring.

References

Official Code Repositories

  • GitHub: https://github.com/NVIDIA/gpu...
  • GitLab: https://gitlab.com/nvidia/kub...

Official Documentation

  • GPU-Operator quick start: https://docs.nvidia.com/datac...
  • GPU-Operator offline installation guide: https://docs.nvidia.com/datac...
  • KubeSphere custom monitoring documentation: https://kubesphere.com.cn/doc...