Device plugin extensions: a walkthrough of the Intel SR-IOV device plugin


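Learn the right way to read SR-IOV NIC information, and get to know the related open-source Go libraries.

Before reading the code, let's look at a sample configuration file provided by the upstream project: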

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "intel_sriov_netdevice",
                "selectors": {"vendors": ["8086"], 
                    "devices": ["154c", "10ed"],
                    "drivers": ["i40evf", "ixgbevf"]
                }
            },
            {
                "resourceName": "intel_sriov_dpdk",
                "selectors": {"vendors": ["8086"],
                    "devices": ["154c", "10ed"],
                    "drivers": ["vfio-pci"],
                    "pfNames": ["enp0s0f0","enp2s2f1"]
                }
            },
            {
                "resourceName": "mlnx_sriov_rdma",
                "isRdma": true,
                "selectors": {"vendors": ["15b3"],
                    "devices": ["1018"],
                    "drivers": ["mlx5_ib"]
                }
            }
        ]
    }

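The selectors filter the PCI network devices found on the machine. A NIC is selected only if each of its attributes appears in the corresponding array in the configuration file.

vendors is the vendor ID, devices is the device ID, drivers is the bound driver, and pfNames is the interface name of the PF that a VF belongs to.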
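The matching rule described above can be sketched in a few lines of Go. This is a minimal illustration, not the plugin's own code: the netDevice and selectors types and the matches function are hypothetical names, and treating an empty selector array as "no restriction" is an assumption.

```go
package main

import "fmt"

// netDevice is a hypothetical, trimmed-down stand-in for the plugin's device type.
type netDevice struct {
	Vendor, Device, Driver, PfName string
}

// selectors mirrors the "selectors" object in config.json.
type selectors struct {
	Vendors, Devices, Drivers, PfNames []string
}

// contains reports whether v is one of the allowed values.
func contains(allowed []string, v string) bool {
	for _, a := range allowed {
		if a == v {
			return true
		}
	}
	return false
}

// matches applies every non-empty selector in turn.
// Assumption: an empty selector array places no restriction on the device.
func matches(d netDevice, s selectors) bool {
	checks := []struct {
		allowed []string
		value   string
	}{
		{s.Vendors, d.Vendor},
		{s.Devices, d.Device},
		{s.Drivers, d.Driver},
		{s.PfNames, d.PfName},
	}
	for _, c := range checks {
		if len(c.allowed) > 0 && !contains(c.allowed, c.value) {
			return false
		}
	}
	return true
}

func main() {
	sel := selectors{
		Vendors: []string{"8086"},
		Devices: []string{"154c", "10ed"},
		Drivers: []string{"i40evf", "ixgbevf"},
	}
	vf := netDevice{Vendor: "8086", Device: "154c", Driver: "i40evf", PfName: "enp0s0f0"}
	fmt.Println(matches(vf, sel))
}
```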

For example, given a PCI device whose BDF is 0000:00:11.4, we can get its IDs with lspci -n | grep 00:11.4:

lspci  -n |grep '00:11.4'     
00:11.4 0106: 8086:8d62 (rev 05)

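8086 is the vendor ID and 8d62 is the device ID. The second column is the PCI class code: 0106 in this sample output denotes a SATA controller, whereas an Ethernet device shows 0200, but the vendor:device pair is read the same way.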
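If you want to pull these fields out programmatically, one line of lspci -n output can be split as follows. This is only a sketch: parseLspciLine is a hypothetical helper, and it assumes the default "address class: vendor:device" layout shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLspciLine splits one line of `lspci -n` output, for example
// "00:11.4 0106: 8086:8d62 (rev 05)", into its address, class, vendor and device IDs.
func parseLspciLine(line string) (addr, class, vendor, device string, err error) {
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return "", "", "", "", fmt.Errorf("unexpected lspci line: %q", line)
	}
	ids := strings.SplitN(fields[2], ":", 2)
	if len(ids) != 2 {
		return "", "", "", "", fmt.Errorf("unexpected vendor:device field: %q", fields[2])
	}
	return fields[0], strings.TrimSuffix(fields[1], ":"), ids[0], ids[1], nil
}

func main() {
	addr, class, vendor, device, err := parseLspciLine("00:11.4 0106: 8086:8d62 (rev 05)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("addr=%s class=%s vendor=%s device=%s\n", addr, class, vendor, device)
}
```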

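The entry point of intel/sriov-network-device-plugin is cmd/sriovdp/main.go.

It accepts two basic parameters:

config-file: the path of the configuration file.
resource-prefix: the prefix of the hardware resources registered with the apiserver, for example intel.com or netease.com.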
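A rough sketch of how two such flags could be parsed with Go's standard flag package is shown here. The cliParams type, the parseArgs and fullResourceName helpers, and the default values are assumptions made for illustration; check main.go for the real defaults. The one thing that is standard Kubernetes behavior is that an extended resource is advertised as prefix/name.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// cliParams is a hypothetical holder for the two command-line parameters.
type cliParams struct {
	configFile     string
	resourcePrefix string
}

// parseArgs parses the two flags. The default values here are assumptions.
func parseArgs(args []string) (cliParams, error) {
	var p cliParams
	fs := flag.NewFlagSet("sriovdp", flag.ContinueOnError)
	fs.StringVar(&p.configFile, "config-file", "/etc/pcidp/config.json", "JSON file describing the resource pools")
	fs.StringVar(&p.resourcePrefix, "resource-prefix", "intel.com", "prefix for resource names registered with kubelet")
	err := fs.Parse(args)
	return p, err
}

// fullResourceName joins prefix and name the way extended resources are named in Kubernetes.
func fullResourceName(prefix, name string) string {
	return prefix + "/" + name
}

func main() {
	p, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("config file:", p.configFile)
	fmt.Println("example resource:", fullResourceName(p.resourcePrefix, "intel_sriov_netdevice"))
}
```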

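The main function calls four main functions in turn: newResourceManager, discoverHostDevices, initServers, and startAllServers.

The resourceManager is the core logic component of the device plugin. newResourceManager initializes an empty resourceManager, then readConfig and validConfigs are called in turn to load the configuration. The configuration is an array of ResourceConfig entries, defined as: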

type ResourceConfig struct {
    ResourceName string `json:"resourceName"` // the resource name will be added with resource prefix in K8s api
    IsRdma       bool   // the resource support rdma
    Selectors    struct {
        Vendors   []string `json:"vendors,omitempty"`
        Devices   []string `json:"devices,omitempty"`
        Drivers   []string `json:"drivers,omitempty"`
        PfNames   []string `json:"pfNames,omitempty"`
        LinkTypes []string `json:"linkTypes,omitempty"`
    } `json:"selectors,omitempty"` // Whether devices have SRIOV virtual function capabilities or not
}
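To see how config.json maps onto this struct, here is a small self-contained decoding sketch using encoding/json. The resourceConfList wrapper and parseConfig are hypothetical names introduced for the example, and the plugin's own readConfig may do more (for instance validation).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResourceConfig mirrors the struct shown above.
type ResourceConfig struct {
	ResourceName string `json:"resourceName"`
	IsRdma       bool   // no tag: encoding/json still matches the "isRdma" key case-insensitively
	Selectors    struct {
		Vendors   []string `json:"vendors,omitempty"`
		Devices   []string `json:"devices,omitempty"`
		Drivers   []string `json:"drivers,omitempty"`
		PfNames   []string `json:"pfNames,omitempty"`
		LinkTypes []string `json:"linkTypes,omitempty"`
	} `json:"selectors,omitempty"`
}

// resourceConfList is a hypothetical wrapper for the top-level "resourceList" key.
type resourceConfList struct {
	ResourceList []ResourceConfig `json:"resourceList"`
}

// parseConfig decodes the raw JSON into a slice of ResourceConfig.
func parseConfig(raw []byte) ([]ResourceConfig, error) {
	var list resourceConfList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, fmt.Errorf("parsing config: %w", err)
	}
	return list.ResourceList, nil
}

const sampleConfig = `{
  "resourceList": [{
    "resourceName": "mlnx_sriov_rdma",
    "isRdma": true,
    "selectors": {"vendors": ["15b3"], "devices": ["1018"], "drivers": ["mlx5_ib"]}
  }]
}`

func main() {
	configs, err := parseConfig([]byte(sampleConfig))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, c := range configs {
		fmt.Printf("%s rdma=%v vendors=%v drivers=%v\n",
			c.ResourceName, c.IsRdma, c.Selectors.Vendors, c.Selectors.Drivers)
	}
}
```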

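The plugin uses github.com/jaypipes/ghw to list every PCI device on the machine, then checks and filters them step by step until only the VF devices remain. Each of them is recorded as a types.PciNetDevice:

the PCI device must be a network device -> the machine must have no default route through that NIC -> it is an SR-IOV VF -> it is appended to resourceManager.netDeviceList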
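The filter chain above can be written down as a simple loop. This sketch deliberately avoids ghw and sysfs so that it runs anywhere: the pciDevice struct and its precomputed HasDefaultRoute and IsVF fields are hypothetical, standing in for the checks the plugin performs on the host.

```go
package main

import "fmt"

// pciDevice is a hypothetical record with the three facts the filter needs.
type pciDevice struct {
	Address         string
	Class           string // two-digit PCI base class; "02" is a network controller
	HasDefaultRoute bool   // true if the host's default route goes through this NIC
	IsVF            bool   // true if the device is an SR-IOV virtual function
}

const netClass = "02"

// discoverVFs keeps only network-class VFs that do not carry the default route.
func discoverVFs(all []pciDevice) []pciDevice {
	var vfs []pciDevice
	for _, d := range all {
		if d.Class != netClass {
			continue // not a network device
		}
		if d.HasDefaultRoute {
			continue // never hand out the NIC the node itself depends on
		}
		if !d.IsVF {
			continue // PFs and non-SR-IOV NICs are skipped
		}
		vfs = append(vfs, d)
	}
	return vfs
}

func main() {
	devices := []pciDevice{
		{Address: "0000:00:11.4", Class: "01"},
		{Address: "0000:04:00.0", Class: "02", HasDefaultRoute: true},
		{Address: "0000:04:00.1", Class: "02"},
		{Address: "0000:04:05.1", Class: "02", IsVF: true},
	}
	for _, vf := range discoverVFs(devices) {
		fmt.Println("VF:", vf.Address)
	}
}
```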

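The config-file mentioned above can define several groups of configuration. Each group is one resourceName and maps to one device-plugin server. For each group, the following happens:

Build the resource pool, resourcePool. The Selectors written in the configuration file are applied to resourceManager.netDeviceList (the filter logic of each selector is in pkg/resources/deviceSelectors.go) to obtain the VF devices of the wanted type, and the result can additionally be narrowed to RDMA devices according to the configuration. The resourcePool struct contains a devicePool, which is where the devices are actually stored.
Initialize a resourceServer. Following our configuration, a resourceServer is created that holds the actual resource pool (resourcePool), the gRPC server, the resourcePrefix, and so on: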

return &resourceServer{
        resourcePool:       rp,
        pluginWatch:        pluginWatch,
        endPoint:           sockName,
        sockPath:           sockPath,
        resourceNamePrefix: prefix,
        grpcServer:         grpc.NewServer(),
        termSignal:         make(chan bool, 1),
        updateSignal:       make(chan bool),
        stopWatcher:        make(chan bool),
        checkIntervals:     20, // updates every 20 seconds
    }

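The resourceServer's Start method is then called to run the device plugin. The device plugin first sends a Register request to kubelet. On receiving it, kubelet starts a gRPC client and connects to the grpcServer inside the resourceServer.

kubelet then calls ListAndWatch remotely, and calls Allocate whenever a resource needs to be assigned.

The configuration file provided by Intel does not cover every SR-IOV device. For example, with Mellanox 5-series hardware this ConfigMap cannot identify the VFs. As described above, we can use lspci on the machine to determine the VF's vendor ID, device ID, and driver, then add or modify ConfigMap entries so that the hardware gets collected. For example: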

# lspci  -n |grep 04:05.1
04:05.1 0200: 15b3:1016
# ls -l /sys/class/pci_bus/0000:04/device/0000:04:05.1  |grep driver
lrwxrwxrwx 1 root root       0 Nov 20 11:09 driver -> ../../../../bus/pci/drivers/mlx5_core
-rw-r--r-- 1 root root    4096 Nov 20 11:09 driver_override

## change configmap yaml file:
...
            {
                "resourceName": "mlnx_sriov_rdma",
                "isRdma": true,
                "selectors": {"vendors": ["15b3"],
                    "devices": ["1016"],
                    "drivers": ["mlx5_core"]
                }
            }
            ...
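The same three values can also be read straight from sysfs, which is handy when you want to script the lookup. The sketch below is an illustration under stated assumptions: pciIdentity, readHexID, readPCIIdentity and buildFakeSysfs are hypothetical helpers, and the example runs against a throwaway directory tree shaped like sysfs so that it works on any machine. On a real host you would pass "/sys" as the root.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// pciIdentity holds the three values the selectors match against.
type pciIdentity struct {
	Vendor, Device, Driver string
}

// readHexID reads a sysfs attribute such as "0x15b3\n" and returns "15b3".
func readHexID(path string) (string, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimPrefix(strings.TrimSpace(string(raw)), "0x"), nil
}

// readPCIIdentity looks under <sysfsRoot>/bus/pci/devices/<pciAddr>.
func readPCIIdentity(sysfsRoot, pciAddr string) (pciIdentity, error) {
	dir := filepath.Join(sysfsRoot, "bus", "pci", "devices", pciAddr)
	var id pciIdentity
	var err error
	if id.Vendor, err = readHexID(filepath.Join(dir, "vendor")); err != nil {
		return id, err
	}
	if id.Device, err = readHexID(filepath.Join(dir, "device")); err != nil {
		return id, err
	}
	// "driver" is a symlink; the last element of its target is the driver name.
	target, err := os.Readlink(filepath.Join(dir, "driver"))
	if err != nil {
		return id, err
	}
	id.Driver = filepath.Base(target)
	return id, nil
}

// buildFakeSysfs creates a temporary tree that imitates the sysfs layout (demo only).
func buildFakeSysfs(pciAddr, vendor, device, driver string) (string, error) {
	root, err := os.MkdirTemp("", "fakesys")
	if err != nil {
		return "", err
	}
	dir := filepath.Join(root, "bus", "pci", "devices", pciAddr)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(dir, "vendor"), []byte("0x"+vendor+"\n"), 0o644); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(dir, "device"), []byte("0x"+device+"\n"), 0o644); err != nil {
		return "", err
	}
	link := filepath.Join("..", "..", "..", "..", "bus", "pci", "drivers", driver)
	return root, os.Symlink(link, filepath.Join(dir, "driver"))
}

func main() {
	root, err := buildFakeSysfs("0000:04:05.1", "15b3", "1016", "mlx5_core")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(root)

	id, err := readPCIIdentity(root, "0000:04:05.1")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("vendors=[%q] devices=[%q] drivers=[%q]\n", id.Vendor, id.Device, id.Driver)
}
```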

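ListAndWatch returns all devices in the resourcePool to kubelet as an array, then starts listening for the termination signal and the update signal. Whenever the devices change, the refreshed device array is sent to kubelet again.

kubelet keeps the device array in memory.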
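The send-once-then-resend-on-update pattern can be modeled with plain channels. This is a simplified sketch and not the plugin's gRPC handler: the device type, listAndWatch, and runDemo are hypothetical, and a channel reader plays the role of kubelet.

```go
package main

import "fmt"

// device is a hypothetical, minimal device record.
type device struct {
	ID      string
	Healthy bool
}

// listAndWatch sends the current device list once, then again on every update
// signal, until a termination signal arrives.
func listAndWatch(list func() []device, update, term <-chan bool, send func([]device)) {
	send(list())
	for {
		select {
		case <-term:
			return
		case <-update:
			send(list())
		}
	}
}

// runDemo drives listAndWatch through one update and returns every snapshot
// that the receiving side (standing in for kubelet) got.
func runDemo() [][]device {
	pool := []device{{ID: "0000:04:05.1", Healthy: true}}
	update, term := make(chan bool), make(chan bool)
	out := make(chan []device)
	done := make(chan struct{})

	go func() {
		defer close(done)
		listAndWatch(
			func() []device { return append([]device(nil), pool...) }, // copy the pool
			update, term,
			func(d []device) { out <- d },
		)
	}()

	var received [][]device
	received = append(received, <-out) // initial list

	pool = append(pool, device{ID: "0000:04:05.2", Healthy: true})
	update <- true
	received = append(received, <-out) // refreshed list

	term <- true
	<-done
	return received
}

func main() {
	for i, snapshot := range runDemo() {
		fmt.Printf("snapshot %d: %d device(s)\n", i, len(snapshot))
	}
}
```

The unbuffered channels make the ordering explicit: the pool is only modified after the previous snapshot has been received, so the sketch has no data race.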

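For Allocate, kubelet tracks which containers are using which devices, picks an available device, and requests it from the device plugin. On receiving the request, the resourceServer packages the docker options needed to use that device and returns them. The options include: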

Devices
Envs
Mounts

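From the analysis above, the resources recorded in the resourcePool are ultimately types.PciNetDevice resources. So to find out what data is actually returned to kubelet, we have to look at github.com/intel/sriov-network-device-plugin/pkg/resources/pciNetDevice.go.

func (nd *pciNetDevice) GetDeviceSpecs() and func (nd *pciNetDevice) GetMounts() simply return fields of the resource struct, and in func NewPciNetDevice we see that those fields are assigned differently depending on the driver: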

func NewPciNetDevice(pciDevice *ghw.PCIDevice, rFactory types.ResourceFactory) (types.PciNetDevice, error) {
  pciAddr := pciDevice.Address
  driverName, err := utils.GetDriverName(pciAddr)
  ...
  // 3. Get Device file info (e.g., uio, vfio specific)
  // Get DeviceInfoProvider using device driver
  infoProvider := rFactory.GetInfoProvider(driverName)
  dSpecs := infoProvider.GetDeviceSpecs(pciAddr)
  mnt := infoProvider.GetMounts(pciAddr)
  env := infoProvider.GetEnvVal(pciAddr)
  ...
}

In factory.go we see:

func (rf *resourceFactory) GetInfoProvider(name string) types.DeviceInfoProvider {
    switch name {
    case "vfio-pci":
        return newVfioResourcePool()
    case "uio", "igb_uio":
        return newUioResourcePool()
    default:
        return newNetDevicePool()
    }
}

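The meaning is clear: different PCI devices are bound to different drivers, and reading the driver name tells the plugin which kind of VF it is dealing with (vfio, uio, or an ordinary kernel network device). The matching info provider then assembles the corresponding docker options.

The drivers supported by the Intel plugin are:

vfio-pci. See pkg/resources/vfioPool.go
uio or igb_uio. See pkg/resources/uioPool.go
Everything else. See pkg/resources/netDevicePool.go (users can extend this themselves; anything that implements GetDeviceSpecs, GetEnvVal, and GetMounts can be used)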
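As an illustration of that extension point, here is a sketch of a custom info provider. Everything in it is hypothetical: DeviceSpec and Mount are simplified local stand-ins for the kubelet device plugin API messages, the method signatures of DeviceInfoProvider are assumed from the calls shown in NewPciNetDevice, and "my_driver" with its /dev/mydrv/ path is an imaginary driver.

```go
package main

import "fmt"

// DeviceSpec and Mount are simplified local stand-ins for the kubelet device
// plugin API messages, so the sketch compiles without external modules.
type DeviceSpec struct {
	HostPath, ContainerPath, Permissions string
}

type Mount struct {
	HostPath, ContainerPath string
	ReadOnly                bool
}

// DeviceInfoProvider lists the three methods named above; the exact
// signatures are assumptions.
type DeviceInfoProvider interface {
	GetDeviceSpecs(pciAddr string) []*DeviceSpec
	GetEnvVal(pciAddr string) string
	GetMounts(pciAddr string) []*Mount
}

// charDevProvider is a hypothetical provider for an imaginary driver that
// exposes one character device per VF under /dev/mydrv/.
type charDevProvider struct{}

func (charDevProvider) GetDeviceSpecs(pciAddr string) []*DeviceSpec {
	path := "/dev/mydrv/" + pciAddr
	return []*DeviceSpec{{HostPath: path, ContainerPath: path, Permissions: "rw"}}
}

func (charDevProvider) GetEnvVal(pciAddr string) string   { return pciAddr }
func (charDevProvider) GetMounts(pciAddr string) []*Mount { return nil }

// plainProvider is a hypothetical fallback that exposes no device files or mounts.
type plainProvider struct{}

func (plainProvider) GetDeviceSpecs(pciAddr string) []*DeviceSpec { return nil }
func (plainProvider) GetEnvVal(pciAddr string) string             { return pciAddr }
func (plainProvider) GetMounts(pciAddr string) []*Mount           { return nil }

// getInfoProvider picks a provider by driver name, in the style of the factory above.
func getInfoProvider(driver string) DeviceInfoProvider {
	switch driver {
	case "my_driver": // hypothetical driver name
		return charDevProvider{}
	default:
		return plainProvider{}
	}
}

func main() {
	p := getInfoProvider("my_driver")
	for _, spec := range p.GetDeviceSpecs("0000:04:05.1") {
		fmt.Printf("device %s -> %s (%s)\n", spec.HostPath, spec.ContainerPath, spec.Permissions)
	}
	fmt.Println("env value:", p.GetEnvVal("0000:04:05.1"))
}
```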

Learning goals

Configuration file

Entry point and parameters

Step 1: newResourceManager

Step 2: discoverHostDevices

Step 3: initServers

Step 4: startAllServers

Customizing the configuration file

What the device plugin does

ListAndWatch

Allocate
