Author | KIDGINBROOK
Updated by | 潘丽晨
The previous article covered how every node sets up its bootstrap network connections; this article walks through topology detection.
GPU machines come in many different configurations: a single machine may have multiple NICs and multiple GPUs, and the links between them vary. NCCL therefore has to analyze the device interconnect topology inside each machine so that performance is as good as possible on every configuration.
Continuing from last time, let's look at initTransportsRank.
static ncclResult_t initTransportsRank(struct ncclComm* comm, ncclUniqueId* commId) {
  // We use 3 AllGathers
  // 1. { peerInfo, comm }
  // 2. ConnectTransport[nranks], ConnectValue[nranks]
  // 3. { nThreads, nrings, compCap, prev[MAXCHANNELS], next[MAXCHANNELS] }
  int rank = comm->rank;
  int nranks = comm->nRanks;
  uint64_t commHash = getHash(commId->internal, NCCL_UNIQUE_ID_BYTES);
  TRACE(NCCL_INIT, "comm %p, commHash %lx, rank %d nranks %d - BEGIN", comm, commHash, rank, nranks);
  NCCLCHECK(bootstrapInit(commId, rank, nranks, &comm->bootstrap));

  // AllGather1 - begin
  struct {
    struct ncclPeerInfo peerInfo;
    struct ncclComm* comm;
  } *allGather1Data;

  NCCLCHECK(ncclCalloc(&allGather1Data, nranks));
  allGather1Data[rank].comm = comm;
  struct ncclPeerInfo* myInfo = &allGather1Data[rank].peerInfo;
  NCCLCHECK(fillInfo(comm, myInfo, commHash));
  ...
}
nranks allGather1Data entries are allocated, and fillInfo fills in the peerInfo of the current rank. ncclPeerInfo holds basic information about a rank, such as its rank number and which process on which host it belongs to.
struct ncclPeerInfo {
  int rank;
  int cudaDev;
  int gdrSupport;
  uint64_t hostHash;
  uint64_t pidHash;
  dev_t shmDev;
  int64_t busId;
};

static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, uint64_t commHash) {
  info->rank = comm->rank;
  CUDACHECK(cudaGetDevice(&info->cudaDev));
  info->hostHash=getHostHash()+commHash;
  info->pidHash=getPidHash()+commHash;

  // Get the device MAJOR:MINOR of /dev/shm so we can use that
  // information to decide whether we can use SHM for inter-process
  // communication in a container environment
  struct stat statbuf;
  SYSCHECK(stat("/dev/shm", &statbuf), "stat");
  info->shmDev = statbuf.st_dev;

  info->busId = comm->busId;

  NCCLCHECK(ncclGpuGdrSupport(&info->gdrSupport));
  return ncclSuccess;
}
This fills ncclPeerInfo with the current GPU's rank, its PCIe busId, and the device number of /dev/shm, and then calls ncclGpuGdrSupport to check whether GDR is supported. RDMA requires memory to be registered before communication so that the NIC knows the virtual-to-physical address mapping; if every transfer had to copy data from device memory to host memory first, efficiency would suffer.
InfiniBand provides the peer memory interface, which lets the IB NIC access other PCIe address spaces. NVIDIA implements its own driver on top of peer memory so that RDMA can register device memory directly. This way communication avoids host/device copies: the IB NIC can DMA device memory directly, which is what GDR (GPUDirect RDMA) means.
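To make this concrete, here is a standalone sketch (not NCCL's code path, which goes through its net plugin as shown next) of what "registering device memory with the NIC" looks like with raw ibverbs and CUDA. It assumes libibverbs, the CUDA runtime, and at least one IB device are present; ibv_reg_mr on a cudaMalloc'd pointer only succeeds when the nv_peermem driver is loaded.

#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
  int num = 0;
  struct ibv_device** devs = ibv_get_device_list(&num);   // enumerate IB devices
  if (devs == NULL || num == 0) { printf("no IB devices\n"); return 1; }
  struct ibv_context* ctx = ibv_open_device(devs[0]);
  struct ibv_pd* pd = ibv_alloc_pd(ctx);

  void* gpuPtr = NULL;
  cudaMalloc(&gpuPtr, 1 << 20);                            // 1MB of device memory

  // Try to register the GPU buffer directly with the NIC; this is the GDR check.
  struct ibv_mr* mr = ibv_reg_mr(pd, gpuPtr, 1 << 20,
      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);
  printf("GPU memory registration %s\n", mr ? "succeeded (GDR available)" : "failed");

  if (mr) ibv_dereg_mr(mr);
  cudaFree(gpuPtr);
  ibv_dealloc_pd(pd);
  ibv_close_device(ctx);
  ibv_free_device_list(devs);
  return 0;
}

NCCL's own check, ncclGpuGdrSupport, performs the equivalent test through its net plugin: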
static ncclResult_t ncclGpuGdrSupport(int* gdrSupport) {
  int netDevs;
  NCCLCHECK(ncclNetDevices(&netDevs));
  *gdrSupport = 0;
  for (int dev=0; dev<netDevs; dev++) {
    // Find a net device which is GDR-capable
    ncclNetProperties_t props;
    NCCLCHECK(ncclNet->getProperties(dev, &props));
    if ((props.ptrSupport & NCCL_PTR_CUDA) == 0) continue;

    // Allocate memory on the GPU and try to register it on the NIC.
    void *lComm = NULL, *sComm = NULL, *rComm = NULL;
    ncclNetHandle_t handle;
    void* gpuPtr = NULL;
    void* mHandle = NULL;
    NCCLCHECK(ncclNetListen(dev, &handle, &lComm));
    NCCLCHECK(ncclNetConnect(dev, &handle, &sComm));
    NCCLCHECK(ncclNetAccept(lComm, &rComm));
    CUDACHECK(cudaMalloc(&gpuPtr, GPU_BUF_SIZE));
    ncclDebugNoWarn = NCCL_NET;
    if (ncclNetRegMr(sComm, gpuPtr, GPU_BUF_SIZE, NCCL_PTR_CUDA, &mHandle) == ncclSuccess) {
      NCCLCHECK(ncclNetDeregMr(sComm, mHandle));
      NCCLCHECK(ncclNetRegMr(rComm, gpuPtr, GPU_BUF_SIZE, NCCL_PTR_CUDA, &mHandle));
      NCCLCHECK(ncclNetDeregMr(rComm, mHandle));
      *gdrSupport = 1;
    }
    ncclDebugNoWarn = 0;
    CUDACHECK(cudaFree(gpuPtr));
    NCCLCHECK(ncclNetCloseRecv(rComm));
    NCCLCHECK(ncclNetCloseSend(sComm));
    NCCLCHECK(ncclNetCloseListen(lComm));
    break;
  }
  return ncclSuccess;
}
This iterates over every NIC and queries its properties; as we saw in part one, ncclNet here is ncclNetIb.
ncclResult_t ncclIbGdrSupport(int ibDev) {
  static int moduleLoaded = -1;
  if (moduleLoaded == -1) {
    moduleLoaded = (access("/sys/kernel/mm/memory_peers/nv_mem/version", F_OK) == -1) ? 0 : 1;
  }
  if (moduleLoaded == 0) return ncclSystemError;
  return ncclSuccess;
}

ncclResult_t ncclIbGetProperties(int dev, ncclNetProperties_t* props) {
  props->name = ncclIbDevs[dev].devName;
  props->pciPath = ncclIbDevs[dev].pciPath;
  props->guid = ncclIbDevs[dev].guid;
  props->ptrSupport = NCCL_PTR_HOST;
  if (ncclIbGdrSupport(dev) != ncclSuccess) {
    INFO(NCCL_NET,"NET/IB : GPU Direct RDMA Disabled for HCA %d '%s' (no module)", dev, ncclIbDevs[dev].devName);
  } else {
    props->ptrSupport |= NCCL_PTR_CUDA;
  }
  props->speed = ncclIbDevs[dev].speed;
  props->port = ncclIbDevs[dev].port + ncclIbDevs[dev].realPort;
  props->maxComms = ncclIbDevs[dev].maxQp;
  return ncclSuccess;
}
This retrieves the NIC's name, PCIe path, GUID, and other properties, and then checks whether /sys/kernel/mm/memory_peers/nv_mem/version exists to determine whether nv_peermem (NVIDIA's peer memory driver) is installed. If it is, props->ptrSupport |= NCCL_PTR_CUDA is set, meaning device memory can be registered.
ncclGpuGdrSupport then tries to register device memory on the NIC; if registration succeeds, gdrSupport is set to 1. This actually establishes an RDMA connection under the hood, which will be covered separately later, so we skip the details for now.
static ncclResult_t initTransportsRank(struct ncclComm* comm, ncclUniqueId* commId) {
  ...
  NCCLCHECK(bootstrapAllGather(comm->bootstrap, allGather1Data, sizeof(*allGather1Data)));

  NCCLCHECK(ncclCalloc(&comm->peerInfo, nranks+1)); // Extra rank to represent CollNet root
  for (int i = 0; i < nranks; i++) {
    memcpy(comm->peerInfo+i, &allGather1Data[i].peerInfo, sizeof(struct ncclPeerInfo));
    if ((i != rank) && (comm->peerInfo[i].hostHash == myInfo->hostHash) && (comm->peerInfo[i].busId == myInfo->busId)) {
      WARN("Duplicate GPU detected : rank %d and rank %d both on CUDA device %x", rank, i, myInfo->busId);
      return ncclInvalidUsage;
    }
  }
  // AllGather1 data is used again below
  // AllGather1 - end

  // Topo detection / System graph creation
  NCCLCHECK(ncclTopoGetSystem(comm, &comm->topo));
  ...
}
bootstrapAllGather then exchanges allGather1Data among all ranks, and the peerInfo received from the other ranks is copied into comm.
Before looking at the actual topology detection flow, let's briefly go over a few PCIe concepts, using a simple PCIe system as an example.
Each CPU has its own root complex (RC for short). The RC connects the CPU to the rest of the system, such as memory and the PCIe hierarchy: when the CPU issues a physical address that falls inside the PCIe address space, the RC translates it into a PCIe request.
A switch's job is to fan out PCIe ports; devices or further switches can be attached below it, and it forwards requests arriving from above down to the right port. A PCIe device can therefore hang off either the RC or a switch. Internally, a switch contains a PCIe bus that is expanded into multiple ports through several bridges: the port facing the RC is called the upstream port, and the others are downstream ports.
As mentioned earlier, busId is a very common variable name in NCCL, used for GPUs and IB NICs alike. Note that NCCL's busId is not the PCI bus number; it is the identifier used to locate a PCIe device, i.e. the BDF (bus + device + function). A bus carries multiple devices and a device exposes multiple functions, so a BDF uniquely identifies a device. After the machine finishes PCIe enumeration at boot, the relevant information is exposed to user space through sysfs, and sysfs is exactly what NCCL uses for topology detection.
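As a small standalone illustration (the busId value is hypothetical), the sketch below breaks a BDF string into its fields and reads the device's class from sysfs, which is the kind of attribute NCCL will later store in its XML tree:

#include <stdio.h>

int main() {
  const char* busId = "0000:17:00.0";   // hypothetical BDF: domain:bus:device.function
  unsigned domain, bus, dev, func;
  sscanf(busId, "%x:%x:%x.%x", &domain, &bus, &dev, &func);
  printf("domain=%04x bus=%02x device=%02x function=%x\n", domain, bus, dev, func);

  // PCI enumeration results live under sysfs; the "class" file gives the device type
  // (0x03xxxx = display controller, 0x06xxxx = bridge, 0x02xxxx = network controller).
  char classPath[128];
  snprintf(classPath, sizeof(classPath), "/sys/bus/pci/devices/%s/class", busId);
  FILE* f = fopen(classPath, "r");
  if (f) {
    char cls[32];
    if (fgets(cls, sizeof(cls), f)) printf("class of %s: %s", busId, cls);
    fclose(f);
  }
  return 0;
}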
Now let's look at ncclTopoGetSystem, the focus of this article. It builds the PCI tree of the current rank in two steps: first the whole PCI tree is represented as XML, then the XML is converted into ncclTopoNode structures. The XML types are defined below; one ncclXmlNode represents one node of the PCI tree.
struct ncclXmlNode {
  char name[MAX_STR_LEN];
  struct {
    char key[MAX_STR_LEN];
    char value[MAX_STR_LEN];
  } attrs[MAX_ATTR_COUNT+1]; // Need an extra one to consume extra params
  int nAttrs;
  int type;
  struct ncclXmlNode* parent;
  struct ncclXmlNode* subs[MAX_SUBS];
  int nSubs;
};

struct ncclXml {
  struct ncclXmlNode nodes[MAX_NODES];
  int maxIndex;
};
ncclXmlNode represents a single node and records its parent and all of its children. A node has a name and attributes; attributes are set via xmlSetAttr.
ncclXml pre-allocates all the nodes, with maxIndex recording how many have been handed out so far. Here is a brief look at a few of the XML-related APIs.
static ncclResult_t xmlAddNode(struct ncclXml* xml, struct ncclXmlNode* parent, const char* subName, struct ncclXmlNode** sub);
xmlAddNode allocates a new node sub inside xml, sets its name to subName, and sets its parent to parent.
static ncclResult_t xmlFindTagKv(struct ncclXml* xml, const char* tagName, struct ncclXmlNode** node, const char* attrName, const char* attrValue)
xmlFindTagKv walks the nodes already allocated in xml, finds a node n whose name is tagName, and checks whether n["attrName"] equals attrValue; if so, node is set to n.
static ncclResult_t xmlGetAttrIndex(struct ncclXmlNode* node, const char* attrName, int* index)
xmlGetAttrIndex looks up which attribute slot of node holds attrName (returning -1 via index if it is absent).
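A minimal sketch of how these helpers fit together, assuming it is compiled inside NCCL where xmlAddNode, xmlSetAttr, xmlFindTagKv, xmlGetAttrIndex, and NCCLCHECK are in scope (the busid value is hypothetical; this is not code from the NCCL source):

static ncclResult_t buildTinyTopo(struct ncclXml* xml) {
  struct ncclXmlNode* top;
  NCCLCHECK(xmlAddNode(xml, NULL, "system", &top));      // root node "system"
  struct ncclXmlNode* pci;
  NCCLCHECK(xmlAddNode(xml, top, "pci", &pci));          // child node "pci"
  NCCLCHECK(xmlSetAttr(pci, "busid", "0000:17:00.0"));   // hypothetical busid attribute

  // Later, the node can be found again by attribute instead of keeping a pointer around.
  struct ncclXmlNode* found;
  NCCLCHECK(xmlFindTagKv(xml, "pci", &found, "busid", "0000:17:00.0"));
  int index;
  NCCLCHECK(xmlGetAttrIndex(found, "busid", &index));    // index of "busid", -1 if absent
  return ncclSuccess;
}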
Now let's walk through the topology detection process.
ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** system) {
  struct ncclXml* xml;
  NCCLCHECK(ncclCalloc(&xml, 1));
  char* xmlTopoFile = getenv("NCCL_TOPO_FILE");
  if (xmlTopoFile) {
    INFO(NCCL_ENV, "NCCL_TOPO_FILE set by environment to %s", xmlTopoFile);
    NCCLCHECK(ncclTopoGetXmlFromFile(xmlTopoFile, xml));
  }
  if (xml->maxIndex == 0) {
    // Create top tag
    struct ncclXmlNode* top;
    NCCLCHECK(xmlAddNode(xml, NULL, "system", &top));
    NCCLCHECK(xmlSetAttrInt(top, "version", NCCL_TOPO_XML_VERSION));
  }

  // Auto-detect GPUs if needed
  for (int r=0; r<comm->nRanks; r++) {
    if (comm->peerInfo[r].hostHash == comm->peerInfo[comm->rank].hostHash) {
      char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
      NCCLCHECK(int64ToBusId(comm->peerInfo[r].busId, busId));
      struct ncclXmlNode* node;
      NCCLCHECK(ncclTopoFillGpu(xml, busId, &node));
      if (node == NULL) continue;
      NCCLCHECK(xmlSetAttrInt(node, "rank", r));
      NCCLCHECK(xmlInitAttrInt(node, "gdr", comm->peerInfo[r].gdrSupport));
    }
  }
  ...
}
First, xmlAddNode creates the root node "system" (double quotes are used below to denote XML tree nodes), and the root attribute is set: "system"["version"] = NCCL_TOPO_XML_VERSION. Then every rank's hostHash is compared with the current rank's; if they match, the two ranks are on the same machine, and ncclTopoFillGpu is executed to add that GPU to the XML tree.
ncclResult_t ncclTopoFillGpu(struct ncclXml* xml, const char* busId, struct ncclXmlNode** gpuNode) {
  struct ncclXmlNode* node;
  NCCLCHECK(ncclTopoGetPciNode(xml, busId, &node));
  NCCLCHECK(ncclTopoGetXmlFromSys(node, xml));
  ...
}
ncclResult_t ncclTopoGetPciNode(struct ncclXml* xml, const char* busId, struct ncclXmlNode** pciNode) {
  NCCLCHECK(xmlFindTagKv(xml, "pci", pciNode, "busid", busId));
  if (*pciNode == NULL) {
    NCCLCHECK(xmlAddNode(xml, NULL, "pci", pciNode));
  }
  NCCLCHECK(xmlSetAttr(*pciNode, "busid", busId));
  return ncclSuccess;
}
ncclTopoGetPciNode checks whether the XML already contains a node for the current GPU; since it does not yet, a new XML node named "pci" is created to represent the GPU, with "pci"["busid"] = busId.
Then ncclTopoGetXmlFromSys is executed. Its main job is to obtain the path from the GPU up to the CPU in sysfs, convert that path into the XML tree, and read the attributes along the path into the XML.
ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml* xml) {
  // Fill info, then parent
  const char* busId;
  NCCLCHECK(xmlGetAttr(pciNode, "busid", &busId));
  char* path = NULL;
  int index;
  NCCLCHECK(xmlGetAttrIndex(pciNode, "class", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    NCCLCHECK(ncclTopoSetAttrFromSys(pciNode, path, "class", "class"));
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "link_speed", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    char deviceSpeedStr[MAX_STR_LEN];
    float deviceSpeed;
    NCCLCHECK(ncclTopoGetStrFromSys(path, "max_link_speed", deviceSpeedStr));
    sscanf(deviceSpeedStr, "%f GT/s", &deviceSpeed);
    char portSpeedStr[MAX_STR_LEN];
    float portSpeed;
    NCCLCHECK(ncclTopoGetStrFromSys(path, "../max_link_speed", portSpeedStr));
    sscanf(portSpeedStr, "%f GT/s", &portSpeed);
    NCCLCHECK(xmlSetAttr(pciNode, "link_speed", portSpeed < deviceSpeed ? portSpeedStr : deviceSpeedStr));
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "link_width", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    char strValue[MAX_STR_LEN];
    NCCLCHECK(ncclTopoGetStrFromSys(path, "max_link_width", strValue));
    int deviceWidth = strtol(strValue, NULL, 0);
    NCCLCHECK(ncclTopoGetStrFromSys(path, "../max_link_width", strValue));
    int portWidth = strtol(strValue, NULL, 0);
    NCCLCHECK(xmlSetAttrInt(pciNode, "link_width", std::min(deviceWidth,portWidth)));
  }
  ...
}
It first fills in the various attributes of pciNode. getPciPath returns the sysfs path corresponding to busId; that path is essentially the root-to-leaf path in the PCI tree.
static ncclResult_t getPciPath(const char* busId, char** path) {
  char busPath[] = "/sys/class/pci_bus/0000:00/../../0000:00:00.0";
  memcpylower(busPath+sizeof("/sys/class/pci_bus/")-1, busId, BUSID_REDUCED_SIZE-1);
  memcpylower(busPath+sizeof("/sys/class/pci_bus/0000:00/../../")-1, busId, BUSID_SIZE-1);
  *path = realpath(busPath, NULL);
  if (*path == NULL) {
    WARN("Could not find real path of %s", busPath);
    return ncclSystemError;
  }
  return ncclSuccess;
}
For example, path might be /sys/devices/pci0000:10/0000:10:00.0/0000:11:00.0/0000:12:00.0/0000:13:00.0/0000:14:00.0/0000:15:00.0/0000:16:00.0/0000:17:00.0, where the GPU's busId is 0000:17:00.0. Each intermediate BDF on the path is a switch bridge between the root complex (pci0000:10) and the GPU.
Then the attributes under path are read, such as class (the PCI device type), link_speed, and link_width, and stored into the XML pciNode. ncclTopoGetStrFromSys simply reads a file under the sysfs path into strValue.
ncclResult_t ncclTopoGetStrFromSys(const char* path, const char* fileName, char* strValue) {
  char filePath[PATH_MAX];
  sprintf(filePath, "%s/%s", path, fileName);
  int offset = 0;
  FILE* file;
  if ((file = fopen(filePath, "r")) != NULL) {
    while (feof(file) == 0 && ferror(file) == 0 && offset < MAX_STR_LEN) {
      int len = fread(strValue+offset, 1, MAX_STR_LEN-offset, file);
      offset += len;
    }
    fclose(file);
  }
  if (offset == 0) {
    strValue[0] = '\0';
    INFO(NCCL_GRAPH, "Topology detection : could not read %s, ignoring", filePath);
  } else {
    strValue[offset-1] = '\0';
  }
  return ncclSuccess;
}
ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml* xml) {
  // Fill info, then parent
  ...
  struct ncclXmlNode* parent = pciNode->parent;
  if (parent == NULL) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));

    // Save that for later in case next step is a CPU
    char numaIdStr[MAX_STR_LEN];
    NCCLCHECK(ncclTopoGetStrFromSys(path, "numa_node", numaIdStr));

    // Go up one level in the PCI tree. Rewind two "/" and follow the upper PCI
    // switch, or stop if we reach a CPU root complex.
    int slashCount = 0;
    int parentOffset;
    for (parentOffset = strlen(path)-1; parentOffset>0; parentOffset--) {
      if (path[parentOffset] == '/') {
        slashCount++;
        path[parentOffset] = '\0';
        int start = parentOffset - 1;
        while (start>0 && path[start] != '/') start--;
        // Check whether the parent path looks like "BBBB:BB:DD.F" or not.
        if (checkBDFFormat(path+start+1) == 0) {
          // This a CPU root complex. Create a CPU tag and stop there.
          struct ncclXmlNode* topNode;
          NCCLCHECK(xmlFindTag(xml, "system", &topNode));
          NCCLCHECK(xmlGetSubKv(topNode, "cpu", &parent, "numaid", numaIdStr));
          if (parent == NULL) {
            NCCLCHECK(xmlAddNode(xml, topNode, "cpu", &parent));
            NCCLCHECK(xmlSetAttr(parent, "numaid", numaIdStr));
          }
        } else if (slashCount == 2) {
          // Continue on the upper PCI switch
          for (int i = strlen(path)-1; i>0; i--) {
            if (path[i] == '/') {
              NCCLCHECK(xmlFindTagKv(xml, "pci", &parent, "busid", path+i+1));
              if (parent == NULL) {
                NCCLCHECK(xmlAddNode(xml, NULL, "pci", &parent));
                NCCLCHECK(xmlSetAttr(parent, "busid", path+i+1));
              }
              break;
            }
          }
        }
      }
      if (parent) break;
    }
    pciNode->parent = parent;
    parent->subs[parent->nSubs++] = pciNode;
  }
  if (strcmp(parent->name, "pci") == 0) {
    NCCLCHECK(ncclTopoGetXmlFromSys(parent, xml));
  } else if (strcmp(parent->name, "cpu") == 0) {
    NCCLCHECK(ncclTopoGetXmlFromCpu(parent, xml));
  }
  free(path);
  return ncclSuccess;
}
The code then walks upward from pciNode. Because a switch's upstream and downstream ports each correspond to a bridge, NCCL uses the busId of the upstream-port bridge to represent the whole switch, so it has to climb two levels before creating an XML node for that switch. Each path component stripped on the way up (i.e. each '/' encountered) increments slashCount.
When slashCount == 2, the switch's upstream port has been reached. A new XML "pci" node parent is created to represent this switch, and the current pciNode is linked under parent; note that parent is still an XML "pci" node at this point.
ncclTopoGetXmlFromSys therefore keeps recursing upward until it reaches the RC. At that point a "cpu" child node is created under "system", the recursion stops, and ncclTopoGetXmlFromCpu fills in the "cpu" attributes, such as arch (e.g. x86 vs arm), affinity (which CPU cores belong to this CPU's NUMA node), and numaid.
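The sketch below is a self-contained illustration of this upward walk over a sysfs path (it is not NCCL's code: checkBDFFormat is replaced by a simplified looksLikeBDF, the path is an abbreviated hypothetical one, and it lists every ancestor bridge instead of skipping one level per switch the way NCCL does):

#include <stdio.h>
#include <string.h>
#include <ctype.h>

// Rough check for "DDDD:BB:DD.F"; a stand-in for NCCL's checkBDFFormat.
static int looksLikeBDF(const char* s) {
  if (strlen(s) != 12) return 0;
  return isxdigit(s[0]) && isxdigit(s[1]) && isxdigit(s[2]) && isxdigit(s[3]) && s[4]==':' &&
         isxdigit(s[5]) && isxdigit(s[6]) && s[7]==':' &&
         isxdigit(s[8]) && isxdigit(s[9]) && s[10]=='.' && isxdigit(s[11]);
}

int main() {
  // Abbreviated hypothetical sysfs path from a GPU up to its root complex.
  char path[] = "/sys/devices/pci0000:10/0000:10:00.0/0000:11:00.0/0000:13:00.0/0000:17:00.0";
  char* slash;
  while ((slash = strrchr(path, '/')) != NULL) {
    *slash = '\0';                          // strip the last path component
    char* last = strrchr(path, '/');
    if (last == NULL) break;
    if (looksLikeBDF(last + 1)) {
      printf("ancestor bridge: %s\n", last + 1);
    } else {
      printf("reached root complex segment: %s\n", last + 1);  // e.g. "pci0000:10"
      break;
    }
  }
  return 0;
}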
That completes ncclTopoGetXmlFromSys; back to ncclTopoFillGpu.
ncclResult_t ncclTopoFillGpu(struct ncclXml* xml, const char* busId, struct ncclXmlNode** gpuNode) {
  ...
  NCCLCHECK(wrapNvmlSymbols());
  NCCLCHECK(wrapNvmlInit());
  nvmlDevice_t nvmlDev;
  if (wrapNvmlDeviceGetHandleByPciBusId(busId, &nvmlDev) != ncclSuccess) nvmlDev = NULL;
  NCCLCHECK(ncclTopoGetXmlFromGpu(node, nvmlDev, xml, gpuNode));
  return ncclSuccess;
}
wrapNvmlSymbols loads the dynamic library libnvidia-ml.so.1, which is used to query GPU information.
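For reference, a minimal sketch of what loading NVML dynamically looks like; this is a simplification, not NCCL's wrapNvmlSymbols, which resolves many more symbols and wraps each of them in ncclResult_t helpers:

#include <dlfcn.h>
#include <stdio.h>

int main() {
  void* handle = dlopen("libnvidia-ml.so.1", RTLD_NOW);
  if (handle == NULL) { printf("dlopen failed: %s\n", dlerror()); return 1; }

  // Resolve one symbol as an example; NVML's init entry point is nvmlInit_v2.
  typedef int (*nvmlInitFn)(void);
  nvmlInitFn init = (nvmlInitFn)dlsym(handle, "nvmlInit_v2");
  printf("nvmlInit_v2 %s\n", init ? "resolved" : "not found");

  dlclose(handle);
  return 0;
}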
ncclResult_t ncclTopoGetXmlFromGpu(struct ncclXmlNode* pciNode, nvmlDevice_t nvmlDev, struct ncclXml* xml, struct ncclXmlNode** gpuNodeRet) {
  struct ncclXmlNode* gpuNode = NULL;
  NCCLCHECK(xmlGetSub(pciNode, "gpu", &gpuNode));
  if (gpuNode == NULL) NCCLCHECK(xmlAddNode(xml, pciNode, "gpu", &gpuNode));

  int index = -1;

  int dev = -1;
  NCCLCHECK(xmlGetAttrIndex(gpuNode, "dev", &index));
  if (index == -1) {
    if (nvmlDev == NULL) {
      WARN("No NVML, trying to use CUDA instead");
      const char* busId;
      NCCLCHECK(xmlGetAttr(pciNode, "busid", &busId));
      if (busId == NULL || cudaDeviceGetByPCIBusId(&dev, busId) != cudaSuccess) dev = -1;
    } else {
      NCCLCHECK(wrapNvmlDeviceGetIndex(nvmlDev, (unsigned int*)&dev));
    }
    NCCLCHECK(xmlSetAttrInt(gpuNode, "dev", dev));
  }
  NCCLCHECK(xmlGetAttrInt(gpuNode, "dev", &dev));
  if (dev == -1) { *gpuNodeRet = NULL; return ncclSuccess; }

  NCCLCHECK(xmlGetAttrIndex(gpuNode, "sm", &index));
  if (index == -1) {
    int cudaMajor, cudaMinor;
    if (nvmlDev == NULL) {
      cudaDeviceProp devProp;
      CUDACHECK(cudaGetDeviceProperties(&devProp, dev));
      cudaMajor = devProp.major; cudaMinor = devProp.minor;
    } else {
      NCCLCHECK(wrapNvmlDeviceGetCudaComputeCapability(nvmlDev, &cudaMajor, &cudaMinor));
    }
    NCCLCHECK(xmlSetAttrInt(gpuNode, "sm", cudaMajor*10+cudaMinor));
  }
  int sm;
  NCCLCHECK(xmlGetAttrInt(gpuNode, "sm", &sm));

  struct ncclXmlNode* nvlNode = NULL;
  NCCLCHECK(xmlGetSub(pciNode, "nvlink", &nvlNode));
  if (nvlNode == NULL) {
    // NVML NVLink detection
    int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : 12;

    if (maxNvLinks > 0 && nvmlDev == NULL) {
      WARN("No NVML device handle. Skipping nvlink detection.\n");
      maxNvLinks = 0;
    }

    for (int l=0; l<maxNvLinks; ++l) {
      // Check whether we can use this NVLink for P2P
      unsigned canP2P;
      if ((wrapNvmlDeviceGetNvLinkCapability(nvmlDev, l, NVML_NVLINK_CAP_P2P_SUPPORTED, &canP2P) != ncclSuccess) || !canP2P) continue;

      // Make sure the Nvlink is up. The previous call should have trained the link.
      nvmlEnableState_t isActive;
      if ((wrapNvmlDeviceGetNvLinkState(nvmlDev, l, &isActive) != ncclSuccess) || (isActive != NVML_FEATURE_ENABLED)) continue;

      // Try to figure out what's on the other side of the NVLink
      nvmlPciInfo_t remoteProc;
      if (wrapNvmlDeviceGetNvLinkRemotePciInfo(nvmlDev, l, &remoteProc) != ncclSuccess) continue;

      // Make a lower case copy of the bus ID for calling ncclDeviceType
      // PCI system path is in lower case
      char* p = remoteProc.busId;
      char lowerId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
      for (int c=0; c<NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE; c++) {
        lowerId[c] = tolower(p[c]);
        if (p[c] == 0) break;
      }

      NCCLCHECK(xmlGetSubKv(gpuNode, "nvlink", &nvlNode, "target", lowerId));
      if (nvlNode == NULL) {
        NCCLCHECK(xmlAddNode(xml, gpuNode, "nvlink", &nvlNode));
        NCCLCHECK(xmlSetAttr(nvlNode, "target", lowerId));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", 1));
      } else {
        int count;
        NCCLCHECK(xmlGetAttrInt(nvlNode, "count", &count));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", count+1));
      }
    }
  }
  // Fill target classes
  for (int s=0; s<gpuNode->nSubs; s++) {
    struct ncclXmlNode* sub = gpuNode->subs[s];
    if (strcmp(sub->name, "nvlink") != 0) continue;
    int index;
    NCCLCHECK(xmlGetAttrIndex(sub, "tclass", &index));
    if (index == -1) {
      const char* busId;
      NCCLCHECK(xmlGetAttr(sub, "target", &busId));
      if (strcmp(busId, "fffffff:ffff:ff") == 0) {
        // Remote NVLink device is not visible inside this VM. Assume NVSwitch.
        NCCLCHECK(xmlSetAttr(sub, "tclass", "0x068000"));
      } else {
        char* path;
        NCCLCHECK(getPciPath(busId, &path));
        NCCLCHECK(ncclTopoSetAttrFromSys(sub, path, "class", "tclass"));
      }
    }
  }
  *gpuNodeRet = gpuNode;
  return ncclSuccess;
}
First a "gpu" node is created under the GPU's XML "pci" node, and its attributes are set, such as dev and the compute capability sm. Then NVLink information is gathered: every possible NVLink is enumerated and queried via nvmlDeviceGetNvLinkCapability.
If an NVLink is enabled, a "nvlink" node is created under the "gpu" node with a "target" attribute holding the PCIe busId of the peer at the other end. "nvlink" nodes with the same "target" are merged into one, with "count" recording how many NVLinks connect the two endpoints, and "tclass" recording what kind of PCI device the "target" is.
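The sketch below shows the same kind of NVLink enumeration using the public NVML API directly rather than NCCL's wrapNvml* wrappers (it assumes the NVML headers and library are installed); it prints the PCIe busId on the far side of each active NVLink of GPU 0, which is what ends up in the "target" attribute:

#include <nvml.h>
#include <stdio.h>

int main() {
  if (nvmlInit() != NVML_SUCCESS) return 1;
  nvmlDevice_t dev;
  if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

  for (unsigned int l = 0; l < NVML_NVLINK_MAX_LINKS; l++) {
    nvmlEnableState_t isActive;
    if (nvmlDeviceGetNvLinkState(dev, l, &isActive) != NVML_SUCCESS) continue;
    if (isActive != NVML_FEATURE_ENABLED) continue;         // skip links that are down

    nvmlPciInfo_t remote;
    if (nvmlDeviceGetNvLinkRemotePciInfo_v2(dev, l, &remote) != NVML_SUCCESS) continue;
    printf("nvlink %u -> %s\n", l, remote.busId);           // peer busId, e.g. an NVSwitch
  }
  nvmlShutdown();
  return 0;
}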
That completes ncclTopoFillGpu. The resulting XML is shown below (only one NIC is included); note that "gpu" and its parent node both refer to the same physical GPU.
Back in ncclTopoGetSystem, the "gpu" node's rank and gdr attributes are then set.
All NICs then go through a process similar to the GPU one above: ncclTopoGetXmlFromSys builds up the XML tree, as shown below for a single NIC, where "net", "nic", and "nic"'s parent node all refer to the same NIC.
<system version="1">
  <cpu numaid="0" affinity="00000000,0000000f,ffff0000,00000000,000fffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:11:00.0" class="0x060400" link_speed="8 GT/s" link_width="16">
      <pci busid="0000:13:00.0" class="0x060400" link_speed="8 GT/s" link_width="16">
        <pci busid="0000:15:00.0" class="0x060400" link_speed="8 GT/s" link_width="16">
          <pci busid="0000:17:00.0" class="0x030200" link_speed="16 GT/s" link_width="16">
            <gpu dev="0" sm="80" rank="0" gdr="1">
              <nvlink target="0000:e7:00.0" count="2" tclass="0x068000"/>
              <nvlink target="0000:e4:00.0" count="2" tclass="0x068000"/>
              <nvlink target="0000:e6:00.0" count="2" tclass="0x068000"/>
              <nvlink target="0000:e9:00.0" count="2" tclass="0x068000"/>
              <nvlink target="0000:e5:00.0" count="2" tclass="0x068000"/>
              <nvlink target="0000:e8:00.0" count="2" tclass="0x068000"/>
            </gpu>
          </pci>
        </pci>
      </pci>
      <pci busid="0000:1c:00.0" class="0x020000" link_speed="8 GT/s" link_width="16">
        <nic>
          <net name="mlx5_0" dev="0" speed="100000" port="1" guid="0x82d0c0003f6ceb8" maxconn="262144" gdr="1"/>
        </nic>
      </pci>
    </pci>
  </cpu>
</system>
To summarize, this article covered NCCL's topology detection: the PCI tree containing the GPUs and NICs is built up from sysfs into an XML tree.
(Original post: https://blog.csdn.net/KIDGIN7439/article/details/126990961)
Welcome to star and try the latest version of OneFlow: https://github.com/Oneflow-Inc/oneflow/