cudaErrorCudartUnloading: troubleshooting and suggested fixes



Heads-up: I am also looking for development positions in heterogeneous computing, high-performance computing, CUDA, or ARM optimization.

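For a while now I have been responsible for optimizing our company's neural network forward (inference) framework library. A few days ago I picked up a bug report whose error message looked roughly like this: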

Program hit cudaErrorCudartUnloading (error 29) due to "driver shutting down" on CUDA API call to cudaFreeHost.

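Executables linked against the same library sometimes hit this problem and sometimes did not, so at first I naturally assumed the bug was in the application using the library. After ruling that out, the cudaFreeHost at the end of the message led me to assume it was a memory problem, and only after a good deal of flailing did I realize I was on the wrong track yet again. I also noticed that no matter which CUDA runtime API I called just before the failing code, the same error appeared.
I then searched online. The bug reports below offer no concrete solution, but their similar call stacks made me suspect they describe the same problem I had, and they shifted my suspicion from cudaFreeHost to "driver shutting down".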

https://github.com/opencv/ope…
https://github.com/BVLC/caffe…
https://stackoverflow.com/que…
https://github.com/NVlabs/SAS…
https://blog.csdn.net/jobbofh…

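Forcibly preventing "driver shutting down"?

A seemingly obvious first idea: can we keep the CUDA driver from being shut down while we are still using the CUDA API? The question is what "driver shutting down" actually refers to. Taken literally, cudaErrorCudartUnloading most likely means that the CUDA runtime library has been unloaded.
Since we use shared libraries, I tried adding a dlopen just before the failing call to force libcudart.so to be loaded. I quickly saw that this could not be right: if the shared library had really been unloaded, calling a CUDA API should fail because the symbols are undefined, rather than the call going through normally and returning an error code as a runtime error.
Besides, strace showed me that other shared libraries such as libcuda.so and libnvidia-fatbinaryloader.so are loaded as well, and trying them all one by one is not realistic. There are quite a few CUDA-related shared libraries (see "Chapter 5. Listing of Installed Components" in the "NVIDIA Accelerated Linux Graphics Driver README and Installation Guide"), and different programs depend on different ones, so even if this approach worked it would be hard to generalize.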

As it happens, a developer on the NVIDIA developer forum had a similar idea, and it was shot down by an NVIDIA representative:

For instance, can I have my class maintain certain variables/handles that will force cuda run time library to stay loaded.

No. It is a bad design practice to put calls to the CUDA runtime API in constructors that may run before main and destructors that may run after main.

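What does it take for the CUDA runtime API to work properly?

As CUDA application developers, we usually issue our instructions to the GPU by calling the CUDA runtime API. So let us first look at who gets involved when a program calls a CUDA runtime API. I have borrowed a figure from Nicholas Wilt's "The CUDA Handbook":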

{% img http://galoisplusplus.coding…. %}

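As the figure shows, the main players are the CUDART (CUDA Runtime) library, which runs in the operating system's user mode (as a shared library, libcudart.so), the CUDA driver library, also in user mode (as a shared library, libcuda.so), and the CUDA driver kernel modules, which run in kernel mode. Our CUDA application runs in user mode and cannot operate the GPU hardware directly; what has the authority to control the GPU is the kernel modules running in kernel mode. (Off topic: as CUDA users we rarely notice these kernel modules, and they probably make their presence felt most when we run into a "Driver/library version mismatch" error.) On Linux, lsmod | grep nvidia lists these modules; typically there is nvidia_uvm, which manages Unified Memory, nvidia_drm, the display driver for the Linux kernel's Direct Rendering Manager, and nvidia_modeset. The user-mode CUDA driver library is what talks to these kernel modules: the CUDART library translates each CUDA runtime API call we make into a series of CUDA driver API calls and hands them to the CUDA driver library, which acts as the intermediary between the kernel modules and the other user-mode CUDA libraries.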

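So for the instruction behind a CUDA runtime API call to reach the GPU, all of these players have to cooperate. That naturally raises a question: while our program runs, when do they start and stop working, and when are they initialized? Let us strace a CUDA application and look at its system calls.
First, shared libraries such as libcudart.so, libcuda.so and libnvidia-fatbinaryloader.so are loaded. The file /proc/modules, which lists the modules currently loaded into the kernel, is read; since nvidia_uvm, nvidia_drm and the others are already loaded, no extra insmod is needed. Next, the device parameter file /proc/driver/nvidia/params is read, and the relevant devices are opened and configured through ioctl: /dev/nvidia0 (GPU card 0), /dev/nvidia-uvm (judging by the name it is related to Unified Memory, possibly the Page Migration Engine of Pascal-generation NVIDIA GPUs), /dev/nvidiactl, and so on. (Some files under ~/.nv/ComputeCache in the home directory are also used. That directory caches the fat binaries produced by JIT-compiling PTX pseudo-assembly and is unrelated to our problem; if you are curious, see Mark Harris's "CUDA Pro Tip: Understand Fat Binaries and JIT Caching".) For a CUDA runtime API call to execute normally, the shared libraries must be loaded, the kernel modules must be loaded, and the GPU devices must be set up as described above.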
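To make the /proc/modules step concrete, here is a minimal sketch of what lsmod | grep nvidia boils down to. The function names nvidia_modules and loaded_nvidia_modules are my own, and the only assumption about the file format is that each line starts with the module name followed by whitespace.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Collect the names of kernel modules whose name starts with "nvidia".
// The input is expected to look like /proc/modules: one module per line,
// with the module name as the first whitespace-separated field.
std::vector<std::string> nvidia_modules(std::istream& in) {
    std::vector<std::string> found;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string name;
        if (!(fields >> name)) continue;  // skip blank lines
        assert(!name.empty());            // a successful extraction yields a token
        if (name.compare(0, 6, "nvidia") == 0) found.push_back(name);
    }
    return found;
}

// Convenience wrapper for the live system. Returns an empty list when
// /proc/modules cannot be opened, for example on a non-Linux machine.
std::vector<std::string> loaded_nvidia_modules() {
    std::ifstream proc("/proc/modules");
    if (!proc) return {};
    return nvidia_modules(proc);
}
```

Taking a std::istream keeps the parsing separate from the file access, so the same function can be fed captured output from another machine.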

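But that is only one necessary condition, seen from the system-call angle. There is another one that anyone who has written CUDA will recognize: the CUDA context (if it has slipped your mind, revisit the part of the official CUDA guide on initialization and contexts). As we all know, every CUDA resource (allocated memory, CUDA events, and so on) and every operation is valid only within a CUDA context; on the first call to a CUDA runtime API, if the current device has no CUDA context yet, a new one is created and becomes the device's primary context. All of this is hidden from users of the CUDA runtime API, so who does it? Let me quote the community wiki answer to a Stack Overflow question: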

The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API calls which rely on a CUDA context can be executed. If your object containing CUDA runtime API calls in its destructor invokes the API after the context is torn down, your code may fail with a runtime error.

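This passage tells us two things. First, nvcc inserts code that does the preparatory work for creating and destroying the CUDA context. Second, calling a CUDA runtime API after the context has been destroyed may produce a runtime error, which is effectively undefined behaviour (UB).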

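Let us dig a little deeper. We have several .o files produced by compiling .cu files with nvcc, plus the executable exe linked from them. Inspecting the .o files with a tool such as nm, it is easy to see that a function whose name starts with __sti____cudaRegisterAll_ has been inserted into the text section of each one. If we run gdb <exe>, set a breakpoint on one of these functions and single-step, we see a call stack like this: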

(gdb) bt
#0  0x00002aaab16695c0 in __cudaRegisterFatBinary () at /usr/local/cuda/lib64/libcudart.so.8.0
#1  0x00002aaaaad3eee1 in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() ()
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:98
#2  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#3  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#4  0x0000000000000001 in  ()
#5  0x00007fffffffe2a8 in  ()
#6  0x0000000000000000 in  ()

After a few more steps, the call stack becomes:

(gdb) bt
#0  0x00002aaab16692b0 in __cudaRegisterFunction () at /usr/local/cuda/lib64/libcudart.so.8.0
#1  0x00002aaaaad3ef3e in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() (__T263=0x7c4b30)
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:97
#2  0x00002aaaaad3ef3e in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() ()
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:98
#3  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#5  0x0000000000000001 in  ()
#6  0x00007fffffffe2a8 in  ()
#7  0x0000000000000000 in  ()

(gdb) bt
#0  0x00002aaaaae8ea20 in atexit () at XXX.so
#1  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#2  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#3  0x0000000000000001 in  ()
#4  0x00007fffffffe2a8 in  ()
#5  0x0000000000000000 in  ()

So when is the CUDA context actually created? Setting a breakpoint on cuInit shows that, consistent with the official guide, it happens on the first CUDA runtime API call after entering main:

(gdb) bt
#0  0x00002aaab1ab7440 in cuInit () at /lib64/libcuda.so.1
#1  0x00002aaab167add5 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#2  0x00002aaab167ae31 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#3  0x00002aaabe416bb0 in pthread_once () at /lib64/libpthread.so.0
#4  0x00002aaab16ad919 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#5  0x00002aaab167700a in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#6  0x00002aaab167aceb in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#7  0x00002aaab16a000a in cudaGetDevice () at /usr/local/cuda/lib64/libcudart.so.8.0
...
#10 0x0000000000405d77 in main(int, char**) (argc=<optimized out>, argv=<optimized out>)
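The pthread_once frame in this backtrace suggests a run-once, lazy initialization on the first API call. Below is a minimal model of that pattern in plain C++; it is a sketch, not CUDART's actual code. Every name in it (fake_driver_init, fake_get_device_count, and so on) is a hypothetical stand-in, and a function-local static plays the role of pthread_once.

```cpp
#include <cassert>

// Hypothetical stand-in for the driver-level initialization (cuInit plus
// primary-context creation). It only counts how often it runs.
static int g_init_runs = 0;

static bool fake_driver_init() {
    ++g_init_runs;
    return true;
}

// Every "runtime API" entry point funnels through this helper. The
// function-local static is initialized exactly once, on the first call,
// and C++11 guarantees that this initialization is thread-safe.
static void ensure_initialized() {
    static const bool ready = fake_driver_init();
    assert(ready);
    (void)ready;  // avoids an unused-variable warning when NDEBUG is set
}

// Hypothetical API call: reports one fake device after lazy initialization.
int fake_get_device_count() {
    ensure_initialized();
    return 1;
}

// Number of times the initialization has actually run so far.
int init_runs() { return g_init_runs; }
```

Calling fake_get_device_count() any number of times leaves init_runs() at 1, which mirrors why only the first runtime API call in main shows cuInit on its stack.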

Several of the functions seen in the call stacks above that relate to context creation are declared in ${CUDA_PATH}/include/crt/host_runtime.h:

#define __cudaRegisterBinary(X)                                                   \
        __cudaFatCubinHandle = __cudaRegisterFatBinary((void*)&__fatDeviceText); \
        {void (*callback_fp)(void **) =  (void (*)(void **))(X); (*callback_fp)(__cudaFatCubinHandle); }\
        atexit(__cudaUnregisterBinaryUtil)
       

extern "C" {
extern void** CUDARTAPI __cudaRegisterFatBinary(void *fatCubin);

extern void CUDARTAPI __cudaUnregisterFatBinary(void **fatCubinHandle);

extern void CUDARTAPI __cudaRegisterFunction(
        void   **fatCubinHandle,
  const char    *hostFun,
        char    *deviceFun,
  const char    *deviceName,
        int      thread_limit,
        uint3   *tid,
        uint3   *bid,
        dim3    *bDim,
        dim3    *gDim,
        int     *wSize
);
}

static void **__cudaFatCubinHandle;

static void __cdecl __cudaUnregisterBinaryUtil(void)
{____nv_dummy_param_ref((void *)&__cudaFatCubinHandle);
  __cudaUnregisterFatBinary(__cudaFatCubinHandle);
}

None of these functions are documented, however. Dr. Yong Li's "GPGPU-SIM Code Study" goes into a bit more detail, so I will simply quote it here:

The simplest way to look at how nvcc compiles the ECS (Execution Configuration Syntax) and manages kernel code is to use nvcc’s --cuda switch. This generates a .cu.c file that can be compiled and linked without any support from NVIDIA proprietary tools. It can be thought of as CUDA source files in open source C. Inspection of this file verified how the ECS is managed, and showed how kernel code was managed.

Device code is embedded as a fat binary object in the executable’s .rodata section. It has variable length depending on the kernel code.

For each kernel, a host function with the same name as the kernel is added to the source code.

Before main(..) is called, a function called cudaRegisterAll(..) performs the following work:

• Calls a registration function, cudaRegisterFatBinary(..), with a void pointer to the fat binary data. This is where we can access the kernel code directly.

• For each kernel in the source file, a device function registration function, cudaRegisterFunction(..), is called. With the list of parameters is a pointer to the function mentioned in step 2.

As aforementioned, each ECS is replaced with the following function calls from the execution management category of the CUDA runtime API.

• cudaConfigureCall(..) is called once to set up the launch configuration.

• The function from the second step is called. This calls another function, in which, cudaSetupArgument(..) is called once for each kernel parameter. Then, cudaLaunch(..) launches the kernel with a pointer to the function from the second step.

An unregister function, cudaUnregisterBinaryUtil(..), is called with a handle to the fatbin data on program exit.

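Of these, cudaConfigureCall, cudaSetupArgument and cudaLaunch have been deprecated since after CUDA 7.5; because they are not called before entering main anyway, we can ignore them. What matters is that before main is called, the internal initialization code added by nvcc does the following (we can confirm this from the interface exposed by host_runtime.h above and from the call stacks):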

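• It registers the fat binary entry through __cudaRegisterFatBinary. This is part of the preparation for creating a CUDA context, and calling a CUDA runtime API before __cudaRegisterFatBinary has run is also likely to be UB. There is a Stack Overflow question on exactly this: the asker launched a kernel from the constructor of a static object and got an "invalid device function" error. The answer by talonmies, a CUDA expert on Stack Overflow, examines the order in which static object constructors and __cudaRegisterFatBinary are called and the problems it causes; it is well worth reading.
• It registers each device kernel function through __cudaRegisterFunction.
• It registers the unregister function __cudaUnregisterBinaryUtil through atexit. This function is part of the cleanup for destroying the CUDA context. As mentioned earlier, once the CUDA context is destroyed the CUDA runtime API will most likely stop working; in other words, calling a CUDA runtime API after __cudaUnregisterBinaryUtil has run may be UB. When __cudaUnregisterBinaryUtil is called follows the rules of atexit: at some stage of program exit after main has finished (for how main is executed, see the article "The thorny path of Hello World"). This is the key to understanding and solving the cudaErrorCudartUnloading problem.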

{% img http://galoisplusplus.coding…. %}
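The three steps above can be imitated in plain C++ to see the shape of what nvcc generates. This is a sketch under my own assumptions: the fake_* functions are hypothetical stand-ins for the undocumented __cudaRegisterFatBinary, __cudaRegisterFunction and __cudaUnregisterFatBinary, and two kernels per translation unit is an arbitrary choice.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-ins for the undocumented runtime entry points.
static bool g_fatbin_registered = false;
static int g_kernels_registered = 0;

static void fake_register_fat_binary() { g_fatbin_registered = true; }

static void fake_register_function() {
    assert(g_fatbin_registered);  // kernels are registered against a fat binary
    ++g_kernels_registered;
}

static void fake_unregister_fat_binary() { g_fatbin_registered = false; }

// Mirrors the shape of a generated __sti____cudaRegisterAll_* function:
// register the fat binary, register each kernel, then schedule the
// unregister call with atexit.
static bool register_all() {
    fake_register_fat_binary();
    fake_register_function();  // one call per kernel in the translation unit
    fake_register_function();
    return std::atexit(fake_unregister_fat_binary) == 0;
}

// A namespace-scope variable with a dynamic initializer: in practice this
// runs during start-up, before main, without any call from user code.
static const bool g_atexit_scheduled = register_all();

bool fatbin_registered() { return g_fatbin_registered; }
int kernels_registered() { return g_kernels_registered; }
bool atexit_scheduled() { return g_atexit_scheduled; }
```

By the time main starts, the registration has already happened and the unregister handler is already queued for program exit, which is why user code never sees either step.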

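It is all the global object's fault

Once you have digested all of my long-winded theory above, tracking down cudaErrorCudartUnloading in the code becomes simple. It turned out to resemble the Stack Overflow question mentioned earlier: our code also uses a global static singleton object, and the singleton's destructor calls CUDA runtime APIs to free memory and so on. Static objects are destroyed during exit after main has finished, and as noted above __cudaUnregisterBinaryUtil is called during the same stage, with no defined order between the two. If __cudaUnregisterBinaryUtil and the other context cleanup run before the static object is destroyed, the destructor's call fails with cudaErrorCudartUnloading. This UB also explains why some executables linked against our library hit the problem while others did not.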
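A toy model may help make the ordering argument tangible. It rests on one rule I am sure of (functions registered with atexit and destructors of static objects run in the reverse order of registration or construction) and on one loudly labeled assumption taken from the reasoning in this article: that the runtime queues its own teardown when it is first initialized. ExitStack, FakeRuntime and simulate are all invented for illustration, and 29 is simply the error number quoted in the message at the top.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model of process exit: a LIFO stack of handlers, imitating how
// atexit handlers and static destructors run in reverse order.
class ExitStack {
public:
    void push(std::function<void()> fn) { handlers_.push_back(std::move(fn)); }
    void run() {
        while (!handlers_.empty()) {
            std::function<void()> fn = std::move(handlers_.back());
            handlers_.pop_back();
            fn();
        }
    }
private:
    std::vector<std::function<void()>> handlers_;
};

// Hypothetical runtime. ASSUMPTION: teardown is queued at first init.
struct FakeRuntime {
    bool alive = false;
    void init(ExitStack& exits) {
        if (alive) return;
        alive = true;
        exits.push([this] { alive = false; });  // the runtime's own teardown
    }
    int free_host() const { return alive ? 0 : 29; }  // 29: "unloading"
};

// Returns the error code observed by the singleton's "destructor".
// singleton_first == true means the singleton is constructed before the
// runtime is initialized, so it is destroyed after the runtime teardown.
int simulate(bool singleton_first) {
    ExitStack exits;
    FakeRuntime rt;
    int seen = -1;
    auto construct_singleton = [&] {
        exits.push([&] { seen = rt.free_host(); });  // destructor body
    };
    if (singleton_first) {
        construct_singleton();
        rt.init(exits);
    } else {
        rt.init(exits);
        construct_singleton();
    }
    exits.run();
    assert(seen != -1);  // the destructor must have run exactly once
    return seen;
}
```

In this model simulate(true) yields 29 and simulate(false) yields 0: the same code succeeds or fails purely depending on which side was set up first, which matches the "some executables fail, some do not" symptom.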

Solutions

Searching GitHub for patches related to cudaErrorCudartUnloading turns up all sorts of approaches. Here are a few.

Skip the `cudaErrorCudartUnloading` check

For example, this patch in the arrayfire project. Fair enough, very zen (smirk).

-    CUDA_CHECK(cudaFree(ptr));
+    cudaError_t err = cudaFree(ptr);
+    if (err != cudaErrorCudartUnloading) // see issue #167
+        CUDA_CHECK(err);
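The same idea can be packaged as a small helper so the exception is made in one place. This is a sketch, not the arrayfire code: FakeCudaError and its enumerators are hypothetical stand-ins so that it compiles without the CUDA toolkit, and 29 is only the number quoted in the error message at the top of this article (other toolkit versions may number the error differently).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for cudaError_t values; not the real enum.
enum FakeCudaError {
    kFakeSuccess = 0,
    kFakeInvalidValue = 1,
    kFakeCudartUnloading = 29
};

// An error is worth reporting unless the call succeeded or the runtime is
// already unloading, in which case nothing useful can be done any more.
bool is_reportable(FakeCudaError err) {
    return err != kFakeSuccess && err != kFakeCudartUnloading;
}

// Plays the role of CUDA_CHECK: throws on reportable errors only.
void check_cuda(FakeCudaError err) {
    if (!is_reportable(err)) return;
    assert(err != kFakeSuccess);  // only genuine failures get this far
    throw std::runtime_error("CUDA call failed with error " +
                             std::to_string(static_cast<int>(err)));
}
```

Note that this only silences the symptom: the resource is still not released by the failing call, which is harmless at process exit but says nothing about the ordering problem itself.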

Simply remove the CUDA runtime API calls that might return `cudaErrorCudartUnloading`

For example, this issue and PR in the kaldi project. When it comes to being zen, nobody beats this (smirk).

Move the CUDA runtime API calls into a separate de-initialisation or finalize interface that the user calls before `main` returns

For example, MXNotifyShutdown in the MXNet project (see c_api.cc). After all that zen, here at last is an "elegant" approach that suits this programmer's taste (smirk).

As it happens, in a comment on another Stack Overflow question, talonmies (yes, talonmies again!) makes the same point, and I could not agree more:

The obvious answer is don’t put CUDA API calls in the destructor. In your class you have an explicit intialisation method not called through the constructor, so why not have an explicit de-initialisation method as well? That way scope becomes a non-issue
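Here is a minimal sketch of what such an explicit interface might look like. GpuWorkspace and its members are hypothetical names of my own; the counters stand in for the real acquire and release calls (for instance cudaMallocHost and cudaFreeHost) so that the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>

// Hypothetical resource owner with explicit init/shutdown methods.
class GpuWorkspace {
public:
    // Explicit initialization, meant to be called inside main.
    void init() {
        if (live_) return;  // idempotent
        ++acquire_calls_;   // a real version would allocate GPU resources here
        live_ = true;
    }

    // Explicit de-initialization, meant to be called before main returns.
    void shutdown() {
        if (!live_) return;  // idempotent
        assert(acquire_calls_ == release_calls_ + 1);
        ++release_calls_;    // a real version would free GPU resources here
        live_ = false;
    }

    // The destructor deliberately makes no API call, so it stays harmless
    // even if it runs after the runtime has been torn down.
    ~GpuWorkspace() {}

    bool live() const { return live_; }
    int acquire_calls() const { return acquire_calls_; }
    int release_calls() const { return release_calls_; }

private:
    bool live_ = false;
    int acquire_calls_ = 0;
    int release_calls_ = 0;
};
```

Making both methods idempotent is a design choice: a user who calls shutdown() twice, or never calls init(), gets no surprise. The cost is that a user who forgets shutdown() leaks the resource until the process exits.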

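Elegant as it is, this approach leaves library maintainers with a new worry: what if users push back against the extra interface? (smirk) What if they simply ignore you and never call it before main returns? You can tell them they are using the library wrong, and they can just as well reply that the library is not robust enough and criticize you instead. If you share this worry, read on for the "hack" below.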

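A homemade hack (smirk)

To begin with, CUDA runtime API calls still cannot go in a global object's destructor, so where should they go? After all, we cannot know which API the library user will call last. What we can know is which API the user calls while inside the scope of main, at a point where a valid CUDA context can be created and the CUDA runtime API works normally. What does that have to do with the CUDA runtime API calls in our destructor? You may recall that the internal initialization code added by nvcc registers the unregister function __cudaUnregisterBinaryUtil through atexit. We can do the same: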

// First call a "harmless" CUDA runtime API so that the CUDA context exists before `atexit` is called.
// This ensures the function we register with `atexit` runs before the context teardown functions
// (for example `__cudaUnregisterBinaryUtil`).
// "Harmless" here means no side effects such as extra memory use; I used `cudaGetDeviceCount`.
// "The CUDA Handbook" recommends `cudaFree(0);` to make CUDART initialize the CUDA context, which also works.
int gpu_num;
cudaError_t err = cudaGetDeviceCount(&gpu_num);

std::atexit([]() {
    // Call the CUDA runtime APIs that used to be in the global object's destructor.
});

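So where should the code above be inserted? Whoever tied the knot is best placed to untie it: our cudaErrorCudartUnloading problem comes from the static singleton object, yet the lazy initialization of that very singleton gives us a perfect entry point: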

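// Off topic: readers who are as middle-aged as I am may remember
// how painful a thread-safe singleton used to be in C++,
// with slightly nasty idioms such as Double-Checked Locking.
// Since C++11, though, life has been much easier (smirk).
// The implementation below is guaranteed to be thread-safe by the C++11 standard.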
static Singleton& instance()
{
     static Singleton s;
     return s;
}

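Because library users only reach the singleton through this interface from within main, all we need to do is initialize the CUDA context and register the cleanup function with atexit inside it. Of course, as a careful library author you may ask: we should have no illusions about users, so what if someone calls it while initializing a global variable? Bingo! All I can say is that our current workflow gives users no reason to torture themselves by writing it that way (facepalm). If a user really does that, this method stops working and they will see an error similar to the one in the Stack Overflow question mentioned earlier. Even a hack is not a cure-all.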
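Putting the two snippets together, the entry point might look roughly like the sketch below. It is not our production code: touch_runtime stands in for the "harmless" call such as cudaGetDeviceCount, release_gpu_resources stands in for the former destructor body, and the counters exist only so the behaviour can be observed without a GPU.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-ins; they only count invocations.
static int g_runtime_touches = 0;
static int g_cleanup_registrations = 0;

static void touch_runtime() { ++g_runtime_touches; }
static void release_gpu_resources() { /* former destructor body goes here */ }

// Runs once: first make sure the runtime is up, then register the cleanup,
// so that the cleanup is queued after (and therefore runs before) anything
// the runtime queued during its own initialization.
static bool register_cleanup_once() {
    touch_runtime();
    const int rc = std::atexit(release_gpu_resources);
    assert(rc == 0);
    ++g_cleanup_registrations;
    return rc == 0;
}

class Singleton {
public:
    static Singleton& instance() {
        static Singleton s;  // constructed on first use, thread-safe in C++11
        // Initialized after s, so at exit the cleanup runs while s still exists.
        static const bool registered = register_cleanup_once();
        (void)registered;
        return s;
    }
    Singleton(const Singleton&) = delete;
    Singleton& operator=(const Singleton&) = delete;

private:
    Singleton() = default;
    ~Singleton() = default;  // no runtime API calls here any more
};

int runtime_touches() { return g_runtime_touches; }
int cleanup_registrations() { return g_cleanup_registrations; }
```

Registering after the static object is constructed is deliberate: exit handlers run in reverse order, so the cleanup runs before ~Singleton and may still safely use the object.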

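Afterword

After fixing cudaErrorCudartUnloading, I was handed another firefighting task: tracking down a crash caused by a hardware dongle API. The dongle problem and cudaErrorCudartUnloading look completely unrelated, yet at heart they are alike: the same kind of UB, global objects again, and again calls to the dongle API during the construction and destruction of a global object, with no defined order relative to the dongle's internal initialization and teardown functions. It seems that avoiding such pitfalls takes one piece of basic common sense: when using interfaces tied to peripheral devices, make sure the calls happen within the scope of main.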

References

The official CUDA guide
Nicholas Wilt, "The CUDA Handbook"
"Chapter 5. Listing of Installed Components" in the "NVIDIA Accelerated Linux Graphics Driver README and Installation Guide"
Mark Harris, "CUDA Pro Tip: Understand Fat Binaries and JIT Caching"
Dr. Yong Li, "GPGPU-SIM Code Study"
"The thorny path of Hello World"

