GPU: Elegantly Implementing sleep on NVIDIA GPUs

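During testing, or in other situations, we may need the GPU to sleep for a while after it finishes a given step. This can be done with clock64(), a function provided by the CUDA C programming interface. Here is the description of clock64() from the CUDA manual: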

when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result per thread provides a measure for each thread of the number of clock cycles taken by the device to completely execute the thread, but not of the number of clock cycles the device actually spent executing thread instructions. The former number is greater than the latter since threads are time sliced.

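In other words, clock64() returns the clock cycle count of the SM on which the thread is running. Sampling it at the start and at the end of a thread and taking the difference gives the total number of clock cycles the thread took to complete. This is greater than the number of cycles the thread actually spent executing instructions, because the threads on an SM are time sliced.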
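The sampling pattern described in the manual can be sketched as follows. This is a minimal illustration, not code from the manual: the kernel name timed_kernel, the helper measure_thread0_cycles, and the arithmetic loop used as a workload are all hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread samples the per-SM counter before and after its work
// and stores the difference in elapsed[i].
__global__ void timed_kernel(long long *elapsed, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long start = clock64();

    // Placeholder workload (hypothetical): some arithmetic to time.
    float x = (float)i;
    for (int k = 0; k < 1000; ++k)
        x = x * 1.0001f + 0.5f;
    out[i] = x;  // store the result so the loop is not optimized away

    long long stop = clock64();
    elapsed[i] = stop - start;
}

// Run the kernel once and return the cycle count recorded by thread 0.
long long measure_thread0_cycles() {
    const int threads = 64;
    long long *d_elapsed;
    float *d_out;
    cudaMalloc(&d_elapsed, threads * sizeof(long long));
    cudaMalloc(&d_out, threads * sizeof(float));

    timed_kernel<<<1, threads>>>(d_elapsed, d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    long long h_elapsed[threads];
    cudaMemcpy(h_elapsed, d_elapsed, sizeof(h_elapsed), cudaMemcpyDeviceToHost);

    cudaFree(d_elapsed);
    cudaFree(d_out);
    return h_elapsed[0];
}

int main() {
    printf("thread 0 took %lld cycles\n", measure_thread0_cycles());
    return 0;
}
```

The printed value depends on the device and its current clock. As the manual notes, it counts cycles from start to finish of the thread, including any cycles in which other threads were scheduled on the SM.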

So, to implement a delay function on the device elegantly, we call clock64() in device code. Its prototype is long long int clock64(). The implementation is as follows:

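#define CLOCK_RATE 1695000  /* SM clock in kHz; modify for different device */
__device__ void sleep(float t) {
    /* Busy-wait until roughly t seconds' worth of clock cycles have passed. */
    long long t0 = clock64();  /* clock64() returns long long int */
    long long t1 = t0;
    while ((t1 - t0) / (CLOCK_RATE * 1000.0f) < t)
        t1 = clock64();
}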
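As a usage sketch, the delay function can be called from a kernel and the launch timed from the host with CUDA events. The names gpu_sleep, sleep_kernel and time_sleep_ms are hypothetical. The device function is renamed here only to keep it visually distinct from the host sleep(), and CLOCK_RATE keeps the example value from this post, so the measured delay will only match the request on a device that actually runs at that clock. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CLOCK_RATE 1695000  /* SM clock in kHz; example value, device-specific */

// Same busy-wait as above, under a hypothetical name.
__device__ void gpu_sleep(float t) {
    long long t0 = clock64();
    long long t1 = t0;
    while ((t1 - t0) / (CLOCK_RATE * 1000.0f) < t)
        t1 = clock64();
}

__global__ void sleep_kernel(float seconds) {
    gpu_sleep(seconds);
}

// Launch the kernel and return how long it took, in milliseconds.
float time_sleep_ms(float seconds) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    sleep_kernel<<<1, 1>>>(seconds);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    printf("requested 500 ms, measured %.1f ms\n", time_sleep_ms(0.5f));
    return 0;
}
```

If the measured time is noticeably off, the most likely cause is that the hard-coded CLOCK_RATE does not match the clock the device is actually running at.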

The CLOCK_RATE value in the code above can be obtained as follows:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   /* query device 0 */
int clock_rate = prop.clockRate;     /* clockRate is an int, in kHz */

The clock rate obtained here is in kilohertz, so to get a delay measured in seconds the sleep function multiplies it by 1000, as in CLOCK_RATE*1000.0f.
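As a worked example of the conversion, a rate of 1695000 kHz is 1695000 × 1000 = 1,695,000,000 cycles per second, so a 0.5 s delay corresponds to 847,500,000 cycles. The sketch below is one possible variant, not the code from this post: it queries the rate at run time instead of hard-coding it, converts seconds to a cycle count once on the host, and lets the device compare integers. The names seconds_to_cycles, sleep_cycles and sleep_kernel are hypothetical. It reads the rate through cudaDeviceGetAttribute with cudaDevAttrClockRate, which reports the peak clock in kilohertz. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Host helper (hypothetical name): seconds -> clock cycles for a rate in kHz.
long long seconds_to_cycles(double seconds, int clock_khz) {
    return (long long)(seconds * clock_khz * 1000.0);
}

// Busy-wait until the per-SM counter has advanced by `cycles`.
__device__ void sleep_cycles(long long cycles) {
    long long t0 = clock64();
    while (clock64() - t0 < cycles) { }
}

__global__ void sleep_kernel(long long cycles) {
    sleep_cycles(cycles);
}

int main() {
    // Peak clock frequency of device 0, in kHz.
    int clock_khz = 0;
    cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, 0);

    long long cycles = seconds_to_cycles(0.5, clock_khz);
    printf("clock rate %d kHz, 0.5 s = %lld cycles\n", clock_khz, cycles);

    sleep_kernel<<<1, 1>>>(cycles);
    cudaDeviceSynchronize();  // block until the delay has elapsed
    return 0;
}
```

Because the queried value is a peak rate, and the clock a device actually runs at can be lower, the resulting delay should be treated as approximate.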

The complete code is available.