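Deep Learning: CUDA Operations in PyTorch

CUDA (Compute Unified Device Architecture) is NVIDIA's heterogeneous computing platform. PyTorch provides a dedicated module, torch.cuda, for configuring and running CUDA-related operations. The local environment is Windows 10, Python 3.7.8, and CUDA 11.6. The latest stable PyTorch release at the time of writing, 1.12.1, is installed as follows: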

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

I. Common CPU and GPU commands

1. Check the PyTorch version

    import torch

    print(torch.__version__)
    # Output: 1.12.1+cu116

2. Check whether a GPU device is available

    print(torch.cuda.is_available())
    # Output: True

3. PyTorch uses the CPU as its default device

    print("default device: {}".format(torch.Tensor([4,5,6]).device))
    # Output: default device: cpu

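4. Check the number of available CPU devices

    import os

    # os.cpu_count() comes from the standard library; there is no need to reach it through torch.cuda.os
    print("available cpu devices: {}".format(os.cpu_count()))
    # Output: available cpu devices: 20

Here the number of CPU devices means the number of logical processors.

5. Check the number of available GPU devices

    print("available gpu devices: {}".format(torch.cuda.device_count()))
    # Output: available gpu devices: 1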

6. Get the name of the GPU device

    print("gpu device name: {}".format(torch.cuda.get_device_name(torch.device("cuda:0"))))
    # Output: gpu device name: NVIDIA GeForce GTX 1080 Ti

7. Specify the cpu:0 device with device="cpu:0"

    device = torch.Tensor([1,2,3], device="cpu:0").device
    print("device type: {}".format(device))
    # Output: device type: cpu

8. Specify the cpu:0 device with torch.device

    cpu1 = torch.device("cpu:0")
    print("cpu device: {}:{}".format(cpu1.type, cpu1.index))
    # Output: cpu device: cpu:0

9. When only an integer index is given, a CUDA device is assumed by default

    gpu = torch.device(0)
    print("gpu device: {}:{}".format(gpu.type, gpu.index))
    # Output: gpu device: cuda:0

10. Specify the cuda:0 device with torch.device("cuda:0")

    gpu = torch.device("cuda:0")
    print("gpu device: {}:{}".format(gpu.type, gpu.index))
    # Output: gpu device: cuda:0
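The individual queries in this section can be combined into one small helper. The following is a minimal sketch: list_devices is a hypothetical name chosen for illustration, not a PyTorch API. It relies on torch.cuda.device_count() returning 0 when no usable GPU is present, so it should also run on a CPU-only machine.

```python
import os

import torch


def list_devices():
    """Return the CPU device followed by every visible CUDA device."""
    devices = [torch.device("cpu")]
    for index in range(torch.cuda.device_count()):
        # torch.device("cuda", index) is equivalent to torch.device("cuda:<index>")
        devices.append(torch.device("cuda", index))
    return devices


for dev in list_devices():
    if dev.type == "cuda":
        description = torch.cuda.get_device_name(dev)
    else:
        description = "{} logical processors".format(os.cpu_count())
    print("{}: {}".format(dev, description))
```

Building the list once and passing torch.device objects around avoids scattering device strings such as "cuda:0" throughout the code.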

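II. Tensors on CPU and GPU devices

By default a Tensor is created on the CPU. Methods such as to and cuda copy a CPU Tensor to a GPU device, and copy_ copies data between Tensors that may live on different devices. A Tensor can of course also be created directly on the GPU. One difference between torch.tensor and torch.Tensor is that torch.tensor accepts a device argument that places the result on a GPU, whereas torch.Tensor can only create Tensors on the CPU and raises an error otherwise.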
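A further practical difference between the two constructors is the resulting dtype. The sketch below illustrates it; the fallback to the CPU is an assumption added here so that the snippet also runs on a machine without a GPU.

```python
import torch

# torch.Tensor is the tensor class; calling it yields the default tensor type
# (float32 unless the default has been changed).
legacy = torch.Tensor([1, 2, 3])

# torch.tensor is a factory function that infers the dtype from the data.
inferred = torch.tensor([1, 2, 3])

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
placed = torch.tensor([1, 2, 3], dtype=torch.float32, device=device)

print(legacy.dtype, inferred.dtype, placed.device)
```

For new code, torch.tensor with explicit dtype and device arguments is usually the clearer choice.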

1. Copying a Tensor between the CPU and the GPU

    # A tensor is created on the CPU by default
    cpu_tensor = torch.Tensor([[1,4,7],[3,6,9],[2,5,8]])
    print(cpu_tensor.device)
    # Copy cpu_tensor to the GPU with the to() method
    gpu_tensor1 = cpu_tensor.to(torch.device("cuda:0"))
    print(gpu_tensor1.device)
    # Copy cpu_tensor to the GPU with the cuda() method
    gpu_tensor2 = cpu_tensor.cuda(torch.device("cuda:0"))
    print(gpu_tensor2.device)
    # Copy gpu_tensor2 back into cpu_tensor on the CPU
    gpu_tensor3 = cpu_tensor.copy_(gpu_tensor2)
    print(gpu_tensor3.device)
    print(gpu_tensor3)

The output is as follows:

    cpu
    cuda:0
    cuda:0
    cpu
    tensor([[1., 4., 7.],
            [3., 6., 9.],
            [2., 5., 8.]])

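The copy_() method deserves a short explanation. Its declaration looks like this:

    def copy_(self, src, non_blocking=False):
        ......
        return _te.Tensor(*(), **{})

It copies the elements of src into the self tensor and then returns self. In gpu_tensor3 = cpu_tensor.copy_(gpu_tensor2), for example, the contents of gpu_tensor2 on the GPU are copied into cpu_tensor on the CPU. Despite its name, gpu_tensor3 is therefore the same CPU tensor object as cpu_tensor.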
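The in-place behavior of copy_() can be checked directly. This is a minimal sketch with illustrative values; the source tensor is placed on the GPU only when one is available, which is an assumption added so the snippet also runs on CPU-only machines.

```python
import torch

# Source tensor: on the GPU if present, otherwise on the CPU.
src_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
src = torch.tensor([[1.0, 4.0, 7.0], [3.0, 6.0, 9.0]], device=src_device)

# Destination tensor on the CPU with the same shape.
dst = torch.zeros(2, 3)

# copy_() overwrites dst in place and returns dst itself.
result = dst.copy_(src)

print(result is dst)
print(dst.device, torch.equal(dst, src.cpu()))
```

Because the destination keeps its own device and dtype, copy_() is handy for filling a preallocated buffer without creating a new tensor.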

2. Creating a Tensor directly on the GPU

    gpu_tensor1 = torch.tensor([[2,5,8],[1,4,7],[3,6,9]], device=torch.device("cuda:0"))
    print(gpu_tensor1.device)
    # Create a tensor of random numbers on the GPU
    print(torch.rand((3,4), device=torch.device("cuda:0")))
    # Create a tensor of zeros on the GPU
    print(torch.zeros((2,5), device=torch.device("cuda:0")))

The output is as follows:

    cuda:0
    tensor([[0.7061, 0.2161, 0.8219, 0.3354],
            [0.1697, 0.1730, 0.1400, 0.2825],
            [0.1771, 0.0473, 0.8411, 0.2318]], device='cuda:0')
    tensor([[0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0.]], device='cuda:0')

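3. CUDA Streams

A Stream is an abstraction of a linear sequence of CUDA commands: commands assigned to a device run in the order in which they were enqueued on a stream. Every device has a default Stream, and new Streams can be created with torch.cuda.Stream(). Commands in different Streams may interleave, so their relative order is not guaranteed. The following example, which uses two different Streams, may therefore produce incorrect results.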

    cuda = torch.device("cuda")
    # Create a new stream s; A below is created on the default stream
    s = torch.cuda.Stream()
    A = torch.randn((1,10), device=cuda)
    for i in range(100):
        # Sum the tensor created on the default stream, but on the new stream
        with torch.cuda.stream(s):
            # Problem: torch.sum() may run before torch.randn() has finished
            B = torch.sum(A)
            print(B)

The problem with this example is that torch.sum() may run before torch.randn() has finished. To guarantee the required ordering, the next version uses the synchronize method:

    cuda = torch.device("cuda")
    s = torch.cuda.Stream()
    A = torch.randn((1,10), device=cuda)
    default_stream = torch.cuda.current_stream()
    print("Default Stream: {}".format(default_stream))
    # Wait until the stream that creates A has finished
    torch.cuda.Stream.synchronize(default_stream)
    for i in range(100):
        # Sum the tensor created on the default stream, but on the new stream
        with torch.cuda.stream(s):
            print("current stream: {}".format(torch.cuda.current_stream()))
            B = torch.sum(A)
            print(B)

The fix is to call torch.cuda.Stream.synchronize(default_stream), which blocks until the stream that creates A has finished, and only then to issue commands on the new Stream.
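
Blocking the host is not the only way to order two streams. As an alternative sketch, Stream.wait_stream() queues the dependency on the GPU side instead. The helper name sum_on_side_stream is hypothetical, and the code assumes the input tensor lives on the current CUDA device; on a machine without a GPU it simply falls back to a plain torch.sum.

```python
import torch


def sum_on_side_stream(a):
    """Sum `a` on a separate CUDA stream; use plain torch.sum on the CPU."""
    if not a.is_cuda:
        return torch.sum(a)
    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()
    # Work queued on `side` after this call starts only once everything
    # already queued on `main` (including the kernel that fills `a`) is done.
    side.wait_stream(main)
    with torch.cuda.stream(side):
        total = torch.sum(a)
    # Later work on `main` must in turn wait for the sum.
    main.wait_stream(side)
    # `total` was allocated on `side` but will be used on `main`.
    total.record_stream(main)
    return total


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
A = torch.ones((1, 10), device=device)
print(sum_on_side_stream(A))
```

Compared with synchronize, this keeps the CPU free to enqueue more work while the GPU resolves the ordering between the two streams.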
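In addition, memory_cached returns the size of the cached memory, max_memory_cached returns its peak, and max_memory_allocated returns the peak allocated memory. In recent PyTorch versions, including 1.12, memory_cached and max_memory_cached are deprecated in favor of memory_reserved and max_memory_reserved. The empty_cache method releases cached memory that is no longer in use.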
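
The memory counters can be gathered in one place. The following is a minimal sketch: cuda_memory_report is a hypothetical helper name, and returning an empty dict on machines without CUDA is a design choice made here so the snippet runs everywhere.

```python
import torch


def cuda_memory_report():
    """Return memory counters in bytes for the current GPU, or {} without CUDA."""
    if not torch.cuda.is_available():
        return {}
    return {
        "allocated": torch.cuda.memory_allocated(),
        "reserved": torch.cuda.memory_reserved(),
        "max_allocated": torch.cuda.max_memory_allocated(),
        "max_reserved": torch.cuda.max_memory_reserved(),
    }


if torch.cuda.is_available():
    x = torch.zeros((1024, 1024), device="cuda")
    print(cuda_memory_report())
    del x
    # Hand unused cached blocks back to the driver.
    torch.cuda.empty_cache()

print(cuda_memory_report())
```

Note that empty_cache does not free memory still held by live tensors; it only releases blocks the caching allocator is keeping for reuse.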

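III. Pinned memory buffers

When a computer runs low on memory, the operating system may page data out of RAM onto the hard disk. A pinned (page-locked) buffer stays resident in RAM and cannot be paged out to disk. A Tensor can be copied into pinned memory either with the pin_memory function or by calling the pin_memory method directly on the Tensor. Why pin memory at all? There is a single goal: copying from a pinned buffer on the CPU to the GPU is faster. The is_pinned method on a Tensor reports whether that Tensor resides in pinned memory.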

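    # Note: this import comes from a private (underscore-prefixed) module of torch.utils.data
    from torch.utils.data._utils.pin_memory import pin_memory

    x = torch.Tensor([[1,2,4], [5, 7, 9], [3, 7, 10]])
    # Copy x into pinned memory with the pin_memory() function
    y = pin_memory(x)
    # Copy x into pinned memory by calling pin_memory() on the tensor itself
    z = x.pin_memory()
    # id() returns the identity of the Python object; pin_memory() returns a copy of the tensor, so the ids differ
    print("id: {}".format(id(x)))
    print("id: {}".format(id(y)))
    print("id: {}".format(id(z)))
    # Once a tensor is in pinned memory, it can be copied to the GPU asynchronously
    a = z.cuda(non_blocking=True)
    print(a)
    print("is_pinned: {}/{}".format(x.is_pinned(), z.is_pinned()))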

The output is as follows:

    id: 1605289350472
    id: 1605969660408
    id: 1605969660248
    tensor([[ 1.,  2.,  4.],
            [ 5.,  7.,  9.],
            [ 3.,  7., 10.]], device='cuda:0')
    is_pinned: False/True

Note: id() shows the identity of a Python object, which in CPython is its memory address.
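
In training code, pinned memory is most often requested through the pin_memory argument of torch.utils.data.DataLoader rather than by pinning tensors by hand. The sketch below uses made-up data and enables pinning only when CUDA is available, an assumption added so that it also runs on CPU-only machines.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Illustrative data: 6 samples with 2 features each.
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.arange(6)

# With pin_memory=True the loader returns batches that live in pinned memory.
loader = DataLoader(TensorDataset(features, labels), batch_size=2, pin_memory=use_cuda)

batch_devices = []
for x, y in loader:
    # non_blocking=True lets the host-to-GPU copy overlap with other work
    # when the source batch is pinned; on the CPU it has no effect.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    batch_devices.append(x.device.type)

print(batch_devices)
```

This keeps the pinning logic inside the data pipeline, so the training loop only needs the non_blocking copy.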

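IV. Device-agnostic code

1. Supporting both CPU and GPU devices

Device-agnostic code simply uses the GPU when one is present and the CPU otherwise, so that a single code base runs on both kinds of device. Whether a GPU exists is determined with torch.cuda.is_available(). A common pattern is:

    device = torch.device("cpu")
    if torch.cuda.is_available():
        device = torch.device("cuda")
    a = torch.tensor([1,2,3], device=device)
    print(a)

The output is as follows:

    tensor([1, 2, 3], device='cuda:0')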

2. Moving a model to the GPU device

Calling the to() method on a Module object moves the model to the GPU device as well:

    class LinearRegression(torch.nn.Module):
        def __init__(self):
            super(LinearRegression, self).__init__()
            self.linear = torch.nn.Linear(1, 1)

        def forward(self, x):
            return self.linear(x)

    regression = LinearRegression().to(device=device)
    for param in regression.parameters():
        print(param)

The printed parameters are all tensors with device='cuda:0', which shows that to() has moved the model to the GPU device.
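
Once the model is on a device, its inputs and targets must be on the same device. The following minimal training sketch puts both together; the synthetic data (y = 2x + 1), the learning rate, and the step count are illustrative assumptions rather than values from this article.

```python
import torch


class LinearRegression(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)


torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic data for y = 2x + 1, created on the CPU and moved to the device.
x = torch.linspace(0, 1, 16).unsqueeze(1).to(device)
y = 2 * x + 1

# Model parameters and data now live on the same device.
model = LinearRegression().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print("first loss: {:.4f}, last loss: {:.4f}".format(losses[0], losses[-1]))
```

The same script runs unchanged on CPU-only and GPU machines because every tensor and the model take their placement from the single device variable.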

