Deep Learning: CUDA Operations in PyTorch


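CUDA (Compute Unified Device Architecture) is NVIDIA's heterogeneous computing platform. PyTorch provides a dedicated module, torch.cuda, for configuring and running CUDA-related operations. The local environment used here is Windows 10, Python 3.7.8, and CUDA 11.6. PyTorch 1.12.1, the latest stable release at the time of writing, is installed as follows: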

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

1. Check the PyTorch version

import torch
print(torch.__version__)
1.12.1+cu116

2. Check whether a GPU device is available

print(torch.cuda.is_available())
True

3. The default device in PyTorch is the CPU

print("default device: {}".format(torch.Tensor([4,5,6]).device))
default device: cpu

4. Check the number of available CPU devices

import os
print("available cpu devices: {}".format(os.cpu_count()))
available cpu devices: 20

The CPU device count here is the number of logical processors.

5. Check the number of available GPU devices

print("available gpu devices: {}".format(torch.cuda.device_count()))
available gpu devices: 1

6. Get the name of the GPU device

print("gpu device name: {}".format(torch.cuda.get_device_name(torch.device("cuda:0"))))
gpu device name: NVIDIA GeForce GTX 1080 Ti

7. Specify the cpu:0 device with device="cpu:0"

device = torch.Tensor([1,2,3], device="cpu:0").device
print("device type: {}".format(device))
device type: cpu

8. Specify the cpu:0 device with torch.device

cpu1 = torch.device("cpu:0")
print("cpu device: {}:{}".format(cpu1.type, cpu1.index))
cpu device: cpu:0

9. Pass a bare index, which selects a CUDA device by default

gpu = torch.device(0)
print("gpu device: {}:{}".format(gpu.type, gpu.index))
gpu device: cuda:0

10. Specify the cuda:0 device with torch.device("cuda:0")

gpu = torch.device("cuda:0")
print("gpu device: {}:{}".format(gpu.type, gpu.index))
gpu device: cuda:0
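
The queries above can be gathered into one helper. The following is only a minimal sketch: the function name describe_devices and the dictionary keys are made up for illustration and are not part of PyTorch. It only asks for GPU names when CUDA is actually usable, so it should also run on a CPU-only machine.

```python
import os

import torch


def describe_devices():
    """Collect the device facts queried above into one dictionary."""
    info = {
        "torch_version": torch.__version__,
        "cpu_count": os.cpu_count(),  # number of logical processors
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        "gpu_names": [],
    }
    # Only query device names when CUDA is usable.
    if info["cuda_available"]:
        info["gpu_names"] = [
            torch.cuda.get_device_name(i) for i in range(info["gpu_count"])
        ]
    return info


summary = describe_devices()
for key, value in summary.items():
    print("{}: {}".format(key, value))
```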

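By default a Tensor is created on the CPU, but methods such as copy_, to, and cuda can transfer the data of a CPU Tensor to a GPU device. A Tensor can also be created directly on a GPU device. The difference between torch.tensor and torch.Tensor is that torch.tensor accepts a device argument that can name a GPU device, whereas torch.Tensor can only create a Tensor on the CPU and raises an error otherwise.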

import torch

# A tensor built with torch.Tensor is created on the CPU by default
cpu_tensor = torch.Tensor([[1,4,7],[3,6,9],[2,5,8]])
print(cpu_tensor.device)

# Copy cpu_tensor to the GPU with to()
gpu_tensor1 = cpu_tensor.to(torch.device("cuda:0"))
print(gpu_tensor1.device)

# Copy cpu_tensor to the GPU with cuda()
gpu_tensor2 = cpu_tensor.cuda(torch.device("cuda:0"))
print(gpu_tensor2.device)

# Copy the values of gpu_tensor2 back into cpu_tensor in place;
# copy_ returns cpu_tensor itself, so gpu_tensor3 is on the CPU
gpu_tensor3 = cpu_tensor.copy_(gpu_tensor2)
print(gpu_tensor3.device)
print(gpu_tensor3)

The output is as follows:

 cpu
cuda:0
cuda:0
cpu
tensor([[1., 4., 7.],
        [3., 6., 9.],
        [2., 5., 8.]])

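The copy_() method deserves a closer look. Its declaration (body elided) is as follows:

def copy_(self, src, non_blocking=False):
    ......
    return _te.Tensor(*(), **{})

It copies the elements of src into the self tensor and then returns self. In gpu_tensor3 = cpu_tensor.copy_(gpu_tensor2), for example, the contents of gpu_tensor2 on the GPU are copied into cpu_tensor on the CPU, and gpu_tensor3 refers to that same CPU tensor.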
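
To make the "returns self" behaviour concrete, here is a small sketch. The variable names (src, dst, returned) are illustrative, and the source tensor falls back to the CPU when no GPU is present, which is an assumption added so the snippet runs anywhere.

```python
import torch

# Use the GPU as the source device when one exists; otherwise fall back to
# the CPU so the sketch still runs.
src_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

src = torch.tensor([[1.0, 4.0, 7.0], [3.0, 6.0, 9.0]], device=src_device)
dst = torch.zeros(2, 3)  # destination on the CPU, same shape as src

returned = dst.copy_(src)  # in-place copy, possibly across devices

print(returned is dst)               # copy_ hands back the destination itself
print(dst.device)                    # dst stays on the CPU
print(torch.equal(dst, src.cpu()))   # values now match the source
```

Because copy_ writes into an existing tensor, the destination keeps its own device and dtype; only the values change.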

# Create a tensor directly on the GPU with torch.tensor
gpu_tensor1 = torch.tensor([[2,5,8],[1,4,7],[3,6,9]], device=torch.device("cuda:0"))
print(gpu_tensor1.device)

# Create a tensor of random numbers on the GPU
print(torch.rand((3,4), device=torch.device("cuda:0")))

# Create a tensor of zeros on the GPU
print(torch.zeros((2,5), device=torch.device("cuda:0")))

The output is as follows:

 cuda:0
tensor([[0.7061, 0.2161, 0.8219, 0.3354],
        [0.1697, 0.1730, 0.1400, 0.2825],
        [0.1771, 0.0473, 0.8411, 0.2318]], device='cuda:0')
tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]], device='cuda:0')
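
Once one tensor lives on the target device, further tensors can inherit that placement instead of repeating the device argument. The sketch below is illustrative only: it uses torch.zeros_like and Tensor.new_ones, and it falls back to the CPU when CUDA is missing (an assumption made so the snippet runs anywhere).

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

base = torch.rand((3, 4), device=device)

# *_like factories copy the shape, dtype, and device of their argument.
zeros = torch.zeros_like(base)
# new_* methods keep the dtype and device of the tensor they are called on.
ones = base.new_ones((2, 5))

print(base.device, zeros.device, ones.device)
```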

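A Stream is an abstraction for a linear sequence of CUDA commands: commands assigned to a device run in the order in which they were enqueued. Every device has a default Stream, and new Streams can be created with torch.cuda.Stream(). When commands in different Streams interact, their relative order is no longer guaranteed. The following example, which mixes two Streams, can therefore produce incorrect results.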

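import torch

cuda = torch.device("cuda")
# Create a new, non-default stream; A below is created on the default stream
s = torch.cuda.Stream()
A = torch.randn((1,10), device=cuda)
for i in range(100):
    # On the new stream, sum a tensor that was created on the default stream
    with torch.cuda.stream(s):
        # Problem: torch.sum() may run before torch.randn() has finished
        B = torch.sum(A)
        print(B)

The problem with this example is that torch.sum() may start before torch.randn() has finished. To make the commands run in the intended order, the next example fixes this by synchronizing on the default Stream: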

import torch

cuda = torch.device("cuda")
s = torch.cuda.Stream()
A = torch.randn((1,10), device=cuda)
default_stream = torch.cuda.current_stream()
print("Default Stream: {}".format(default_stream))
# Block until all work queued on the stream that creates A has finished
default_stream.synchronize()
for i in range(100):
    # On the new stream, sum a tensor that was created on the default stream
    with torch.cuda.stream(s):
        print("current stream: {}".format(torch.cuda.current_stream()))
        B = torch.sum(A)
        print(B)

The fix is to call default_stream.synchronize(), which waits for the Stream that creates A to finish, before issuing the commands on the new Stream.
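
A host-side synchronize blocks the CPU thread. An alternative is to order the two streams against each other on the device with Stream.wait_stream. The following is a sketch rather than the article's method: the helper name sum_on_side_stream is hypothetical, and the CPU fallback is an assumption added so the snippet also runs without a GPU.

```python
import torch


def sum_on_side_stream(A):
    """Sum A on a separate CUDA stream, ordered after the current stream."""
    if not A.is_cuda:
        # No streams on the CPU: operations already run in program order.
        return torch.sum(A)

    producer = torch.cuda.current_stream(A.device)  # stream that produced A
    side = torch.cuda.Stream(device=A.device)

    # Make the side stream wait for work already queued on the producer.
    side.wait_stream(producer)
    with torch.cuda.stream(side):
        B = torch.sum(A)
    # Tell the caching allocator that A is also in use on the side stream.
    A.record_stream(side)

    # Make later work on the producer stream wait for the sum to finish.
    producer.wait_stream(side)
    return B


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
A = torch.randn((1, 10), device=device)
B = sum_on_side_stream(A)
print(B)
```

Here the ordering is enforced between streams on the GPU, so the Python thread is free to keep queuing work.
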
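In addition, torch.cuda.memory_reserved() returns the amount of memory currently held by the caching allocator (it replaces memory_cached(), which is deprecated), torch.cuda.max_memory_reserved() returns the peak of that amount (replacing max_memory_cached()), and torch.cuda.max_memory_allocated() returns the peak memory occupied by tensors. torch.cuda.empty_cache() releases cached memory that is no longer in use.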
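
These memory counters can be read together as in the sketch below. The helper name cuda_memory_report and its dictionary keys are hypothetical. It uses memory_reserved and max_memory_reserved, the current names for the deprecated memory_cached and max_memory_cached, and returns None when CUDA is unavailable so that it can run on any machine.

```python
import torch


def cuda_memory_report(device=None):
    """Return allocator statistics in bytes, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(device),
        "max_allocated": torch.cuda.max_memory_allocated(device),
        "reserved": torch.cuda.memory_reserved(device),
        "max_reserved": torch.cuda.max_memory_reserved(device),
    }


if torch.cuda.is_available():
    x = torch.rand((1024, 1024), device="cuda")
    print(cuda_memory_report())
    del x
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver

report = cuda_memory_report()
print(report)
```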

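When a computer runs short of physical memory, the operating system may page data from memory out to disk. Pinned (page-locked) memory stays resident in RAM and cannot be paged out. A Tensor can be copied into pinned memory either with the pin_memory function or by calling pin_memory() directly on the Tensor. Why pin memory at all? The sole purpose is speed: copying from pinned CPU memory to the GPU is faster. The is_pinned() method on a Tensor reports whether that Tensor resides in pinned memory.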

import torch
# This pin_memory function is an internal helper used by DataLoader
from torch.utils.data._utils.pin_memory import pin_memory

x = torch.Tensor([[1,2,4], [5, 7, 9], [3, 7, 10]])
# Copy x into pinned memory with the pin_memory() function
y = pin_memory(x)
# Copy the tensor into pinned memory by calling pin_memory() on it directly
z = x.pin_memory()
# id() returns the identity of each Python object; pin_memory() returns a copy of the tensor, so the ids differ
print("id: {}".format(id(x)))
print("id: {}".format(id(y)))
print("id: {}".format(id(z)))
# Once a tensor is in pinned memory, it can be copied to the GPU asynchronously
a = z.cuda(non_blocking=True)
print(a)
print("is_pinned: {}/{}".format(x.is_pinned(), z.is_pinned()))

The output is as follows:

 id: 1605289350472
id: 1605969660408
id: 1605969660248
tensor([[1.,  2.,  4.],
        [5.,  7.,  9.],
        [3.,  7., 10.]], device='cuda:0')
is_pinned: False/True

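Note: id() shows the identity of the Python tensor object (its address in CPython), not the address of the tensor's underlying data.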
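
Since id() identifies the Python wrapper object rather than the data buffer, Tensor.data_ptr() is the more direct way to tell whether two tensors share memory. The sketch below is illustrative: the variable names are made up, and the pinned-memory part is guarded because pinning requires a CUDA-enabled setup.

```python
import torch

x = torch.tensor([[1.0, 2.0, 4.0], [5.0, 7.0, 9.0], [3.0, 7.0, 10.0]])

flat = x.view(-1)    # new Python object that shares x's storage
cloned = x.clone()   # new Python object with its own storage

print(flat is x, flat.data_ptr() == x.data_ptr())
print(cloned.data_ptr() == x.data_ptr())

# Pinning needs a CUDA-enabled setup, so this part is guarded.
if torch.cuda.is_available():
    pinned = x.pin_memory()
    print(pinned.is_pinned(), pinned.data_ptr() != x.data_ptr())
    on_gpu = pinned.cuda(non_blocking=True)     # asynchronous host-to-device copy
    torch.cuda.current_stream().synchronize()   # wait for the copy to finish
    print(torch.equal(on_gpu.cpu(), x))
```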

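Automatic device awareness simply means using the GPU when one is present and falling back to the CPU otherwise, so that a single codebase runs on both CPU and GPU devices. Whether a GPU is present is determined with torch.cuda.is_available(). A common pattern is:

import torch

device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda")
a = torch.tensor([1,2,3], device=device)
print(a)

The output is as follows: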

tensor([1, 2, 3], device='cuda:0')

Calling to() on a Module object moves the model to the GPU device as well:

class LinearRegression(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1, 1)
    def forward(self, x):
        return self.linear(x)
regression = LinearRegression().to(device=device)
for param in regression.parameters():
    print(param)

The printed parameters are all tensors with device='cuda:0', which confirms that to() moved the model to the GPU device.
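
Moving the model is only half of the picture: the inputs have to be on the same device before the forward pass. A minimal sketch, assuming a plain torch.nn.Linear layer stands in for the model and using made-up input values:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1, 1).to(device)

# Inputs created on the CPU must be moved to the model's device before the
# forward pass; mixing devices raises a RuntimeError.
inputs = torch.tensor([[1.0], [2.0], [3.0]]).to(device)
outputs = model(inputs)

print(outputs.device)
print(outputs.shape)
```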



2022-08-24
