Deep Learning: Multi-GPU Training in PyTorch with DistributedDataParallel

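PyTorch offers two approaches to multi-GPU training: DataParallel (DP) and DistributedDataParallel (DDP). DataParallel is the simplest way to use several GPUs on a single machine, but it is built on multithreading and cannot be used across multiple machines. This article therefore covers DistributedDataParallel. Unlike DP, DDP uses multiple processes rather than multiple threads, so it does not suffer from GIL contention, and it scales to multi-node, multi-GPU setups, which makes it the preferred choice for distributed multi-GPU training.

Versions used here: Python 3.8, PyTorch 1.11, CUDA 11.4.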

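As shown in the figure above, each GPU holds a replica of the model and is assigned a subset of the data samples based on the number of available GPUs.

With 100 samples and 4 GPUs, for example, each GPU handles 25 of them in every pass over the dataset.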
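The split described above can be sketched in plain Python. This is a simplified illustration of how samples are dealt out to ranks without shuffling, not PyTorch's own code; the function name `shard_indices` is hypothetical, and the sketch assumes the dataset has at least as many samples as there are GPUs.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Indices that one rank would see in a single pass (no shuffling)."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    # Pad by repeating samples from the start so every rank gets the same count.
    indices += indices[: total - num_samples]
    # Deal the indices out round-robin: rank r takes r, r + world_size, ...
    return indices[rank:total:world_size]


for rank in range(4):
    shard = shard_indices(100, 4, rank)
    print(f"rank {rank}: {len(shard)} samples, first three {shard[:3]}")
```

When the dataset size is not divisible by the number of GPUs, a few samples are repeated so that every process runs the same number of iterations.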

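DDP synchronizes in the constructor, during the forward pass, and during the backward pass. In the backward pass, the gradients are averaged and the average is made available on every GPU.

For further details on synchronization, see the official PyTorch tutorial: Writing Distributed Applications with PyTorch.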

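To fork multiple processes, we use the torch.multiprocessing framework. Each spawned process receives its process index, commonly called the rank, as the first argument.

In the example below, the spawned processes that call the method have rank values from 0 to 3 (assuming four GPUs). The rank identifies each process, and PyTorch treats the process with rank = 0 as the base process.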

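 import torch
 import torch.multiprocessing as mp
 
 # One process per GPU, so the world size equals the number of GPUs.
 world_size = torch.cuda.device_count()
 # mp.spawn calls the target as fn(rank, *args), with rank in [0, nprocs).
 mp.spawn(<selfcontainedmethodforeachproc>, nprocs=world_size, args=(args,))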

Assign a GPU to each process spawned for training.

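 import torch
 import torch.distributed as dist
 
 def train(self, rank, args):
     # Bind this process to the GPU whose index matches its rank.
     current_gpu_index = rank
     torch.cuda.set_device(current_gpu_index)
 
     # 'env://' reads MASTER_ADDR and MASTER_PORT from the environment,
     # so both must be set before the processes are spawned.
     dist.init_process_group(
         backend='nccl',
         world_size=args.world_size,
         rank=current_gpu_index,
         init_method='env://'
     )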
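With `init_method='env://'`, the process group finds its rendezvous point through the `MASTER_ADDR` and `MASTER_PORT` environment variables. The article does not show where these are set, so the following is only a minimal sketch for a single machine. The helper name `configure_rendezvous` and the default address and port are assumptions; any free port works.

```python
import os


def configure_rendezvous(addr="127.0.0.1", port=29500):
    """Set the variables that 'env://' reads, keeping values that already exist."""
    # setdefault leaves values exported by a launcher or the shell untouched.
    os.environ.setdefault("MASTER_ADDR", addr)
    os.environ.setdefault("MASTER_PORT", str(port))
    return os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]


print(configure_rendezvous())
```

Calling this in the parent process before `mp.spawn` is enough on one machine, because the spawned processes inherit the environment. On multiple machines, `MASTER_ADDR` has to point to the node that hosts rank 0.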

For image data we use the standard ImageFolder loader, which expects the samples in the following layout.

 <basedir>/testset/<categoryname>/<listofimages>
 <basedir>/valset/<categoryname>/<listofimages>
 <basedir>/trainset/<categoryname>/<listofimages>

Next we configure the DataLoader, starting with the dataset:

 import os
 from torchvision.datasets import ImageFolder
 
 # train_transform is the torchvision transform pipeline for the training images.
 train_dataset = ImageFolder(root=os.path.join(<basedir>, "trainset"), transform=train_transform)

When DistributedSampler is used with DDP, it gives each process/GPU its own subset of the data.

 from torch.utils.data import DistributedSampler
 
 # num_replicas is the number of processes (4 in this example).
 dist_train_samples = DistributedSampler(dataset=train_dataset, num_replicas=args.world_size, rank=rank, seed=17)

Combine the DistributedSampler with the DataLoader:

 from torch.utils.data import DataLoader
 
 train_loader = DataLoader(
     train_dataset,
     batch_size=self.BATCH_SIZE,
     num_workers=4,
     sampler=dist_train_samples,
     pin_memory=True,
 )

For multi-GPU training, after the model is initialized it also has to be placed on the GPU assigned to each process.

 import torch.nn as nn
 import torch.optim as optim
 from torch.nn.parallel import DistributedDataParallel as DDP
 from torchvision import models
 
 model = models.resnet34(pretrained=True)
 loss_fn = nn.CrossEntropyLoss()
 
 # Move the model to this process's GPU before wrapping it in DDP.
 model.cuda(current_gpu_index)
 model = DDP(model, device_ids=[current_gpu_index])
 
 loss_fn.cuda(current_gpu_index)
 
 # Optimize only the parameters that require gradients.
 optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.module.parameters()), lr=1e-3)
 scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7)

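At the start of every epoch, set the epoch on the DistributedSampler. The sampler seeds its shuffle with the seed plus the epoch number, so the data is reshuffled between epochs while all processes still agree on the same ordering within an epoch.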

 for epoch in range(1, self.EPOCHS+1):
     dist_train_samples.set_epoch(epoch)
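The effect of `set_epoch` can be checked without any GPU, because `DistributedSampler` needs no process group when `num_replicas` and `rank` are passed explicitly. The sketch below uses a list of 100 integers as a stand-in dataset (an assumption; the sampler only needs its length), and the helper `epoch_indices` is hypothetical.

```python
from torch.utils.data import DistributedSampler

dataset = list(range(100))  # stand-in dataset; only len(dataset) is used
WORLD_SIZE = 4


def epoch_indices(rank, epoch):
    """Indices one rank draws in the given epoch."""
    sampler = DistributedSampler(
        dataset, num_replicas=WORLD_SIZE, rank=rank, shuffle=True, seed=17
    )
    sampler.set_epoch(epoch)
    return list(sampler)


first = epoch_indices(rank=0, epoch=1)
second = epoch_indices(rank=0, epoch=2)
print("rank 0 order changes between epochs:", first != second)

# Within one epoch the ranks split a single shared permutation.
covered = sorted(i for r in range(WORLD_SIZE) for i in epoch_indices(r, 1))
print("all samples covered exactly once:", covered == dataset)
```

If `set_epoch` is never called, the sampler keeps using epoch 0 and every epoch sees the data in the same order.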

For each batch from the DataLoader, move the inputs to the GPU and compute the gradients.

 for cur_iter_data in (loaders["train"]):
     inputs, labels = cur_iter_data
     inputs = inputs.cuda(current_gpu_index, non_blocking=True)
     labels = labels.cuda(current_gpu_index, non_blocking=True)
 
     optimizer.zero_grad(set_to_none=True)
     with torch.set_grad_enabled(phase == 'train'):
         outputs = model(inputs)
         _, preds = torch.max(outputs, 1)
         loss = loss_fn(outputs, labels)
         loss.backward()
         optimizer.step()
 
 scheduler.step()

Compare the accuracy against previous epochs and, if it has improved, save the model weights.

 # Only one process writes the checkpoint, so the file is not written concurrently.
 if rank % args.n_gpus == 0:
     torch.save(model.module.state_dict(), os.path.join(os.getcwd(), "scripts/model", args.model_file_name))
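The snippet above shows the rank check but not the accuracy comparison. One way to combine the two is sketched below in plain Python. The function name `checkpoint_epochs` and the accuracy values are made up for illustration, and the sketch assumes every process sees the same validation accuracy.

```python
def checkpoint_epochs(accuracies, rank, n_gpus):
    """Return the epochs at which this process would write a checkpoint."""
    best = float("-inf")
    saved = []
    for epoch, accuracy in enumerate(accuracies, start=1):
        if accuracy > best:
            best = accuracy
            # Same rank check as in the article: only one process writes.
            if rank % n_gpus == 0:
                saved.append(epoch)
    return saved


history = [0.61, 0.70, 0.68, 0.74]  # illustrative validation accuracies
print(checkpoint_epochs(history, rank=0, n_gpus=4))
print(checkpoint_epochs(history, rank=1, n_gpus=4))
```

In a real training loop the `saved.append(epoch)` line would be replaced by the `torch.save` call.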

When training finishes, the model weights are saved to a '.pth' file, which can then be loaded on a CPU or GPU for inference.

Load the model from the file:

 load_path = os.path.join(os.getcwd(), "scripts/model", args.model_file_name)
 model_image_classifier = ImageClassifier()
 # map_location places the saved tensors on this process's GPU.
 state_dict = torch.load(load_path, map_location=f"cuda:{current_gpu_index}")
 model_image_classifier.load_state_dict(state_dict, strict=False)
 model_image_classifier.cuda(current_gpu_index)
 model_image_classifier = DDP(model_image_classifier)
 
 model_image_classifier = model_image_classifier.eval()
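Because the checkpoint holds the unwrapped `model.module` weights, it can also be loaded on a machine without any GPU. The following is a minimal sketch of that CPU path. `TinyClassifier` is a hypothetical stand-in for the article's `ImageClassifier`, and the file is written to a temporary directory rather than `scripts/model`.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the real model class."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


# Save a state_dict the same way the training code does.
path = os.path.join(tempfile.mkdtemp(), "model.pth")
trained = TinyClassifier()
torch.save(trained.state_dict(), path)

# Load on CPU: map_location remaps tensors that were saved from a GPU.
restored = TinyClassifier()
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

with torch.no_grad():
    sample = torch.ones(1, 4)
    outputs_match = torch.equal(trained(sample), restored(sample))
print("restored model reproduces the outputs:", outputs_match)
```

Here `load_state_dict` runs with its default `strict=True`, so a mismatch between the checkpoint and the model raises an error instead of being skipped silently.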

The model can now be used with the usual inference procedure.

That covers the basics of PyTorch's DistributedDataParallel, which works for both single-node multi-GPU and multi-node multi-GPU training.

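In DDP, once each process has computed its gradients, the gradients are averaged across processes with an all-reduce. Every process ends up with the same averaged gradient without routing it through the rank = 0 process, and each process then uses it to update its own parameters independently. Because the replicas start from identical parameters (they are broadcast once at initialization) and apply identical gradients at every step, the model parameters stay identical across processes.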
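The argument above can be illustrated with a small simulation in plain Python. This is a conceptual sketch only: it models four replicas as lists of floats, and the function names and gradient values are made up. It does not use PyTorch's communication primitives.

```python
def all_reduce_mean(local_grads):
    """Give every worker the element-wise mean of all workers' gradients."""
    world_size = len(local_grads)
    mean = [sum(values) / world_size for values in zip(*local_grads)]
    return [list(mean) for _ in range(world_size)]


def sgd_step(params, grads, lr):
    """Plain gradient descent update applied locally by one worker."""
    return [p - lr * g for p, g in zip(params, grads)]


# Four replicas that start from identical parameters.
replicas = [[1.0, -2.0] for _ in range(4)]
# Each replica computes a different gradient on its own shard of the data.
local_grads = [[1.0, 0.0], [0.0, 2.0], [2.0, 0.0], [1.0, 2.0]]

synced_grads = all_reduce_mean(local_grads)
replicas = [sgd_step(p, g, lr=0.5) for p, g in zip(replicas, synced_grads)]
print(replicas)
```

Each replica applies the update on its own, yet all four finish the step with the same parameters, which is why DDP never needs to send the parameters themselves after initialization.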

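DP, in contrast, gathers the gradients on GPU 0, updates the parameters there, and then broadcasts the parameters to the remaining GPUs. DP maintains a single optimizer throughout: the gradients from all GPUs are accumulated, the parameter update runs on the primary GPU, and the updated model parameters are then broadcast to the other GPUs.

Compared with DP, DDP therefore transfers less data, which makes it faster and more efficient. If you have used DP, you will also have noticed that GPU 0 is always more heavily utilized than the other GPUs: it does extra work, which further reduces efficiency. DDP is therefore recommended for multi-GPU training. If the setup is simple, for example two GPUs and no need for multiple machines, DP requires the fewest code changes and can serve as a stopgap.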

https://avoid.overfit.cn/post/278382575559496e844634b6671330e4

Author: Kaustav Mandal
