Dive Into MindSpore – Distributed Training With GPU For Model Training
MindSpore 易点通·精讲系列 – Model Training: GPU Distributed Parallel Training


Development environment used in this article:

Ubuntu 20.04
Python 3.8
MindSpore 1.7.0
OpenMPI 4.0.3
GTX 1080Ti * 4

Contents at a glance:

Basics
Environment Setup
Single-GPU Training
Multi-GPU Training – OpenMPI
Multi-GPU Training – Without OpenMPI
Summary
Problems Encountered
References

1. Basics

1.1 Concepts

In deep learning, as models and datasets keep growing, there are many situations where training has to move to multiple GPUs on one machine, or to multiple machines with multiple GPUs, i.e. distributed training. By how the work is parallelized, distributed training strategies can be roughly divided into two kinds: data parallelism and model parallelism.

Data parallelism: every GPU keeps an identical copy of the model, different slices of the data are assigned to different GPUs for computation, and the results from all GPUs are then combined, thereby speeding up model training.

Model parallelism: unlike data parallelism, model parallelism in distributed training means splitting the whole neural network across different GPUs, with each GPU responsible for computing a different part of the model. It is usually adopted only when the model is so large that a single GPU's memory cannot hold the entire network.
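To make the data-parallel idea concrete, here is a minimal toy sketch in plain NumPy (not MindSpore API, and not from the original workflow): the global batch is split across simulated devices, every "device" holds the same weights, computes a gradient on its own shard, and the gradients are averaged before one shared update. The linear model, num_devices and learning rate are illustrative assumptions only.

# Toy sketch of data parallelism: same weights everywhere, different data shards,
# gradients averaged (the job AllReduce does in a real framework) before the update.
import numpy as np

num_devices = 4
np.random.seed(0)

w = np.random.randn(8, 1)            # the shared model weights (replicated on every device)
x = np.random.randn(32, 8)           # global batch of 32 samples
y = np.random.randn(32, 1)

x_shards = np.split(x, num_devices)  # each device receives 32 / 4 = 8 samples
y_shards = np.split(y, num_devices)

def local_grad(w, xb, yb):
    # gradient of the mean squared error of a linear model pred = xb @ w
    pred = xb @ w
    return 2.0 * xb.T @ (pred - yb) / len(xb)

grads = [local_grad(w, xb, yb) for xb, yb in zip(x_shards, y_shards)]
avg_grad = sum(grads) / num_devices  # average the per-device gradients
w = w - 0.01 * avg_grad              # every device applies the identical update

Because every device starts from the same weights and applies the same averaged gradient, the replicas stay in sync after each step, which is exactly what the data-parallel mode described below relies on.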

1.2 Support in MindSpore

Section 1.1 introduced the parallelization strategies in theory. In the MindSpore framework specifically, the following four parallel modes are currently supported:

Data parallel: for networks whose parameters can be computed on a single card. This mode replicates the same network parameters on every card and feeds each card different training data; it fits most users.
Semi-automatic parallel: for networks that cannot be computed on a single card, when the user also has strong requirements on partitioning performance. The user sets this mode and manually specifies the sharding strategy of every operator to reach better training performance.
Automatic parallel: for networks that cannot be computed on a single card, when the user does not know how to configure operator strategies. In this mode MindSpore automatically configures a strategy for every operator; it fits users who want parallel training but do not know how to configure it.
Hybrid parallel: the parallel training logic and implementation are entirely designed by the user, who can place communication operators such as AllGather in the network directly. It fits users who are already familiar with parallel training.

For most users, data parallelism is the mode they will actually use, so the examples below walk through the data parallel mode.

2. Environment Setup

2.1 Installing MindSpore

Omitted here. You can refer to my earlier article "MindSpore 入门–基于 GPU 服务器安装 MindSpore 1.5.0"; just bump the MindSpore version in that article to 1.7.0.

2.2 Installing OpenMPI

On the GPU hardware platform, MindSpore uses OpenMPI's mpirun for distributed training, so we install OpenMPI first. This article installs version 4.0.3 with the following commands:

wget -c https://download.open-mpi.org…
tar xf openmpi-4.0.3.tar.gz
cd openmpi-4.0.3/
./configure --prefix=/usr/local/openmpi-4.0.3
make -j 16
sudo make install
echo -e "export PATH=/usr/local/openmpi-4.0.3/bin:$PATH" >> ~/.bashrc
echo -e "export LD_LIBRARY_PATH=/usr/local/openmpi-4.0.3/lib:$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc
Use the mpirun --version command to verify that the installation succeeded; the output should look like this:

mpirun (Open MPI) 4.0.3

Report bugs to http://www.open-mpi.org/commu…
2.3 Verifying the Environment

With the base environment installed, let's run a quick check to see whether the setup actually works. The verification code is as follows:

# nccl_allgather.py
import numpy as np
import mindspore.ops as ops
import mindspore.nn as nn
from mindspore import context, Tensor
from mindspore.communication import init, get_rank

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.allgather = ops.AllGather()

    def construct(self, x):
        return self.allgather(x)


if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    init("nccl")
    value = get_rank()
    input_x = Tensor(np.array([[value]]).astype(np.float32))
    net = Net()
    output = net(input_x)
    print(output)

Save the code above to nccl_allgather.py and run the command below. A note on the command: the number after -n is how many GPUs to use; here all GPUs in the machine are used. If you do not want to use all of them, remember to set the CUDA_VISIBLE_DEVICES environment variable accordingly.

mpirun -n 4 python3 nccl_allgather.py
The output is as follows:

[[0.
[1.]
[2.]
[3.]]
[[0.]
[1.]
[2.]
[3.]]
[[0.]
[1.]
[2.]
[3.]]
[[0.]
[1.]
[2.]
[3.]]
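As a small extra check (not part of the original article), the same harness can be used to exercise AllReduce, the collective that data-parallel training relies on for gradient averaging; the AllReduce fusion warnings in the multi-GPU training logs later come from exactly this operator. A hedged variant of the script above, launched with the same mpirun command:

# nccl_allreduce.py - AllReduce variant of the check above (sketch only)
import numpy as np
import mindspore.ops as ops
import mindspore.nn as nn
from mindspore import context, Tensor
from mindspore.communication import init, get_rank


class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.allreduce = ops.AllReduce()   # default reduction is SUM

    def construct(self, x):
        return self.allreduce(x)


if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    init("nccl")
    value = get_rank()
    input_x = Tensor(np.array([[value]]).astype(np.float32))
    print(Net()(input_x))   # with 4 processes every rank should print [[6.]] (0+1+2+3)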
With that, our environment is set up and verified.

3. Single-GPU Training

To have a baseline for the later comparisons, we first train on a single GPU.

3.1 Code

Notes on the code:

The network is ResNet-50. You can grab the implementation from the MindSpore Models repository and paste it in directly (ResNet-50 code link).
The dataset is Fruit-360 (dataset download link). For a more detailed introduction to it, see my earlier article "MindSpore 易点通·精讲系列–数据集加载之 ImageFolderDataset".
Remember to replace train_dataset_dir and test_dataset_dir in the code with your own directories.

The single-GPU training code is as follows:

import numpy as np

from mindspore import context
from mindspore import nn
from mindspore.common import dtype as mstype
from mindspore.common import set_seed
from mindspore.common import Tensor
from mindspore.communication import init, get_rank, get_group_size
from mindspore.dataset import ImageFolderDataset
from mindspore.dataset.transforms.c_transforms import Compose, TypeCast
from mindspore.dataset.vision.c_transforms import HWC2CHW, Normalize, RandomCrop, RandomHorizontalFlip, Resize
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore.nn.optim import Momentum
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore.train import Model
from mindspore.train.callback import CheckpointConfig, ModelCheckpoint, LossMonitor
from scipy.stats import truncnorm

# define resnet50: paste the ResNet-50 implementation from the MindSpore Models repository here

def create_dataset(dataset_dir, mode="train", decode=True, batch_size=32, repeat_num=1):
    # shuffle only the training split
    if mode == "train":
        shuffle = True
    else:
        shuffle = False

    dataset = ImageFolderDataset(dataset_dir=dataset_dir, shuffle=shuffle, decode=decode)

    mean = [127.5, 127.5, 127.5]
    std = [127.5, 127.5, 127.5]
    if mode == "train":
        transforms_list = Compose([RandomCrop((32, 32), (4, 4, 4, 4)),
                                   RandomHorizontalFlip(),
                                   Resize((100, 100)),
                                   Normalize(mean, std),
                                   HWC2CHW()])
    else:
        transforms_list = Compose([Resize((128, 128)),
                                   Normalize(mean, std),
                                   HWC2CHW()])

    cast_op = TypeCast(mstype.int32)

    dataset = dataset.map(operations=transforms_list, input_columns="image")
    dataset = dataset.map(operations=cast_op, input_columns="label")
    dataset = dataset.batch(batch_size=batch_size, drop_remainder=True)
    dataset = dataset.repeat(repeat_num)

    return dataset


def run_train():
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    set_seed(0)

    train_dataset_dir = "/mnt/data_0002_24t/xingchaolong/dataset/Fruits_360/fruits-360_dataset/fruits-360/Training"
    test_dataset_dir = "/mnt/data_0002_24t/xingchaolong/dataset/Fruits_360/fruits-360_dataset/fruits-360/Test"
    batch_size = 32

    train_dataset = create_dataset(dataset_dir=train_dataset_dir, batch_size=batch_size)
    test_dataset = create_dataset(dataset_dir=test_dataset_dir, mode="test")
    train_batch_num = train_dataset.get_dataset_size()
    test_batch_num = test_dataset.get_dataset_size()
    print("train dataset batch num: {}".format(train_batch_num), flush=True)
    print("test dataset batch num: {}".format(test_batch_num), flush=True)

    # build model
    net = resnet50(class_num=131)
    loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    optim = Momentum(params=net.trainable_params(), learning_rate=0.01, momentum=0.9, loss_scale=1024.0)
    model = Model(net, loss_fn=loss, optimizer=optim, metrics={"accuracy"})

    # CheckPoint CallBack definition
    config_ck = CheckpointConfig(save_checkpoint_steps=train_batch_num, keep_checkpoint_max=35)
    ckpoint_cb = ModelCheckpoint(prefix="fruit_360_renet50", directory="./ckpt/", config=config_ck)
    # LossMonitor is used to print loss value on screen
    loss_cb = LossMonitor()

    # model train
    model.train(10, train_dataset, callbacks=[ckpoint_cb, loss_cb], dataset_sink_mode=True)

    # model eval
    result = model.eval(test_dataset)
    print("eval result: {}".format(result), flush=True)


def main():
    run_train()


if __name__ == "__main__":
    main()

3.2 Training

Save the code to gpu_single_train.py and train with the following commands:

export CUDA_VISIBLE_DEVICES=0
python3 gpu_single_train.py
The training output looks like this:

train dataset batch num: 2115
test dataset batch num: 709
epoch: 1 step: 2115, loss is 4.219570636749268
epoch: 2 step: 2115, loss is 3.7109947204589844
……
epoch: 9 step: 2115, loss is 2.66499400138855
epoch: 10 step: 2115, loss is 2.540522336959839
eval result: {'accuracy': 0.676348730606488}
Use the tree ckpt command to check the checkpoint directory; the output is as follows:

ckpt/
├── fruit_360_renet50-10_2115.ckpt
├── fruit_360_renet50-1_2115.ckpt
……
├── fruit_360_renet50-9_2115.ckpt
└── fruit_360_renet50-graph.meta
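As a side note (not something the original workflow does), a checkpoint such as the ones listed above can be loaded back for evaluation or inference. Below is a minimal sketch, assuming resnet50 and create_dataset are the ones defined in the training script above and that the checkpoint file name matches your own run:

# Sketch: restore a trained checkpoint and re-run evaluation.
# resnet50 and create_dataset are assumed to come from the training script above.
from mindspore import nn
from mindspore.train import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

test_dataset_dir = "/path/to/fruits-360/Test"   # replace with your own directory

net = resnet50(class_num=131)
param_dict = load_checkpoint("./ckpt/fruit_360_renet50-10_2115.ckpt")
load_param_into_net(net, param_dict)

loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
model = Model(net, loss_fn=loss, metrics={"accuracy"})
test_dataset = create_dataset(dataset_dir=test_dataset_dir, mode="test")
print(model.eval(test_dataset))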

4. Multi-GPU Training – OpenMPI

Now let's walk through a concrete example of distributed training on the GPU platform using OpenMPI.

4.1 Code

Notes on the code:

For the network, dataset and directories, the first three points in section 3.1 still apply.
Compared with single-GPU training, the main changes are in dataset loading and in the context setup.
Dataset loading: num_shards and shard_id must be specified so that every card reads its own shard of the training data; see the code for details.
Context setup: this covers parameter consistency and the parallel mode. Parameter consistency is handled here with set_seed; the parallel mode is set through the set_auto_parallel_context method and its parallel_mode parameter.

The multi-GPU training code is as follows:

import numpy as np

from mindspore import context
from mindspore import nn
from mindspore.common import dtype as mstype
from mindspore.common import set_seed
from mindspore.common import Tensor
from mindspore.communication import init, get_rank, get_group_size
from mindspore.dataset import ImageFolderDataset
from mindspore.dataset.transforms.c_transforms import Compose, TypeCast
from mindspore.dataset.vision.c_transforms import HWC2CHW, Normalize, RandomCrop, RandomHorizontalFlip, Resize
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore.nn.optim import Momentum
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore.train import Model
from mindspore.train.callback import CheckpointConfig, ModelCheckpoint, LossMonitor
from scipy.stats import truncnorm

# define resnet50: paste the ResNet-50 implementation from the MindSpore Models repository here

def create_dataset(dataset_dir, mode="train", decode=True, batch_size=32, repeat_num=1):
    # the training split is shuffled and sharded across cards;
    # evaluation reads the full dataset on every card
    if mode == "train":
        shuffle = True
        rank_id = get_rank()
        rank_size = get_group_size()
    else:
        shuffle = False
        rank_id = None
        rank_size = None

    dataset = ImageFolderDataset(dataset_dir=dataset_dir, shuffle=shuffle, decode=decode, num_shards=rank_size, shard_id=rank_id)

    mean = [127.5, 127.5, 127.5]
    std = [127.5, 127.5, 127.5]
    if mode == "train":
        transforms_list = Compose([RandomCrop((32, 32), (4, 4, 4, 4)),
                                   RandomHorizontalFlip(),
                                   Resize((100, 100)),
                                   Normalize(mean, std),
                                   HWC2CHW()])
    else:
        transforms_list = Compose([Resize((128, 128)),
                                   Normalize(mean, std),
                                   HWC2CHW()])

    cast_op = TypeCast(mstype.int32)

    dataset = dataset.map(operations=transforms_list, input_columns="image")
    dataset = dataset.map(operations=cast_op, input_columns="label")
    dataset = dataset.batch(batch_size=batch_size, drop_remainder=True)
    dataset = dataset.repeat(repeat_num)

    return dataset


def run_train():
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    init("nccl")
    rank_id = get_rank()
    rank_size = get_group_size()
    print("rank size: {}, rank id: {}".format(rank_size, rank_id), flush=True)
    set_seed(0)
    context.set_auto_parallel_context(device_num=rank_size, gradients_mean=True, parallel_mode=context.ParallelMode.DATA_PARALLEL)

    train_dataset_dir = "/mnt/data_0002_24t/xingchaolong/dataset/Fruits_360/fruits-360_dataset/fruits-360/Training"
    test_dataset_dir = "/mnt/data_0002_24t/xingchaolong/dataset/Fruits_360/fruits-360_dataset/fruits-360/Test"
    batch_size = 32

    # keep the global batch size at 32 by giving every card batch_size // rank_size samples per step
    train_dataset = create_dataset(dataset_dir=train_dataset_dir, batch_size=batch_size//rank_size)
    test_dataset = create_dataset(dataset_dir=test_dataset_dir, mode="test")
    train_batch_num = train_dataset.get_dataset_size()
    test_batch_num = test_dataset.get_dataset_size()
    print("train dataset batch num: {}".format(train_batch_num), flush=True)
    print("test dataset batch num: {}".format(test_batch_num), flush=True)

    # build model
    net = resnet50(class_num=131)
    loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    optim = Momentum(params=net.trainable_params(), learning_rate=0.01, momentum=0.9, loss_scale=1024.0)
    model = Model(net, loss_fn=loss, optimizer=optim, metrics={"accuracy"})

    # CheckPoint CallBack definition (one checkpoint prefix per rank)
    config_ck = CheckpointConfig(save_checkpoint_steps=train_batch_num, keep_checkpoint_max=35)
    ckpoint_cb = ModelCheckpoint(prefix="fruit_360_renet50_{}".format(rank_id), directory="./ckpt/", config=config_ck)
    # LossMonitor is used to print loss value on screen
    loss_cb = LossMonitor()

    # model train
    model.train(10, train_dataset, callbacks=[ckpoint_cb, loss_cb], dataset_sink_mode=True)

    # model eval
    result = model.eval(test_dataset)
    print("eval result: {}".format(result), flush=True)


def main():
    run_train()


if __name__ == "__main__":
    main()
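One practical observation before running it: because the ModelCheckpoint callback is registered in every process, each rank writes its own copy of the checkpoints (the ckpt/ listings below show one set per rank). In pure data-parallel training with gradients_mean=True and a fixed seed, the weights on all ranks stay synchronized after every step, so a common alternative, shown here only as a hedged sketch and not what this article's script does, is to register the checkpoint callback on rank 0 only:

# Sketch: save checkpoints only on rank 0; every rank still prints the loss.
callbacks = [LossMonitor()]
if get_rank() == 0:
    config_ck = CheckpointConfig(save_checkpoint_steps=train_batch_num, keep_checkpoint_max=35)
    callbacks.append(ModelCheckpoint(prefix="fruit_360_renet50", directory="./ckpt/", config=config_ck))
model.train(10, train_dataset, callbacks=callbacks, dataset_sink_mode=True)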

4.2 Training

Now let's actually run multi-GPU training.

4.2.1 Training on 4 GPUs

Start 4-GPU training with the following commands:

export CUDA_VISIBLE_DEVICES=0,1,2,3
mpirun -n 4 python3 gpu_distributed_train.py
During training, the output looks like this:

rank size: 4, rank id: 0
rank size: 4, rank id: 1
rank size: 4, rank id: 2
rank size: 4, rank id: 3
train dataset batch num: 2115
test dataset batch num: 709
train dataset batch num: 2115
test dataset batch num: 709
train dataset batch num: 2115
test dataset batch num: 709
train dataset batch num: 2115
test dataset batch num: 709
[WARNING] PRE_ACT(294248,7fa67e831740,python3):2022-07-13-17:11:24.528.381 [mindspore/ccsrc/backend/common/pass/communication_op_fusion.cc:198] GetAllReduceSplitSegment] Split threshold is 0. AllReduce nodes will take default fusion strategy.
[WARNING] PRE_ACT(294245,7f57993a5740,python3):2022-07-13-17:11:26.176.114 [mindspore/ccsrc/backend/common/pass/communication_op_fusion.cc:198] GetAllReduceSplitSegment] Split threshold is 0. AllReduce nodes will take default fusion strategy.
[WARNING] PRE_ACT(294247,7f36f889b740,python3):2022-07-13-17:11:30.475.177 [mindspore/ccsrc/backend/common/pass/communication_op_fusion.cc:198] GetAllReduceSplitSegment] Split threshold is 0. AllReduce nodes will take default fusion strategy.
[WARNING] PRE_ACT(294246,7f5f1820c740,python3):2022-07-13-17:11:31.271.259 [mindspore/ccsrc/backend/common/pass/communication_op_fusion.cc:198] GetAllReduceSplitSegment] Split threshold is 0. AllReduce nodes will take default fusion strategy.
epoch: 1 step: 2115, loss is 4.536644458770752
epoch: 1 step: 2115, loss is 4.347061634063721
epoch: 1 step: 2115, loss is 4.557111740112305
epoch: 1 step: 2115, loss is 4.467658519744873
……
epoch: 10 step: 2115, loss is 3.263073205947876
epoch: 10 step: 2115, loss is 3.169656753540039
epoch: 10 step: 2115, loss is 3.2040905952453613
epoch: 10 step: 2115, loss is 3.812671184539795
eval result: {'accuracy': 0.48113540197461213}
eval result: {'accuracy': 0.5190409026798307}
eval result: {'accuracy': 0.4886283497884344}
eval result: {'accuracy': 0.5010578279266573}
Use the tree ckpt command to check the checkpoint directory; the output is as follows:

ckpt/
├── fruit_360_renet50_0-10_2115.ckpt
├── fruit_360_renet50_0-1_2115.ckpt
├── fruit_360_renet50_0-2_2115.ckpt
├── fruit_360_renet50_0-3_2115.ckpt
├── fruit_360_renet50_0-4_2115.ckpt
├── fruit_360_renet50_0-5_2115.ckpt
├── fruit_360_renet50_0-6_2115.ckpt
├── fruit_360_renet50_0-7_2115.ckpt
├── fruit_360_renet50_0-8_2115.ckpt
├── fruit_360_renet50_0-9_2115.ckpt
├── fruit_360_renet50_0-graph.meta
……
├── fruit_360_renet50_3-10_2115.ckpt
├── fruit_360_renet50_3-1_2115.ckpt
├── fruit_360_renet50_3-2_2115.ckpt
├── fruit_360_renet50_3-3_2115.ckpt
├── fruit_360_renet50_3-4_2115.ckpt
├── fruit_360_renet50_3-5_2115.ckpt
├── fruit_360_renet50_3-6_2115.ckpt
├── fruit_360_renet50_3-7_2115.ckpt
├── fruit_360_renet50_3-8_2115.ckpt
├── fruit_360_renet50_3-9_2115.ckpt
└── fruit_360_renet50_3-graph.meta
4.2.2 Training on 2 GPUs

For comparison, let's also train on 2 GPUs with the commands below. To show that the choice of cards is arbitrary, the GPUs are deliberately not picked in order:

export CUDA_VISIBLE_DEVICES=2,3
mpirun -n 2 python3 gpu_distributed_train.py
During training, the output looks like this:

rank size: 2, rank id: 0
rank size: 2, rank id: 1
train dataset batch num: 2115
test dataset batch num: 709
train dataset batch num: 2115
test dataset batch num: 709
[WARNING] PRE_ACT(295459,7ff930118740,python3):2022-07-13-17:31:07.210.231 [mindspore/ccsrc/backend/common/pass/communication_op_fusion.cc:198] GetAllReduceSplitSegment] Split threshold is 0. AllReduce nodes will take default fusion strategy.
[WARNING] PRE_ACT(295460,7f5fed564740,python3):2022-07-13-17:31:07.649.536 [mindspore/ccsrc/backend/common/pass/communication_op_fusion.cc:198] GetAllReduceSplitSegment] Split threshold is 0. AllReduce nodes will take default fusion strategy.
epoch: 1 step: 2115, loss is 4.391518592834473
epoch: 1 step: 2115, loss is 4.337993621826172
……
epoch: 10 step: 2115, loss is 2.7631659507751465
epoch: 10 step: 2115, loss is 3.0124118328094482
eval result: {'accuracy': 0.6057827926657263}
eval result: {'accuracy': 0.6202397743300423}
Use the tree ckpt command to check the checkpoint directory; the output is as follows:

ckpt/
├── fruit_360_renet50_0-10_2115.ckpt
├── fruit_360_renet50_0-1_2115.ckpt
├── fruit_360_renet50_0-2_2115.ckpt
├── fruit_360_renet50_0-3_2115.ckpt
├── fruit_360_renet50_0-4_2115.ckpt
├── fruit_360_renet50_0-5_2115.ckpt
├── fruit_360_renet50_0-6_2115.ckpt
├── fruit_360_renet50_0-7_2115.ckpt
├── fruit_360_renet50_0-8_2115.ckpt
├── fruit_360_renet50_0-9_2115.ckpt
├── fruit_360_renet50_0-graph.meta
├── fruit_360_renet50_1-10_2115.ckpt
├── fruit_360_renet50_1-1_2115.ckpt
├── fruit_360_renet50_1-2_2115.ckpt
├── fruit_360_renet50_1-3_2115.ckpt
├── fruit_360_renet50_1-4_2115.ckpt
├── fruit_360_renet50_1-5_2115.ckpt
├── fruit_360_renet50_1-6_2115.ckpt
├── fruit_360_renet50_1-7_2115.ckpt
├── fruit_360_renet50_1-8_2115.ckpt
├── fruit_360_renet50_1-9_2115.ckpt
└── fruit_360_renet50_1-graph.meta
4.2.3 Comparing the Multi-GPU Runs

Putting the results from section 3.2 together with the 4-GPU and 2-GPU runs: in the three cases the per-card batch_size was 32, 8 and 16 respectively (single card, 4 cards, 2 cards), while the number of batches per epoch stayed the same (2115). You can also view this as a way to reach a larger effective batch_size through multiple cards when a single GPU's memory cannot support it.

Judged by the actual training results (10 epochs in every case), the single-card run performed best, the 2-card run came next, and the 4-card run was worst. The cause is the BatchNorm2d operator in the network: in the multi-card case its statistics are not computed across cards, so each card normalizes over its own smaller per-card batch, which leads to the accuracy gap. On GPU hardware I have not yet found a satisfactory workaround.

5. Multi-GPU Training – Without OpenMPI

Section 4 showed how to run GPU multi-card training with OpenMPI; MindSpore also supports GPU multi-card training that does not depend on OpenMPI. The official statement reads roughly as follows: to meet security and reliability requirements during training, MindSpore on GPU also supports distributed training without OpenMPI. In distributed training, OpenMPI provides host-side data synchronization and inter-process networking; MindSpore replaces this capability by reusing the Parameter Server training architecture.

However, the documentation and code samples around Parameter Server mode are rather thin. I tried to train this way, following the official documentation and the test cases on Gitee, but in the end could not get the whole pipeline to work.

6. Summary

This article mainly covered how to run multi-GPU training on GPU hardware with OpenMPI. The OpenMPI-free Parameter Server approach was touched on as well, but because of missing official documentation and insufficient sample code, no working example could be put together.

7. Problems Encountered

The official documentation for Parameter Server mode skips too many steps, and the related test cases are missing the intermediate code. I hope the documentation and code for this part get fleshed out.

8. References

深度学习中的分布式训练
MindSpore 分布式并行总览
MindSpore 分布式并行训练基础样例(GPU)
MindSpore Parameter Server 模式

This is an original article; the copyright belongs to the author and it may not be reproduced without permission.
