Machine learning: MindSpore error "Select GPU kernel op fail! Incompatible data type!"

1 Error description
1.1 System environment
Hardware Environment(Ascend/GPU/CPU): GPU
Software Environment:
– MindSpore version (source or binary): 1.5.2
– Python version (e.g., Python 3.7.5): 3.7.6
– OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

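1.2 Basic information
1.2.1 Script
The training script builds a single-operator network around BatchNorm and uses it to normalize a Tensor. The script is as follows: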

import numpy as np
import mindspore
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor


class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.batch_norm = ops.BatchNorm()

    def construct(self, input_x, scale, bias, mean, variance):
        output = self.batch_norm(input_x, scale, bias, mean, variance)
        return output


net = Net()
# All five inputs are float16; this is the combination that triggers the error.
input_x = Tensor(np.ones([2, 2]), mindspore.float16)
scale = Tensor(np.ones([2]), mindspore.float16)
bias = Tensor(np.ones([2]), mindspore.float16)
mean = Tensor(np.ones([2]), mindspore.float16)
variance = Tensor(np.ones([2]), mindspore.float16)
output = net(input_x, scale, bias, mean, variance)
print(output)

1.2.2 Error message
The error message is as follows:

Traceback (most recent call last):
  File "116945.py", line 22, in <module>
    output = net(input_x, scale, bias, mean, variance)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 407, in __call__
    out = self.compile_and_run(*inputs)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 734, in compile_and_run
    self.compile(*inputs)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 721, in compile
    _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/common/api.py", line 551, in compile
    result = self._graph_executor.compile(obj, args_list, phase, use_vm, self.queue_name)
TypeError: mindspore/ccsrc/runtime/device/gpu/kernel_info_setter.cc:355 PrintUnsupportedTypeException] Select GPU kernel op[BatchNorm] fail! Incompatible data type!
The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]
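
Cause analysis

Looking at the error message, the TypeError reads:

Select GPU kernel op[BatchNorm] fail! Incompatible data type!
The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]

In other words, the GPU backend does not support the combination of input data types we passed in. The message also lists the combinations that are supported: either all inputs are float32, or input_x is float16 and the remaining inputs are float32. Checking the script shows that every input is float16, which is why the error is raised.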
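
The comparison described above can be sketched in a few lines of plain Python. This is only an illustration: the two supported combinations are copied from the error message, while the names ARG_NAMES, SUPPORTED_INPUT_DTYPES, is_supported and offending_args are hypothetical helpers introduced here, not part of MindSpore.

```python
# Supported input dtype combinations, copied from the error message.
# Positions follow the operator's inputs: input_x, scale, bias, mean, variance.
ARG_NAMES = ("input_x", "scale", "bias", "mean", "variance")
SUPPORTED_INPUT_DTYPES = [
    ("float32", "float32", "float32", "float32", "float32"),
    ("float16", "float32", "float32", "float32", "float32"),
]


def is_supported(dtypes):
    """Return True if the dtype tuple matches one of the listed combinations."""
    return tuple(dtypes) in SUPPORTED_INPUT_DTYPES


def offending_args(dtypes):
    """Return the argument names that differ from the closest listed combination."""
    dtypes = tuple(dtypes)
    best = None
    for combo in SUPPORTED_INPUT_DTYPES:
        diff = [name for name, got, want in zip(ARG_NAMES, dtypes, combo) if got != want]
        if best is None or len(diff) < len(best):
            best = diff
    return best


failing = ("float16",) * 5
print(is_supported(failing))    # False
print(offending_args(failing))  # ['scale', 'bias', 'mean', 'variance']
```

The second helper points at the four inputs that need to change, which is the smallest edit that reaches a supported combination.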

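2 Solution
Based on the cause identified above, the fix is straightforward: keep input_x as float16 and pass the other four inputs as float32.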

import numpy as np
import mindspore
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor


class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.batch_norm = ops.BatchNorm()

    def construct(self, input_x, scale, bias, mean, variance):
        output = self.batch_norm(input_x, scale, bias, mean, variance)
        return output


net = Net()
# input_x may stay float16; scale, bias, mean and variance must be float32.
input_x = Tensor(np.ones([2, 2]), mindspore.float16)
scale = Tensor(np.ones([2]), mindspore.float32)
bias = Tensor(np.ones([2]), mindspore.float32)
mean = Tensor(np.ones([2]), mindspore.float32)
variance = Tensor(np.ones([2]), mindspore.float32)

output = net(input_x, scale, bias, mean, variance)
print(output)

The script now runs successfully and prints the following:

output: (Tensor(shape=[2, 2], dtype=Float16, value=
[[ 1.0000e+00, 1.0000e+00],
[ 1.0000e+00, 1.0000e+00]]),
Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]),
Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]),
Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]),
Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]))
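
The all-ones first output is what one would expect if the operator normalizes each channel with the given statistics, i.e. scale * (x - mean) / sqrt(variance + epsilon) + bias. The plain-Python sketch below reproduces that arithmetic. It assumes inference mode and epsilon = 1e-5, which are not stated in this post, and batch_norm_inference is a hypothetical helper written for illustration, not MindSpore code.

```python
import math


def batch_norm_inference(rows, scale, bias, mean, variance, epsilon=1e-5):
    """Normalize each column (channel) of a 2-D list with fixed statistics."""
    return [
        [
            scale[c] * (x - mean[c]) / math.sqrt(variance[c] + epsilon) + bias[c]
            for c, x in enumerate(row)
        ]
        for row in rows
    ]


ones = [1.0, 1.0]
result = batch_norm_inference([ones, ones], ones, ones, ones, ones)
print(result)  # [[1.0, 1.0], [1.0, 1.0]]
```

Because x equals mean in every position, the normalized term is zero and only the bias of 1.0 remains.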

3 Summary
Steps for locating the problem:

1. Find the line of user code that raised the error: output = net(input_x, scale, bias, mean, variance)

2. Use the keywords in the error log to narrow down the scope of the problem: The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]

3. Pay particular attention to whether the variables are defined and initialized correctly, including their data types.
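
Step 2 can also be done mechanically when the message is long. The sketch below pulls the supported input signatures and the actual one out of the error text with a regular expression. It assumes the message keeps the "in[...]" and "but get" layout shown in this post; parse_dtype_error is a hypothetical helper, not a MindSpore API.

```python
import re

# Second line of the error message, copied from the log.
ERROR_TEXT = (
    "The supported data types are "
    "in[float32 float32 float32 float32 float32], "
    "out[float32 float32 float32 float32 float32]; "
    "in[float16 float32 float32 float32 float32], "
    "out[float16 float32 float32 float32 float32]; , "
    "but get in [float16 float16 float16 float16 float16 ] "
    "out [float16 float16 float16 float16 float16 ]"
)

# Matches "in[...]" or "in [...]" and captures the dtype list inside.
IN_LIST = re.compile(r"\bin\s*\[([^\]]*)\]")


def parse_dtype_error(message):
    """Split the message into supported input signatures and the actual one."""
    supported_part, _, actual_part = message.partition("but get")
    supported = [tuple(body.split()) for body in IN_LIST.findall(supported_part)]
    actual_match = IN_LIST.search(actual_part)
    actual = tuple(actual_match.group(1).split()) if actual_match else ()
    return supported, actual


supported, actual = parse_dtype_error(ERROR_TEXT)
print(supported)
print(actual)
print(actual in supported)  # False
```

Comparing the two results position by position shows which inputs to change.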

4 References
4.1 BatchNorm operator API documentation
