Chips: [Experience Sharing] How to Add Layer-by-Layer Operators in Cambricon pytorch-mlu

This tutorial describes how to add layer-by-layer operators to pytorch-mlu on Cambricon devices.

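In pytorch-mlu's layer-by-layer mode, the tensor is the basic unit for passing and storing data between operators. pytorch-mlu dispatches each operator to a device according to the device attribute of the tensor. Taking the abs() operator as an example, the dispatch stage routes the operator call to a specific device based on the device attribute of input_tensor, as shown in the figure below:

[Figure: operator dispatch based on the device attribute of input_tensor]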
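As a rough illustration of that dispatch step, here is a minimal pure-Python sketch. It is not pytorch-mlu code: `FakeTensor`, `KERNELS`, `register`, and `dispatch` are hypothetical names, and the real dispatcher lives in C++ inside pytorch. The sketch only shows the idea of picking an implementation from the device attribute of the input tensor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: data plus a device tag."""
    data: List[float]
    device: str  # "cpu" or "mlu"


# (operator name, device) -> implementation; filled in by registration.
KERNELS: Dict[Tuple[str, str], Callable[[FakeTensor], FakeTensor]] = {}


def register(op: str, device: str):
    """Decorator that records an implementation for one operator on one device."""
    def wrap(fn):
        KERNELS[(op, device)] = fn
        return fn
    return wrap


@register("abs", "cpu")
def abs_cpu(t: FakeTensor) -> FakeTensor:
    return FakeTensor([abs(x) for x in t.data], "cpu")


@register("abs", "mlu")
def abs_mlu(t: FakeTensor) -> FakeTensor:
    # A real backend would launch a device kernel here.
    return FakeTensor([abs(x) for x in t.data], "mlu")


def dispatch(op: str, t: FakeTensor) -> FakeTensor:
    """Route the call using the device attribute of the input tensor."""
    kernel = KERNELS.get((op, t.device))
    if kernel is None:
        raise NotImplementedError(f"{op} is not registered for device {t.device!r}")
    return kernel(t)


result = dispatch("abs", FakeTensor([-1.0, 2.0, -3.0], "mlu"))
print(result.device, result.data)
```

The caller never names a device explicitly; the tensor carries that information, which is what lets a new backend be added by registration alone.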

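Catch adds MLU operators through registration, which keeps it decoupled from the pytorch source code. The following describes the concrete steps for adding an MLU operator in Catch.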

Register the operator in catch/torch_mlu/csrc/generated/aten_mlu_type_default.cpp:

.op(torch::RegisterOperators::options()
  .schema("aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor")  // NOLINT
  .impl_unboxedOnlyKernel<at::Tensor(const at::Tensor &, const at::Tensor &, at::Scalar), &AtenMluType::add>(at::TensorTypeId::MLUTensorId)
  .aliasAnalysis(c10::AliasAnalysisKind::FROM_SCHEMA))
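To make the schema string in the registration easier to read, the following Python sketch splits it into its parts. `parse_schema` is a hypothetical helper written for this post, not a pytorch API, and it only handles simple schemas like the one above.

```python
import re

SCHEMA = "aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"


def parse_schema(schema: str) -> dict:
    """Split a schema string into namespace, name, overload, arguments, and return type."""
    m = re.fullmatch(r"(\w+)::(\w+)(?:\.(\w+))?\((.*)\)\s*->\s*(.+)", schema)
    if m is None:
        raise ValueError(f"unrecognized schema: {schema!r}")
    namespace, name, overload, arg_text, returns = m.groups()
    args = []
    keyword_only = False
    for part in (p.strip() for p in arg_text.split(",") if p.strip()):
        if part == "*":
            keyword_only = True  # everything after "*" is keyword-only
            continue
        type_name, _, rest = part.partition(" ")
        arg_name, _, default = rest.partition("=")
        args.append({
            "type": type_name,
            "name": arg_name,
            "default": default or None,
            "keyword_only": keyword_only,
        })
    return {"namespace": namespace, "name": name, "overload": overload,
            "args": args, "returns": returns.strip()}


parsed = parse_schema(SCHEMA)
print(parsed["name"], parsed["overload"], [a["name"] for a in parsed["args"]])
```

The three parsed arguments (Tensor, Tensor, Scalar) line up with the C++ signature given to impl_unboxedOnlyKernel: two const at::Tensor& parameters and one at::Scalar.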

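AtenMluType and AtenMluCustomType are the entry points for operators in the Catch module. The AtenMluType class mainly contains the framework's standard operators, while the AtenMluCustomType class contains custom operators. Depending on the kind of operator, add its declaration and implementation to either AtenMluType or AtenMluCustomType.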

Dispatching standard operators
Add the operator declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_type.cpp:

aten_mlu_type.h
static at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
aten_mlu_type.cpp
at::Tensor AtenMluType::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha) {
  return OP_DISPATCH(add, self, other, alpha);
}

Dispatching custom operators

For MLU-specific operators, add the operator declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_custom_type.cpp:

aten_mlu_type.h
static at::Tensor linear(const at::Tensor& input,
                         const at::Tensor& weight,
                         const at::Tensor& bias,
                         const at::Tensor& q_scale,
                         const at::Tensor& q_mode);
aten_mlu_custom_type.cpp
at::Tensor AtenMluCustomType::linear(const at::Tensor& input,
                                     const at::Tensor& weight,
                                     const at::Tensor& bias,
                                     const at::Tensor& q_scale,
                                     const at::Tensor& q_mode) {
  return OP_DISPATCH(linear, input, weight, bias, q_scale, q_mode);
}

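Both AtenMluType and AtenMluCustomType pass the call down through OpMethods to an inference operator or a training operator. Add the operator declaration and implementation in catch/torch_mlu/csrc/aten/operators/op_methods.h and catch/torch_mlu/csrc/aten/operators/op_methods.cpp. The implementation in OpMethods is the CPU implementation of the operator.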

op_methods.h
virtual at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self,
                          const at::Tensor& other,
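                          at::Scalar alpha) {
  // CPU implementation: copy both inputs to the CPU, add there,
  // then move the result back to the MLU.
  auto input_cpu = self.cpu();
  auto other_cpu = other.cpu();
  auto output = at::add(input_cpu, other_cpu, alpha);
  return output.to(at::Device(at::Device::Type::MLU));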
}

Add the inference operator declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml_ops.h and catch/torch_mlu/csrc/aten/operators/cnml_ops.cpp.

cnml_ops.h
at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
cnml_ops.cpp
at::Tensor CnmlOps::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha) {
  // CNML_DISPATCH: the first argument is the interface name, the second is the wrapper name, the rest ...
  CNML_DISPATCH(add, cnml_add, self, other, alpha);
}
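The relationship between OpMethods and CnmlOps can be sketched in a few lines of Python. This assumes, as the `virtual` declaration and the "base class" wording suggest, that CnmlOps derives from OpMethods and overrides only the operators it supports. The `mul` method and the `device_calls` list are hypothetical additions for illustration, and plain Python lists stand in for tensors.

```python
class OpMethods:
    """Base class: every operator here is the CPU implementation."""

    def add(self, a, b, alpha=1):
        return [x + alpha * y for x, y in zip(a, b)]

    def mul(self, a, b):
        return [x * y for x, y in zip(a, b)]


class CnmlOps(OpMethods):
    """Device-specific subclass: overrides only the operators it supports."""

    def __init__(self):
        self.device_calls = []

    def add(self, a, b, alpha=1):
        # Stand-in for handing the call to the device wrapper (cnml_add).
        self.device_calls.append("add")
        return [x + alpha * y for x, y in zip(a, b)]


ops = CnmlOps()
added = ops.add([1, 2], [3, 4], alpha=2)   # overridden: takes the device path
multiplied = ops.mul([1, 2], [3, 4])       # not overridden: uses the base-class CPU path
print(added, multiplied, ops.device_calls)
```

With this layout, an operator that has no device implementation yet still works through the base class, only more slowly.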

A wrapper encapsulates an operator's kernel, and each operator has one wrapper. Taking the add operator as an example, add the wrapper as follows:

cnml_kernel.h
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha);
add.cpp
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha_scalar) {
  TORCH_CHECK(input.dim() >= 0 || other.dim() >= 0, "dimension not support");
  at::Tensor input_ = input;
  at::Tensor other_ = other;
  auto alpha_data = alpha_scalar.to<scalar_t>();
  if (alpha_data != 1) {
    // alpha != 1: scale other_ by alpha before adding
    other_ = cnml::ops::cnml_scale(other_, alpha_data, 0);
  }
  // A zero-dimensional CPU tensor is treated as a scalar operand.
  if (other_.dim() < 1 && other_.device().type() == c10::DeviceType::CPU) {
    auto other_scalar = other_.item();
    return cnml_add_internal(input_, other_scalar);  // call the kernel
  }
  if (input_.dim() < 1 && input_.device().type() == c10::DeviceType::CPU) {
    auto input_scalar = input_.item();
    return cnml_add_internal(other_, input_scalar);  // call the kernel
  }

  // Shapes differ: expand both inputs to the common broadcast shape first.
  bool broadcast = input_.sizes() != other_.sizes();
  if (broadcast) {
    auto broadcast_size = at::infer_size(input.sizes(), other.sizes());
    at::Tensor broadcast1 = cnml::ops::cnml_expand(input_, broadcast_size, false);
    at::Tensor broadcast2 = cnml::ops::cnml_expand(other_, broadcast_size, false);
    return cnml_add_internal(broadcast1, broadcast2);  // call the kernel
  }
  return cnml_add_internal(input_, other_);  // call the kernel
}
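The broadcast branch of the wrapper relies on at::infer_size to compute the common shape. The following pure-Python sketch reimplements the usual broadcasting rule (align from the last dimension; a size of 1 stretches to match) and lists the steps the wrapper would take. `infer_size` here is a re-creation for illustration rather than the ATen function itself, `plan_add` is a hypothetical helper, and the scalar branches are left out.

```python
from typing import List, Sequence


def infer_size(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Return the broadcast shape of two shapes, aligning from the last dimension."""
    ndim = max(len(a), len(b))
    out = [1] * ndim
    for i in range(1, ndim + 1):
        dim_a = a[-i] if i <= len(a) else 1
        dim_b = b[-i] if i <= len(b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"shapes {tuple(a)} and {tuple(b)} cannot be broadcast")
        out[-i] = dim_b if dim_a == 1 else dim_a
    return out


def plan_add(shape_a: Sequence[int], shape_b: Sequence[int], alpha=1) -> List[str]:
    """List the wrapper's steps in order (step names mirror the C++ listing)."""
    steps = []
    if alpha != 1:
        steps.append("cnml_scale(other, alpha)")
    if tuple(shape_a) != tuple(shape_b):
        target = infer_size(shape_a, shape_b)
        steps.append(f"cnml_expand(both inputs, {target})")
    steps.append("cnml_add_internal")
    return steps


shape = infer_size((2, 1, 4), (3, 1))
plan = plan_add((2, 3), (3,), alpha=2)
print(shape)
print(plan)
```

When the shapes already match and alpha is 1, the plan collapses to the single kernel call, which is the fast path at the end of the wrapper.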

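The wrapper implements the operator's functionality by calling a kernel; the example calls cnml_add_internal. The operator itself is implemented mainly through the interfaces of the CNML library. The programming logic of the CNML library is shown below:

[Figure: CNML library programming logic]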

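The kernel is implemented by calling the CNML library interfaces according to the programming logic above. Add the kernel function's declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml/internal/cnml_internal.h and catch/torch_mlu/csrc/aten/operators/cnml/internal/add_internal.cpp.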

cnml_internal.h
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2);
add_internal.cpp
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2) {
  auto output = at::native::empty_like(input1);
  // prepare input cnml tensor
  auto* input1_impl = getMluTensorImpl(input1);  // get the MluTensorImpl
  auto input1_cnml = input1_impl->CreateCnmlTensor(CNML_TENSOR, toCnmlDataType(input1.dtype()));  // toCnmlDataType() adapts the data type
       
  auto* input2_impl = getMluTensorImpl(input2);
  auto input2_cnml = input2_impl->CreateCnmlTensor(CNML_TENSOR, toCnmlDataType(input2.dtype()));
      
  // prepare output cnml tensor
  auto* output_impl = getMluTensorImpl(output);
  auto output_cnml = output_impl->CreateCnmlTensor(CNML_TENSOR, toCnmlDataType(output.dtype()));
      
  // End the execution flow if not MLU device
  CHECK_MLU_DEVICE(output);
  
  // setup operator
  cnmlBaseOp_t add_op;
  TORCH_CNML_CHECK(cnmlCreateAddOp(&add_op, input1_cnml, input2_cnml, output_cnml));
  
  // return to JIT if running mode is fuse
  CHECK_RETURN_TO_FUSE(add_op, output);
  
  // compile op
  TORCH_CNML_CHECK(cnmlCompileBaseOp(add_op, GET_CORE_VERSION, GET_CORE_NUMBER));
  
  auto queue = getCurQueue();
  TORCH_CNML_CHECK(cnmlComputeAddOpForward_V4(add_op,
                                              NULL,
                                              input1_impl->raw_mutable_data(),
                                              NULL,
                                              input2_impl->raw_mutable_data(),
                                              NULL,
                                              output_impl->raw_mutable_data(),
                                              queue,
                                              NULL));
  syncQueue(queue);
  TORCH_CNML_CHECK(cnmlDestroyBaseOp(&add_op));
   
  return output;
}

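Handling operators that the MLU does not support

For operations that the MLU does not yet support, the input data is copied to the CPU, the corresponding CPU operation is called so that the computation runs on the CPU, and the output is finally copied back to the MLU. For the implementation, see op_methods.cpp in the catch/torch_mlu/csrc/aten/operators/ directory.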

op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self,
                          const at::Tensor& other,
                          at::Scalar alpha) {
  auto input_cpu = self.cpu();
  auto other_cpu = other.cpu();
  auto output = at::add(input_cpu, other_cpu, alpha);
  return output.to(at::Device(at::Device::Type::MLU));
}

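If a newly added operator throws an exception during execution and there is no corresponding operator on the CPU, the operation cannot be switched over to run on the CPU.
A wrapper is usually named cnml_<operator name>, and a kernel is usually named cnml_<operator name>_internal.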
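The fallback note above can be pictured with a small Python sketch. It assumes the fallback is triggered when the device operator raises, which is how the note reads; the actual mechanism inside OP_DISPATCH is not shown in this post. All names (`call_with_fallback`, `CPU_OPS`, `MLU_OPS`, the `custom` operator) are hypothetical.

```python
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]


def add_mlu(a, b):
    # Pretend the device kernel rejects this input.
    raise RuntimeError("device kernel failed")


def custom_mlu(a):
    raise RuntimeError("device kernel failed")


CPU_OPS = {"add": add_cpu}  # "custom" has no CPU counterpart
MLU_OPS = {"add": add_mlu, "custom": custom_mlu}


def call_with_fallback(op, *args):
    """Run the device operator; if it raises, retry on the CPU when possible."""
    try:
        return "mlu", MLU_OPS[op](*args)
    except RuntimeError as err:
        cpu_op = CPU_OPS.get(op)
        if cpu_op is None:
            # No CPU operator exists, so the failure cannot be recovered.
            raise RuntimeError(f"{op}: no CPU fallback available") from err
        return "cpu-fallback", cpu_op(*args)


outcome = call_with_fallback("add", [1, 2], [3, 4])
print(outcome)
```

Calling `call_with_fallback("custom", [1])` raises instead of recovering, which is why a CPU counterpart is worth having for every new operator.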

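Write operator unit tests with Python's unittest module. A test supplies the same parameters and input data, runs the operator on the MLU and on the CPU, and compares the two outputs. MLU and CPU results may differ; in general, a relative error within 2% is acceptable.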

def test_add(self):
  # "Tensor + Tensor" mode testing
  for shape1, shape2 in [((1, 3, 224, 224), (1, 3, 224, 224)),
                         ((2, 30, 80), (2, 30, 80)),
                         ((3, 20), (3, 20)),
                         ((10,), (10,))]:
    input1_cpu = torch.rand(shape1, dtype=torch.float)
    input2_cpu = torch.rand(shape2, dtype=torch.float)
    input1_mlu = input1_cpu.to(xm.mlu_device())
    input2_mlu = input2_cpu.to(xm.mlu_device())
    # compute on the CPU
    output_cpu = input1_cpu + input2_cpu
    # compute on the MLU
    output_mlu = input1_mlu + input2_mlu
    # compare the MLU result with the CPU result; the relative error must be within 2%
    self.assertTensorsEqual(output_cpu, output_mlu.cpu(), 0.02, use_MSE=True)
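The exact metric behind assertTensorsEqual(..., use_MSE=True) is not shown in this post. As one plausible reading, the sketch below assumes a relative L2 error, sqrt(sum((cpu - mlu)^2) / sum(cpu^2)), and checks it against the 2% threshold. `relative_l2_error` is a hypothetical helper, and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def relative_l2_error(reference: Sequence[float], actual: Sequence[float]) -> float:
    """sqrt(sum((ref - act)^2) / sum(ref^2)); 0.0 means identical outputs."""
    if len(reference) != len(actual):
        raise ValueError("outputs must have the same number of elements")
    diff_sq = sum((r - a) ** 2 for r, a in zip(reference, actual))
    ref_sq = sum(r ** 2 for r in reference)
    if ref_sq == 0.0:
        # All-zero reference: any difference is an unbounded relative error.
        return 0.0 if diff_sq == 0.0 else math.inf
    return math.sqrt(diff_sq / ref_sq)


cpu_out = [1.0, 2.0, 3.0, 4.0]
mlu_out = [1.01, 1.99, 3.02, 3.98]  # made-up values for illustration
err = relative_l2_error(cpu_out, mlu_out)
print(f"{err:.4f}", err <= 0.02)
```

A metric normalized by the reference output keeps the threshold meaningful across tensors of very different magnitudes, which an absolute tolerance would not.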

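This post described how to add layer-by-layer operators to pytorch-mlu on Cambricon devices, using the add() operator as a worked example. I hope it is of some help to your learning.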


1. Register the operator

2. Dispatch the operator

3. Modify the OpMethods base class

4. Pass the operator down

5. Add the wrapper

6. Add the kernel

7. Test the operator