OneFlow Source Code Analysis: Op, Kernel, and the Interpreter

Written by | Zheng Jianhua (郑建华)
Updated by | Zhao Luyang (赵露阳)

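1 Op and Kernel registration

Continuing to trace the execution flow, we find that when ReluFunctor constructs a UserOpExpr, it uses the Op and Kernel managed by UserOpRegistryMgr. An Op represents the descriptive information of an operator, while a Kernel implements the computation on a particular device.

The registration information is kept in private map member variables. The UserOpRegistryMgr header file (https://github.com/Oneflow-In…) defines 3 macros, REGISTER_USER_OP, REGISTER_USER_OP_GRAD, and REGISTER_USER_KERNEL, which register an op, a grad_op, and a kernel respectively.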

1.1 Registering ReluOp

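REGISTER_USER_OP is responsible for registering a UserOp. Searching the code base shows where this macro is used. The source code related to ReluOp lives in these 3 files:

Class definition:
build/oneflow/core/framework/op_generated.h
Op registration and part of the op implementation:
build/oneflow/core/framework/op_generated.cpp
Main implementation:
oneflow/oneflow/user/ops/relu_op.cpp

After expansion in op_generated.cpp, the REGISTER_USER_OP macro produces the following code: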

static UserOpRegisterTrigger<OpRegistry> g_register_trigger715 =
  ::oneflow::user_op::UserOpRegistryMgr::Get()
  .CheckAndGetOpRegistry("relu")
  .Input("x")
  .Output("y")
  .SetGetSbpFn(&ReluOp::GetSbp)
  .SetLogicalTensorDescInferFn(&ReluOp::InferLogicalTensorDesc)
  .SetPhysicalTensorDescInferFn(&ReluOp::InferPhysicalTensorDesc)
  .SetDataTypeInferFn(&ReluOp::InferDataType);

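The call flow is as follows:

CheckAndGetOpRegistry (https://github.com/Oneflow-In…) creates an OpRegistry (https://github.com/Oneflow-In…) object. Like UserOpRegisterTrigger (https://github.com/Oneflow-In…), this class is only an intermediate type used to construct an OpRegistryResult (https://github.com/Oneflow-In…).

OpRegistry temporarily stores intermediate results and sets some default inference logic in Finish. The constructor of UserOpRegisterTrigger invokes the registration logic. The static variable exists precisely to trigger that constructor and thereby run the registration, which stores the constructed OpRegistryResult in UserOpRegistryMgr (https://github.com/Oneflow-In…), keyed by op_type (for example relu).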
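
The static-variable trick described above can be illustrated with a minimal, self-contained sketch. All type names here (OpInfoSketch, RegistryMgrSketch, OpRegistrySketch, RegisterTriggerSketch) are hypothetical stand-ins invented for illustration; they are not OneFlow's actual classes, and the real Finish step does considerably more.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for OpRegistryResult: the description that gets stored.
struct OpInfoSketch {
  std::string op_type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Hypothetical stand-in for UserOpRegistryMgr: a singleton owning a private map.
class RegistryMgrSketch {
 public:
  static RegistryMgrSketch& Get() {
    static RegistryMgrSketch mgr;  // function-local static, usable during static init
    return mgr;
  }
  void Register(OpInfoSketch info) {
    const std::string key = info.op_type;  // copy the key before moving info
    const bool inserted = ops_.emplace(key, std::move(info)).second;
    assert(inserted && "op_type registered twice");
    (void)inserted;
  }
  const OpInfoSketch* Find(const std::string& op_type) const {
    const auto it = ops_.find(op_type);
    return it == ops_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, OpInfoSketch> ops_;
};

// Hypothetical intermediate builder that collects results, like OpRegistry.
class OpRegistrySketch {
 public:
  explicit OpRegistrySketch(std::string op_type) { info_.op_type = std::move(op_type); }
  OpRegistrySketch& Input(const std::string& name) {
    info_.inputs.push_back(name);
    return *this;
  }
  OpRegistrySketch& Output(const std::string& name) {
    info_.outputs.push_back(name);
    return *this;
  }
  OpInfoSketch Finish() const { return info_; }

 private:
  OpInfoSketch info_;
};

// The constructor performs the registration; it is implicit on purpose so that
// "static Trigger t = builder-chain;" compiles, mirroring the expanded macro.
struct RegisterTriggerSketch {
  RegisterTriggerSketch(const OpRegistrySketch& registry) {
    RegistryMgrSketch::Get().Register(registry.Finish());
  }
};

// Constructing this static object before main() runs is what registers "relu".
static RegisterTriggerSketch g_register_trigger_relu =
    OpRegistrySketch("relu").Input("x").Output("y");
```

In this sketch, any code that runs after static initialization can call RegistryMgrSketch::Get().Find("relu") and receive the stored description. Using a function-local static inside Get() avoids depending on the initialization order of globals across translation units.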

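ReluOp represents one concrete op_type and supplies OpRegistryResult with the op-specific methods.

OpRegistryResult abstracts different Ops into one common structure, which makes unified registration and management easier. It mainly holds descriptive information: the op's input/output descriptions and the inference functions for data type, sbp, and so on. For relu, it mainly records that those inference functions call the static methods of ReluOp; op_def mainly holds the input/output names.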

1.2 Registering ReluKernel

ReluKernel is registered in relu_kernel.cpp, in a process similar to Op registration. The REGISTER_USER_KERNEL macro expands as follows:

static UserOpRegisterTrigger<OpKernelRegistry> g_register_trigger0 =
  UserOpRegistryMgr::Get()
    .CheckAndGetOpKernelRegistry("relu")
    .SetCreateFn(...)
    .SetIsMatchedHob(UnaryPrimitiveExists(ep::primitive::UnaryOp::kRelu, "y", "x"))
    .SetInplaceProposalFn([](const user_op::InferContext&,
                             const user_op::AddInplaceArgPair& AddInplaceArgPairFn) -> Maybe<void> {
      OF_RETURN_IF_ERROR(AddInplaceArgPairFn("y", 0, "x", 0, true));
      return Maybe<void>::Ok();
    });

Note that SetCreateFn merely assigns a lambda expression like the one below to result_.create_fn. This field is important: later execution obtains the kernel through it.

[]() {
  return user_op::NewOpKernel<UnaryPrimitiveKernel>(
      "y", "x", [](user_op::KernelComputeContext* ctx) {
        const user_op::TensorDesc* src = ctx->TensorDesc4ArgNameAndIndex("x", 0);
        const user_op::TensorDesc* dst = ctx->TensorDesc4ArgNameAndIndex("y", 0);
        return ep::primitive::NewPrimitive<ep::primitive::ElementwiseUnaryFactory>(
            ctx->device_type(), ep::primitive::UnaryOp::kRelu, src->data_type(),
            dst->data_type());
      });
}
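
To make the role of a stored create_fn more concrete, here is a rough sketch of a kernel registry that keeps a match predicate and a factory per entry. Everything in it (KernelSketch, KernelEntrySketch, KernelRegistrySketch, the string-based device check) is a simplifying assumption for illustration, not the actual OneFlow API; in particular, keeping several entries per op name is an assumption of this sketch.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical kernel interface; the real OpKernel interface is much richer.
struct KernelSketch {
  virtual ~KernelSketch() = default;
  virtual float Compute(float x) const = 0;
};

struct CpuReluKernelSketch final : KernelSketch {
  float Compute(float x) const override { return x > 0.0f ? x : 0.0f; }
};

// Hypothetical stand-in for OpKernelRegistryResult: a predicate plus a factory.
struct KernelEntrySketch {
  std::function<bool(const std::string& device)> is_matched;
  std::function<std::unique_ptr<KernelSketch>()> create_fn;
};

class KernelRegistrySketch {
 public:
  static KernelRegistrySketch& Get() {
    static KernelRegistrySketch registry;
    return registry;
  }
  void Register(const std::string& op_type_name, KernelEntrySketch entry) {
    assert(entry.is_matched && entry.create_fn);
    entries_[op_type_name].push_back(std::move(entry));
  }
  // Creates a kernel from the first entry whose predicate matches, else nullptr.
  std::unique_ptr<KernelSketch> Create(const std::string& op_type_name,
                                       const std::string& device) const {
    const auto it = entries_.find(op_type_name);
    if (it == entries_.end()) { return nullptr; }
    for (const KernelEntrySketch& entry : it->second) {
      if (entry.is_matched(device)) { return entry.create_fn(); }
    }
    return nullptr;
  }

 private:
  std::map<std::string, std::vector<KernelEntrySketch>> entries_;
};

// The constructor registers the entry, triggered by the static object below.
struct KernelRegisterTriggerSketch {
  KernelRegisterTriggerSketch(const std::string& op_type_name, KernelEntrySketch entry) {
    KernelRegistrySketch::Get().Register(op_type_name, std::move(entry));
  }
};

static KernelRegisterTriggerSketch g_relu_kernel_trigger(
    "relu",
    KernelEntrySketch{
        [](const std::string& device) { return device == "cpu"; },
        []() { return std::unique_ptr<KernelSketch>(new CpuReluKernelSketch()); }});
```

The point of the sketch is that registration stores only a factory; no kernel object exists until Create is called with an op name and a device that some entry accepts.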

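For relu, NewOpKernel simply news a UnaryPrimitiveKernel object and returns a pointer to it. As the final result of registration, the OpKernelRegistryResult is stored in UserOpRegistryMgr, keyed by op_type_name (for example "relu").

1.3 Class diagram for Op and Kernel registration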

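2 Constructing UserOpExpr

As mentioned in the previous article, the functional::Relu function in functional_api.yaml.cpp obtains the pre-registered PackedFunctor<impl::ReluFunctor> via find("Relu"); calling its call method executes impl::ReluFunctor.

The core code of ReluFunctor (https://github.com/Oneflow-In…) is as follows: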

class ReluFunctor {
 public:
  ReluFunctor() { op_ = CHECK_JUST(one::OpBuilder("relu").Input("x", 1).Output("y", 1).Build()); }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x, bool inplace) const {
    // inplace-related logic omitted
    return OpInterpUtil::Dispatch<Tensor>(*op_, {x});
  }
 private:
  std::shared_ptr<OpExpr> op_;
};

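The main job of the ReluFunctor (https://github.com/Oneflow-In…) constructor is to construct a UserOpExpr (https://github.com/Oneflow-In…).

Every user op, after going through OpBuilder's Build(), produces a corresponding UserOpExpr. It stores the attributes and the inference methods for type, shape, device, and so on, which are used in the subsequent op/kernel computation. UserOpExpr contains the following members: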

base_attrs_
tensor_desc_infer_fn_
dtype_infer_fn_
device_and_stream_infer_fn_

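They store, respectively, the attrs of the user op, the inference method for input/output tensor shapes, the inference method for the data type, and the inference method for the device and compute stream. Besides the commonly used UserOpExpr, there are also some BuiltinOpExpr types used for system ops.

OpBuilder's Input/Output calls mainly operate on the UserOpConf proto object. The Build function modifies the UserOpConf object, for example by filling default values into attr according to OpRegistryResult::op_def.

A UserOpExpr object is then constructed. The UserOpConf object is stored in the op_proto_ field of BuiltinOpExprImpl<UserOpConf>, the parent class of UserOpExpr. For relu, op_proto_ mainly holds input, output, and similar information. When UserOpExpr is initialized, it copies the function variables from OpRegistryResult.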
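
The build steps just described (collect inputs and outputs into a conf, fill in default attrs, then copy the inference functions into the expression object) might look roughly like the following sketch. The types RegisteredOpSketch, OpConfSketch, OpExprSketch, and OpBuilderSketch, as well as the "x_0" style argument naming and the string-valued attrs, are assumptions made up for this example and do not reflect OneFlow's real definitions.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Shape = std::vector<int>;

// Hypothetical stand-in for OpRegistryResult.
struct RegisteredOpSketch {
  std::map<std::string, std::string> default_attrs;
  std::function<Shape(const Shape&)> shape_infer_fn;
};

// Hypothetical stand-in for the UserOpConf proto object.
struct OpConfSketch {
  std::string op_type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
  std::map<std::string, std::string> attrs;
};

// Hypothetical stand-in for UserOpExpr: keeps the conf and a copied infer function.
class OpExprSketch {
 public:
  OpExprSketch(OpConfSketch conf, const RegisteredOpSketch& reg)
      : conf_(std::move(conf)), shape_infer_fn_(reg.shape_infer_fn) {
    assert(shape_infer_fn_ && "registration must provide a shape inference function");
  }
  const OpConfSketch& conf() const { return conf_; }
  Shape InferShape(const Shape& in) const { return shape_infer_fn_(in); }

 private:
  OpConfSketch conf_;
  std::function<Shape(const Shape&)> shape_infer_fn_;
};

class OpBuilderSketch {
 public:
  OpBuilderSketch(std::string op_type, const RegisteredOpSketch& reg) : reg_(reg) {
    conf_.op_type = std::move(op_type);
  }
  OpBuilderSketch& Input(const std::string& name, int count) {
    AddArgs(&conf_.inputs, name, count);
    return *this;
  }
  OpBuilderSketch& Output(const std::string& name, int count) {
    AddArgs(&conf_.outputs, name, count);
    return *this;
  }
  OpBuilderSketch& Attr(const std::string& key, const std::string& value) {
    conf_.attrs[key] = value;
    return *this;
  }
  std::shared_ptr<OpExprSketch> Build() {
    // insert() keeps values the caller already set and only fills missing defaults.
    for (const auto& kv : reg_.default_attrs) { conf_.attrs.insert(kv); }
    return std::make_shared<OpExprSketch>(conf_, reg_);
  }

 private:
  static void AddArgs(std::vector<std::string>* args, const std::string& name, int count) {
    for (int i = 0; i < count; ++i) { args->push_back(name + "_" + std::to_string(i)); }
  }
  const RegisteredOpSketch& reg_;  // must outlive the builder
  OpConfSketch conf_;
};
```

A call such as OpBuilderSketch("relu", reg).Input("x", 1).Output("y", 1).Build() then yields an expression object that can run shape inference later without consulting the registry again, which is the reason for copying the function at construction time.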

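3 Executing the Functor

The core logic of executing ReluFunctor is the call to OpInterpUtil::Dispatch. The call sequence is as follows:

The whole chain is long; these notes only explain the main execution flow under Eager Local Mode.

3.1 Choosing an interpreter based on the environment and inputs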

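The GetInterpreter (https://github.com/Oneflow-In…) called by Dispatch returns an AutogradInterpreter (https://github.com/Oneflow-In…) object. This class adds autograd functionality on top of the OpExprInterpreter member variable it contains. Inside GetInterpreter, one of the following 3 interpreters is actually constructed, and it is converted to an AutogradInterpreter when the Build function returns.

LazyInterpreter: for lazy mode, the distributed static-graph execution mode
EagerLocalInterpreter: for eager local mode, local single-device execution (aligned with PyTorch single-GPU or DDP)
EagerGlobalInterpreter: for eager global mode, the distributed dynamic-graph execution mode

The relationship among the interpreters is as follows:

GetInterpreter's role is to choose a suitable interpreter based on the inputs, the environment, and other information.

Dispatch then calls the interpreter's AutogradInterpreter::Apply method, which in turn calls internal_->Apply(…) (https://github.com/Oneflow-In…), that is, the Apply method of one of the 3 interpreters above.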
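
The wrap-and-delegate structure described in this subsection can be sketched as follows. The class names and the two boolean flags are hypothetical, and the selection rule (lazy mode first, then global inputs, otherwise local) is a simplification assumed for illustration; the real GetInterpreter inspects more context than this.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical common interface for the three concrete interpreters.
struct InterpreterSketch {
  virtual ~InterpreterSketch() = default;
  virtual std::string Apply(const std::string& op) const = 0;
};

struct LazySketch final : InterpreterSketch {
  std::string Apply(const std::string& op) const override { return "lazy:" + op; }
};
struct EagerLocalSketch final : InterpreterSketch {
  std::string Apply(const std::string& op) const override { return "eager_local:" + op; }
};
struct EagerGlobalSketch final : InterpreterSketch {
  std::string Apply(const std::string& op) const override { return "eager_global:" + op; }
};

// Hypothetical wrapper playing the role of AutogradInterpreter: it forwards the
// call to the wrapped interpreter and then does its own autograd bookkeeping.
class AutogradSketch {
 public:
  explicit AutogradSketch(std::shared_ptr<InterpreterSketch> internal)
      : internal_(std::move(internal)) {
    assert(internal_ != nullptr);
  }
  std::string Apply(const std::string& op) const {
    const std::string forward_result = internal_->Apply(op);
    // A real implementation would record a backward node here; the sketch
    // only marks the result so the delegation is visible.
    return forward_result + "+grad_node";
  }

 private:
  std::shared_ptr<InterpreterSketch> internal_;
};

// Assumed selection rule for this sketch only.
AutogradSketch GetInterpreterSketch(bool lazy_mode, bool inputs_are_global) {
  std::shared_ptr<InterpreterSketch> internal;
  if (lazy_mode) {
    internal = std::make_shared<LazySketch>();
  } else if (inputs_are_global) {
    internal = std::make_shared<EagerGlobalSketch>();
  } else {
    internal = std::make_shared<EagerLocalSketch>();
  }
  return AutogradSketch(std::move(internal));
}
```

Callers only ever see the wrapper type, so autograd handling stays in one place regardless of which execution mode was selected.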

3.2 Apply

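From the above we know that EagerLocalInterpreter, EagerGlobalInterpreter, and LazyInterpreter are each wrapped in an AutogradInterpreter shell, and the Apply call is triggered through AutogradInterpreter. As the name suggests, AutogradInterpreter is mainly concerned with autograd: in eager mode, for each forward op node it inserts the corresponding node used to compute grad in the backward pass.

Below, Apply is explained using the most common mode, Eager Mode. In Eager Mode (whether eager local or eager global, which older versions call eager consistent), execution actually reaches the Apply method of EagerInterpreter (https://github.com/Oneflow-In…):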

Maybe<void> EagerInterpreter::Apply(const OpExpr& op_expr, const TensorTuple& inputs,
                                    TensorTuple* outputs, const OpExprInterpContext& ctx) const {
#define APPLY_IF(op_type)                                              \
  if (const auto* op = dynamic_cast<const op_type##Expr*>(&op_expr)) { \
    return ApplyImpl(*op, inputs, outputs, ctx);                       \
  }

  APPLY_IF(UserOp);
  APPLY_IF(VariableOp);
  APPLY_IF(CastToLocalOp);
  APPLY_IF(CastFromLocalOp);
  APPLY_IF(GlobalToGlobalOp);
  APPLY_IF(CastToGlobalOp);
  APPLY_IF(CastFromGlobalOp);
  APPLY_IF(DistributeSplitOp);
  APPLY_IF(DistributeCloneOp);
  APPLY_IF(DistributeConcatOp);
  APPLY_IF(DistributeAddOp);
  APPLY_IF(FunctionOp);
  APPLY_IF(SelectTopNOp);
#undef APPLY_IF

  OF_UNIMPLEMENTED() << "The type " << op_expr.op_type_name()
                     << " has not been supported in EagerInterpreter::Apply.";
}

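Here the APPLY_IF macro adds a branch for each op type by applying dynamic_cast to op_expr, targeting the Expr subclass of the corresponding op. For most users the ops in use are of type UserOp, so execution actually goes into this branch: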

if (const auto* op = dynamic_cast<const UserOpExpr*>(&op_expr)) {
  return ApplyImpl(*op, inputs, outputs, ctx);
}
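
The macro-based type dispatch above can be reproduced in a small standalone program. The three toy Expr types and the string-returning ApplyImpl overloads below are invented for illustration; they reuse the UserOpExpr and VariableOpExpr names only so that the op_type##Expr token pasting reads the same way as in the original code.

```cpp
#include <cassert>
#include <string>

// Toy class hierarchy; a virtual destructor makes the base polymorphic,
// which dynamic_cast requires.
struct OpExpr {
  virtual ~OpExpr() = default;
};
struct UserOpExpr final : OpExpr {};
struct VariableOpExpr final : OpExpr {};
struct UnknownOpExpr final : OpExpr {};

// One overload per supported concrete type.
std::string ApplyImpl(const UserOpExpr&) { return "user"; }
std::string ApplyImpl(const VariableOpExpr&) { return "variable"; }

std::string ApplySketch(const OpExpr& op_expr) {
// op_type##Expr pastes the tokens, so APPLY_IF(UserOp) tests for UserOpExpr.
#define APPLY_IF(op_type)                                              \
  if (const auto* op = dynamic_cast<const op_type##Expr*>(&op_expr)) { \
    return ApplyImpl(*op);                                             \
  }

  APPLY_IF(UserOp);
  APPLY_IF(VariableOp);
#undef APPLY_IF

  // Reached only when no branch matched the dynamic type.
  return "unimplemented";
}
```

Each APPLY_IF line becomes one if block; the first successful cast selects the ApplyImpl overload through ordinary overload resolution on the now-known static type.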

Next, look at EagerLocalInterpreter::ApplyImpl (https://github.com/Oneflow-In…):

Maybe<void> EagerLocalInterpreter::ApplyImpl(const UserOpExpr& op_expr, const TensorTuple& inputs,
                                             TensorTuple* outputs,
                                             const OpExprInterpContext& ctx) const {
  return NaiveInterpret(op_expr, inputs, outputs, ctx);
}

Its final implementation is NaiveInterpret (https://github.com/Oneflow-In…).

3.3 NaiveInterpret

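Put simply, NaiveInterpret does four things:

Check that the input tensors are on the same device
Create the output tensors
Infer and check shape/stride/dtype for the output tensors
Build the op execution instruction and dispatch it to the VM

The simplified code is as follows: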

Maybe<void> NaiveInterpret(const UserOpExpr& user_op_expr, const TensorTuple& inputs,
                           const Symbol<Device>& default_device, TensorTuple* outputs,
                           const OpExprInterpContext& ctx) {
  const auto& attrs = ctx.attrs;
  // Check that the input tensors are on the same device
  ...

  // Infer the devices of the output tensors
  if (!user_op_expr.has_device_and_stream_infer_fn()) {
    stream = JUST(GetDefaultStreamByDevice(default_device));
    for (int i = 0; i < outputs->size(); i++) {
      auto* tensor_impl = JUST(TensorImpl4Tensor(outputs->at(i)));
      *JUST(tensor_impl->mut_device()) = default_device;
    }
  } else {
    need_check_mem_case = false;
    stream = JUST(user_op_expr.InferDeviceAndStream(attrs, inputs, outputs));
  }

  // Infer the shapes and dtypes of the output tensors
  const auto& device_tag = stream->device()->type();
  JUST(user_op_expr.InferPhysicalTensorDesc(
      attrs, device_tag,
      [&](int32_t i) -> const TensorMeta* {
        return CHECK_JUST(TensorImpl4Tensor(inputs[i]))->mut_tensor_meta();
      },
      [&](int32_t i) -> TensorMeta* {
        // using thread_local TensorMeta pointer if inplace.
        // using tensor_impl TensorMeta pointer if not inplace.
        return output_tensor_metas->at(i);
      }));

  // Initialize eager_blob_object for the output tensors
  for (int i = 0; i < output_eager_blob_objects->size(); i++) {
    auto* tensor_impl = JUST(TensorImpl4Tensor(outputs->at(i)));
    if (!output_eager_blob_objects->at(i)) {
      if (!JUST(user_op_expr.SupportNonContiguous())) {
        std::shared_ptr<Stride> stride(new Stride(*tensor_impl->shape()));
        tensor_impl->mut_tensor_meta()->set_stride(stride);
      }
      const auto& dep_object = NewLocalDepObject();
      JUST(tensor_impl->InitEagerBlobObject(dep_object));
      output_eager_blob_objects->at(i) = JUST(tensor_impl->eager_blob_object());
    } else {
      // output i is inplaced.
      // check thread_local TensorMeta and tensor_impl TensorMeta.
      CHECK_OR_RETURN(tensor_impl->tensor_meta()->shape() == output_tensor_metas->at(i)->shape());
      CHECK_OR_RETURN(tensor_impl->tensor_meta()->dtype() == output_tensor_metas->at(i)->dtype());
    }
  }

  // Fetch the kernel from user_op_expr
  const auto& kernel = JUST(user_op_expr.MutKernel4Stream(stream));
  kernel->set_need_check_mem_case(need_check_mem_case);

  for (int64_t index : kernel->output_tuple_indexes4mut2_obns()) {
    output_eager_blob_objects->at(index)->set_is_shape_synced(false);
  }

  // Dispatch the kernel to the VM, where it waits to be scheduled and executed
  JUST(PhysicalRun([&](InstructionsBuilder* builder) -> Maybe<void> {
    return builder->Call(kernel, input_eager_blob_objects, output_eager_blob_objects, ctx, stream);
  }));
  return Maybe<void>::Ok();
}

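PhysicalRun accepts a lambda functor as its argument, here one that calls InstructionsBuilder->Call. That method takes the kernel, the input/output eager blob objects, and the kernel's execution context as arguments. Call actually builds the OpCall instruction and finally dispatches it to the VM's instruction list, where it waits for the VM to schedule and execute it.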
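
The build-then-dispatch idea can be sketched in a few lines: a callback fills a builder with instructions, and only afterwards is the batch handed to a queue that runs later. InstructionSketch, InstructionsBuilderSketch, VmSketch, and PhysicalRunSketch are hypothetical names, execution here is synchronous and single-threaded, and error handling is reduced to a bool; none of this mirrors the actual OneFlow VM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical instruction: which kernel to run and on which data.
struct InstructionSketch {
  std::string kernel_name;
  std::vector<float> inputs;
};

// Hypothetical builder: Call() only records an instruction, it does not run it.
class InstructionsBuilderSketch {
 public:
  bool Call(const std::string& kernel_name, std::vector<float> inputs) {
    if (kernel_name.empty()) { return false; }
    instructions_.push_back(InstructionSketch{kernel_name, std::move(inputs)});
    return true;
  }
  std::vector<InstructionSketch> Release() {
    std::vector<InstructionSketch> released;
    released.swap(instructions_);
    return released;
  }

 private:
  std::vector<InstructionSketch> instructions_;
};

// Hypothetical VM: keeps a pending list and executes it when asked.
class VmSketch {
 public:
  void Receive(std::vector<InstructionSketch> batch) {
    for (InstructionSketch& instruction : batch) { pending_.push_back(std::move(instruction)); }
  }
  std::size_t pending() const { return pending_.size(); }
  // Runs every pending instruction; only "relu" is understood in this sketch,
  // other kernel names pass their inputs through unchanged.
  std::vector<std::vector<float>> RunAll() {
    std::vector<std::vector<float>> results;
    for (const InstructionSketch& instruction : pending_) {
      std::vector<float> out = instruction.inputs;
      if (instruction.kernel_name == "relu") {
        for (float& v : out) { v = std::max(v, 0.0f); }
      }
      results.push_back(std::move(out));
    }
    pending_.clear();
    return results;
  }

 private:
  std::vector<InstructionSketch> pending_;
};

// Lets the callback fill a fresh builder, then hands the batch to the VM.
bool PhysicalRunSketch(VmSketch* vm,
                       const std::function<bool(InstructionsBuilderSketch*)>& build) {
  InstructionsBuilderSketch builder;
  if (!build(&builder)) { return false; }
  vm->Receive(builder.Release());
  return true;
}
```

Separating recording from execution is what allows the caller to return immediately after dispatch, leaving the actual computation to whenever the VM processes its pending list.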

References

OneFlow study notes: Op registration (OneFlow 学习笔记：Op 注册) (https://mp.weixin.qq.com/s/eF…)
From Functor to OpExprInterpreter (从 Functor 到 OpExprInterpreter)
https://github.com/Oneflow-In…
https://zhuanlan.zhihu.com/p/…

(This article is published with permission; original: https://segmentfault.com/a/11…)

You are welcome to download and try the latest release, OneFlow v0.8.0:
https://github.com/Oneflow-In…
