Deep Learning: OneFlow Source Code Analysis: The Tensor Type System and Local Tensor

Written by | Zheng Jianhua (郑建华)
Updated by | Zhao Luyang (赵露阳)

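Tensors and ops are the most fundamental building blocks of a neural network model: ops are the nodes of the model, and tensors are the edges that connect the nodes. However, building a tensor is not as simple as constructing an object. At the very least, the following must be considered: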

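Support for node-local local tensors as well as distributed global tensors;
Support for both eager and lazy execution modes;
Support for different data types, including float, double, int, and so on;
Support for different devices.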

1 Ways to create a tensor

As in PyTorch, there are two main ways to create a tensor in OneFlow: Tensor and tensor. Both ultimately create OneFlow's internal C++ Tensor object, which corresponds to the flow.Tensor type at the Python level.

1.1 Tensor

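The Python-level Tensor is introduced in tensor.py (https://github.com/Oneflow-In...). It is a Tensor type object registered through the Python C API, and this object is defined and returned in MakeTensorType (https://github.com/Oneflow-In...).

In MakeTensorType, the Tensor object is created mainly through PyTensorObject_init: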

    static int PyTensorObject_init(PyObject* self, PyObject* args, PyObject* kwargs) {
      HANDLE_ERRORS
      auto* temp = functional::_legacy_tensor_ctor(NULL, args, kwargs);
      if (PyErr_Occurred()) { throw py::error_already_set(); }
      auto* _self = (PyTensorObject*)self;
      _self->data = PyTensor_Unpack(temp);
      _self->data->set_pyobject(self);
      // reset temp data to prevent clearing the pyobject
      // when the temp is deallocated
      ((PyTensorObject*)temp)->data.reset();
      Py_XDECREF(temp);
      return 0;
      END_HANDLE_ERRORS_RET(-1)
    }

The functional::_legacy_tensor_ctor function creates OneFlow's internal C++ Tensor object, oneflow::one::Tensor, which is bound as data to the Python Tensor type. MakeTensorType also registers many C++ methods for Tensor through PyMethodDef (https://github.com/Oneflow-In...), for example:

    static PyMethodDef PyTensorObject_methods[] = {
        {"storage_offset", PyTensorObject_storage_offset, METH_NOARGS, NULL},
        {"stride", PyTensorObject_stride, METH_NOARGS, NULL},
        {"is_contiguous", PyTensorObject_is_contiguous, METH_NOARGS, NULL},
        {"contiguous", PyTensorObject_contiguous, METH_NOARGS, NULL},
        {"contiguous_", PyTensorObject_contiguous_, METH_NOARGS, NULL},
        {"pin_memory", PyTensorObject_pin_memory, METH_NOARGS, NULL},
        {"is_pinned", PyTensorObject_is_pinned, METH_NOARGS, NULL},
        {"requires_grad_", (PyCFunction)PyTensorObject_requires_grad_, METH_VARARGS | METH_KEYWORDS,
         NULL},
        {"retain_grad", PyTensorObject_retain_grad, METH_NOARGS, NULL},
        {"detach", PyTensorObject_detach, METH_NOARGS, NULL},
        {"clone", PyTensorObject_clone, METH_NOARGS, NULL},
        {"zero_", PyTensorObject_zero_, METH_NOARGS, NULL},
        {"register_hook", PyTensorObject_register_hook, METH_O, NULL},
        {"_register_post_grad_accumulation_hook", PyTensorObject__register_post_grad_accumulation_hook,
         METH_O, NULL},
        {"global_id", PyTensorObject_global_id, METH_NOARGS, NULL},
        {"check_meta_consistency", PyTensorObject_check_meta_consistency, METH_NOARGS, NULL},
        {"to_numpy", PyTensorObject_to_numpy, METH_NOARGS, NULL},
        {"type", (PyCFunction)PyTensorObject_type, METH_VARARGS | METH_KEYWORDS, NULL},
        // ...
    };

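In addition, at the Python level, RegisterMethods (https://github.com/Oneflow-In...) registers some Python-implemented Tensor methods or properties (such as tensor.numpy) for Tensor. When the OneFlow package is initialized, RegisterMethod4Class (https://github.com/Oneflow-In...) completes the registration of these Python methods and properties. The call flow of RegisterMethod4Class is as follows:

Compared with Python implementations, the Tensor methods and properties implemented in C++ usually perform better.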
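The Python-side registration can be pictured as attaching ordinary functions to an already existing class. The following is a minimal sketch of that general technique only, not OneFlow's actual RegisterMethod4Class code; every name in it (`FakeTensor`, `register_methods_for_class`, `_numpy`, `_ndim`) is hypothetical.

```python
class FakeTensor:
    """Stand-in for a type that was created elsewhere (hypothetical)."""

    def __init__(self, data):
        self.data = list(data)


def _numpy(self):
    # Hypothetical Python-implemented method: returns a copy as a plain list.
    return list(self.data)


def _ndim(self):
    # Hypothetical Python-implemented property getter for a flat container.
    return 1


def register_methods_for_class(cls, methods, properties):
    """Attach Python functions to `cls` as methods and read-only properties."""
    for name, fn in methods.items():
        setattr(cls, name, fn)
    for name, getter in properties.items():
        setattr(cls, name, property(getter))


register_methods_for_class(FakeTensor, {"numpy": _numpy}, {"ndim": _ndim})

t = FakeTensor([1, 2, 3])
print(t.numpy())  # [1, 2, 3]
print(t.ndim)     # 1
```

Because the functions are attached to the class rather than to instances, every object of that type sees them immediately, including objects created before registration.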

1.2 The tensor function

Tensor is a type, whereas tensor is a function. The flow.tensor function is defined in `oneflow/api/python/functional/tensor_api.yaml`:

    - name: "tensor"
      signature: [
          "Tensor (PyObject* data, *, DataType dtype=None, Device device=None,
          Bool requires_grad=False, Bool pin_memory=False) => TensorWithData",
          "Tensor (PyObject* data, *, DataType dtype=None, Placement placement,
          SbpList sbp, Bool requires_grad=False) => GlobalTensorWithData",
        ]
      bind_python: True

Its C++ implementation is in tensor_api.yaml.pybind.cpp, a file generated automatically at build time.

As the function signatures show, flow.tensor() has two overloads:

TensorWithData
GlobalTensorWithData

They are used to construct local tensors and global tensors, respectively. As with Tensor above, flow.tensor also returns OneFlow's internal oneflow::one::Tensor object (bound to the Python Tensor object).
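How the two signatures divide the work can be sketched in pure Python. This is only an illustration of the argument rules visible in the YAML signatures (placement and sbp have no defaults in the global signature, while device and pin_memory appear only in the local one). The real dispatch happens in generated C++ binding code, and `tensor_sketch` with its return tuples is hypothetical.

```python
def tensor_sketch(data, *, dtype=None, device=None, placement=None, sbp=None,
                  requires_grad=False, pin_memory=False):
    """Return the name of the overload that the given keyword arguments select."""
    if placement is not None or sbp is not None:
        # Global signature: placement and sbp carry no defaults, so both are needed.
        if placement is None or sbp is None:
            raise TypeError("a global tensor needs both placement and sbp")
        # device and pin_memory are not part of the global signature.
        if device is not None or pin_memory:
            raise TypeError("device and pin_memory belong to the local signature")
        return ("GlobalTensorWithData", data)
    return ("TensorWithData", data)


print(tensor_sketch([1, 2, 3])[0])
print(tensor_sketch([1, 2, 3], placement="hypothetical", sbp=["hypothetical"])[0])
```

The first call selects the local overload and the second the global one. Passing only one of placement or sbp is rejected in this sketch.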

1.3 Two ways to build a tensor manually

As in PyTorch, there are two commonly used ways to create a tensor in OneFlow:

flow.Tensor
flow.tensor

Examples:

    import oneflow as flow
    import numpy as np

    flow.tensor([[1., -1.], [1., -1.]])
    # tensor([[ 1., -1.],
    #         [ 1., -1.]], dtype=oneflow.float32)

    flow.tensor(np.array([[1, 2, 3], [4, 5, 6]]))
    # tensor([[ 1, 2, 3],
    #         [ 4, 5, 6]], dtype=oneflow.int64)

    flow.Tensor([[1, 2, 3], [4, 5, 6]])

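In most cases (the eager mode that resembles PyTorch), an ordinary tensor (local tensor) can be created by specifying parameters such as device, dtype, and shape.

In a few cases (such as the OneFlow-specific eager global and lazy modes) a global tensor is needed. It can be created directly by specifying sbp and placement, or an ordinary tensor can be converted to a global tensor with tensor.to_global. See:

oneflow.tensor (https://oneflow.readthedocs.i...)
global tensor (https://docs.oneflow.org/mast...)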

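2 OneFlow's tensor type system

The OneFlow-internal C++ Tensor object introduced above is defined in oneflow/core/framework/tensor.h and is an abstract Tensor type.

Of its subclasses, LocalTensor is the ordinary Tensor from a single-device perspective (similar to PyTorch's Tensor), while GlobalTensor is the Tensor from a global perspective that is unique to OneFlow (typically used in eager global mode or lazy mode). Tensor uses the Bridge pattern: each Tensor subclass holds a TensorImpl field internally, which abstracts the actual implementation of the Tensor: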
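The Bridge arrangement can be sketched in a few lines of Python: one class forms the interface that callers see, and it forwards to an interchangeable implementation object. This is a simplified illustration under stated assumptions, not the layout of tensor.h. All class names ending in `Sketch` are hypothetical, and a single wrapper class is used here instead of separate local and global subclasses.

```python
from abc import ABC, abstractmethod


class TensorImplSketch(ABC):
    """Implementation side of the bridge (hypothetical, heavily simplified)."""

    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = dtype

    @abstractmethod
    def is_local(self):
        """Report whether this implementation describes a local tensor."""


class LocalImplSketch(TensorImplSketch):
    def __init__(self, shape, dtype, device="cpu"):
        super().__init__(shape, dtype)
        self.device = device

    def is_local(self):
        return True


class GlobalImplSketch(TensorImplSketch):
    def __init__(self, shape, dtype, placement, sbp):
        super().__init__(shape, dtype)
        self.placement = placement
        self.sbp = sbp

    def is_local(self):
        return False


class TensorSketch:
    """Abstraction side: callers use this interface, the work goes to the impl."""

    def __init__(self, impl):
        self._impl = impl

    @property
    def shape(self):
        return self._impl.shape

    @property
    def is_local(self):
        return self._impl.is_local()


local = TensorSketch(LocalImplSketch((2, 3), "int64"))
global_ = TensorSketch(
    GlobalImplSketch((2, 3), "int64", placement="hypothetical", sbp=["hypothetical"]))
print(local.shape, local.is_local)      # (2, 3) True
print(global_.shape, global_.is_local)  # (2, 3) False
```

The benefit of this split is that code written against the interface does not need to know which implementation sits behind it.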

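3 Constructing a local tensor

Let us take flow.tensor([[1,2,3],[4,5,6]]) as an example and look at how a tensor is constructed. The main flow is as follows:

In this example, because the tensor is created with the flow.tensor method (and is an ordinary local tensor), the TensorWithData method defined in `oneflow/api/python/functional/tensor_api.yaml` is used. Its implementation is TensorWithDataFunctor in `oneflow/api/python/functional/tensor_api.cpp`: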

    class TensorWithDataFunctor {
     public:
      Maybe<Tensor> operator()(PyObject* data, const Optional<Symbol<DType>>& dtype,
                               const Optional<Symbol<Device>>& device, const bool requires_grad,
                               const bool pin_memory) const {
        ...
        if (PyTensor_Check(data)) {
          // Throw warnings like pytorch.
          auto ret = PyErr_WarnEx(
              PyExc_UserWarning,
              "To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() "
              "or sourceTensor.clone().detach().requires_grad_(True), rather than "
              "oneflow.tensor(sourceTensor).",
              1);
          if (ret != 0) { return Error::RuntimeError(); }
          const auto& other = PyTensor_Unpack(data);
          return MakeTensorFromOtherTensor(other, dtype, device, requires_grad, pin_memory);
        } else {
          // Make tensor from python sequence or numpy array.
          return MakeLocalTensorFromData(data, dtype, device, requires_grad, pin_memory);
        }
      }
    };

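Because the data passed in here is a Python list object, MakeLocalTensorFromData is ultimately called, and most of the tensor-creation logic lives in this function. It makes extensive use of Python and NumPy interfaces to check the data type of the PyObject and to obtain the Shape (https://github.com/Oneflow-In...) and DataType (https://github.com/Oneflow-In...). If the user does not specify a device, it defaults to the CPU device (https://github.com/Oneflow-In...).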
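What "obtain the shape and data type, then default the device" amounts to can be illustrated with a small stdlib-only sketch. It is not how MakeLocalTensorFromData is written (that function relies on Python and NumPy C interfaces). The helper names and the dtype rule below are simplifying assumptions, chosen only so that the results match the float32 and int64 outputs of the examples in section 1.3.

```python
def infer_shape(data):
    """Shape of a rectangular nested list/tuple; a scalar has shape ()."""
    if not isinstance(data, (list, tuple)):
        return ()
    if not data:
        return (0,)
    inner = infer_shape(data[0])
    for item in data[1:]:
        if infer_shape(item) != inner:
            raise ValueError("nested sequence is not rectangular")
    return (len(data),) + inner


def iter_leaves(data):
    """Yield the scalar elements of a nested list/tuple in order."""
    if isinstance(data, (list, tuple)):
        for item in data:
            yield from iter_leaves(item)
    else:
        yield data


def infer_dtype(data):
    """Simplified, hypothetical rule: all ints -> int64, ints/floats -> float32."""
    kinds = {type(x) for x in iter_leaves(data)}
    if kinds <= {int}:
        return "int64"
    if kinds <= {int, float}:
        return "float32"
    raise TypeError(f"unsupported element types: {kinds}")


def describe(data, dtype=None, device=None):
    """Collect the metadata needed before any memory is allocated."""
    return {
        "shape": infer_shape(data),
        "dtype": dtype or infer_dtype(data),
        "device": device or "cpu",  # default device when the caller gives none
    }


print(describe([[1, 2, 3], [4, 5, 6]]))
# {'shape': (2, 3), 'dtype': 'int64', 'device': 'cpu'}
```

Only metadata is computed at this stage. Allocation and the actual copy are separate steps.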

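After that, the main calls are to EmptyFunctor (https://github.com/Oneflow-In...) and SwitchCopyLocalTensorFromUntypedArray (https://github.com/Oneflow-In...). The former allocates memory for the tensor and the latter copies the data. Both steps are carried out through virtual machine instructions: EmptyFunctor goes through the ordinary OpCall instruction, while CopyLocalTensorFromUntypedArray uses the `AccessBlobByCallback/SyncAccessBlobByCallback` instruction, depending on whether a synchronous copy is required.

Why go through virtual machine instructions? Whether allocating memory or copying data, the operations differ across devices such as CPU and CUDA. As seen in the earlier discussion of Op/Kernel, in OneFlow the execution of all dynamic-graph and static-graph tasks, op/kernel execution in eager mode, allocation and release of host and device memory, devices, streams, and so on are all managed uniformly by the virtual machine.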

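3.1 Allocating memory: EmptyFunctor

Operations such as matmul and relu (when inplace=false) also create output tensors during execution. The earlier discussion of relu focused on the computation logic of the op and kernel and left out the tensor-related details.

Here, only an empty tensor object needs to be constructed first, with no other computation, so this is an Empty operation. The kernel for the Empty op, EmptyKernel (https://github.com/Oneflow-In...), has no substantive computation logic. It only creates an empty tensor from the shape, dtype, and device information, and then waits for the actual data to be copied from memory into this empty tensor, which completes the creation of the tensor.

Like other functors, EmptyFunctor is ultimately dispatched to the corresponding interpreter to be interpreted and executed. Because this is a local tensor in eager mode, EmptyFunctor ends up in the eager local interpreter and is handled by the NaiveInterpret (https://github.com/Oneflow-In...) method. The flow is as follows: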

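First, an EagerLocalTensorImpl (https://github.com/Oneflow-In...) object is constructed to hold the tensor result. At this point it is only a shell; no storage has been allocated for the tensor's data yet.
Next, EagerBlobObject (https://github.com/Oneflow-In...) and TensorStorage (https://github.com/Oneflow-In...) are initialized, so that the main fields of the tensor are essentially in place.
Then the OpCall instruction is constructed and submitted to the virtual machine through PhysicalRun (https://github.com/Oneflow-In...), where it waits for the VM to schedule and execute it.

The instruction policy for OpCall ultimately enters oneflow/core/vm/op_call_instruction_policy.cpp. There, the Prepare method performs the actual memory allocation for the TensorStorage through AllocateOutputBlobsMemory, and the Compute method launches the actual kernel (the one for the empty op).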
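The split between building a shell, allocating in Prepare, and running the kernel in Compute can be modelled with a toy instruction queue. This is a rough sketch of the idea only: `OpCallSketch`, `TinyVM`, and `empty_kernel` are hypothetical names, the storage is a plain dict, and nothing here reflects how OneFlow's virtual machine schedules work across devices or streams.

```python
from collections import deque


class OpCallSketch:
    """Hypothetical stand-in for an OpCall-style instruction."""

    def __init__(self, storage, nbytes, kernel):
        self.storage = storage  # dict acting as the tensor's storage shell
        self.nbytes = nbytes
        self.kernel = kernel

    def prepare(self):
        # Memory is allocated here, when the VM is about to run the instruction.
        if self.storage["buffer"] is None:
            self.storage["buffer"] = bytearray(self.nbytes)

    def compute(self):
        # Launch the kernel on the now-allocated output buffer.
        self.kernel(self.storage["buffer"])


class TinyVM:
    """Runs submitted instructions in order: prepare first, then compute."""

    def __init__(self):
        self.pending = deque()

    def submit(self, instruction):
        self.pending.append(instruction)

    def run_all(self):
        while self.pending:
            instruction = self.pending.popleft()
            instruction.prepare()
            instruction.compute()


def empty_kernel(buffer):
    """An Empty-style kernel: no computation on the output."""


storage = {"buffer": None}  # the shell exists, the data does not
vm = TinyVM()
vm.submit(OpCallSketch(storage, nbytes=6 * 8, kernel=empty_kernel))
allocated_before_run = storage["buffer"] is not None
vm.run_all()
print(allocated_before_run, len(storage["buffer"]))  # False 48
```

Submitting the instruction does not allocate anything. The buffer for six 8-byte elements appears only once the queue is run.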

3.2 Copying data: SwitchCopyLocalTensorFromUntypedArray

SwitchCopyLocalTensorFromUntypedArray is the name of the function produced by expanding the MAKE_SWITCH_ENTRY (https://github.com/Oneflow-In...) macro. The expanded code is shown below. It actually calls CopyLocalTensorFromUntypedArray (https://github.com/Oneflow-In...).

    template<typename... Args>
    static Maybe<void> SwitchCopyLocalTensorFromUntypedArray(
        const std::tuple<DataType>& switch_tuple, Args&& ... args) {
      static const std::map<std::tuple<DataType>, std::function<Maybe<void>(Args && ...)>>
          case_handlers {
              {SwitchCase(DataType::kFloat),
               [](Args&&... args) {
                 return CopyLocalTensorFromUntypedArray<float>(std::forward<Args>(args)...);
               }},
               // ...
          };
      return case_handlers.at(switch_tuple)(std::forward<Args>(args)...);
    };
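The expanded macro is a lookup table from a runtime data type to a handler specialised for that type. The same technique can be sketched in Python with a dict of copy functions. The names `make_copier`, `CASE_HANDLERS`, and `switch_copy`, as well as the dtype strings, are hypothetical, and the `struct` module stands in for the typed copy.

```python
import struct


def make_copier(fmt):
    """Build a copy function specialised for one element format code."""

    def copy_from_list(buffer, values):
        # "<" selects little-endian layout with standard element sizes.
        struct.pack_into(f"<{len(values)}{fmt}", buffer, 0, *values)

    return copy_from_list


# One handler per supported data type, chosen at run time by key.
CASE_HANDLERS = {
    "float32": make_copier("f"),
    "float64": make_copier("d"),
    "int32": make_copier("i"),
    "int64": make_copier("q"),
}


def switch_copy(dtype, buffer, values):
    # Like case_handlers.at(...), an unknown key raises instead of being ignored.
    return CASE_HANDLERS[dtype](buffer, values)


buf = bytearray(2 * 4)
switch_copy("int32", buf, [1, 2])
print(struct.unpack("<2i", buf))  # (1, 2)
```

The caller picks the data type by value at run time, while each handler is written for exactly one element type.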

The CopyLocalTensorFromUntypedArray method is as follows:

    template<typename T>
    Maybe<void> CopyLocalTensorFromUntypedArray(const std::shared_ptr<Tensor>& tensor,
                                                PyObject* array) {
      return CopyBetweenLocalTensorAndNumpy<T>(tensor, array, CopyFromNumpyArray, "mut",
                                               /*block_host_until_done=*/false);
    }

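Internally it calls the CopyBetweenLocalTensorAndNumpy method.

CopyBetweenLocalTensorAndNumpy

As the name suggests, this method is mainly used to copy data between numpy and tensor. Its third argument, CopyFromNumpyArray, is a callback function that copies memory between the array and the tensor (blob), mainly through SyncAutoMemcpy: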

    void CopyFromNumpyArray(ep::Stream* stream,
                            const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object,
                            const NumPyArrayPtr& array_ptr) {
      SyncAutoMemcpy(stream, eager_blob_object->mut_dptr(), array_ptr.data(),
                     eager_blob_object->ByteSizeOfBlobBody(), eager_blob_object->mem_case(),
                     memory::MakeHostMemCase());
    }

Continuing with the CopyBetweenLocalTensorAndNumpy (https://github.com/Oneflow-In...) method, the most critical part is:

    JUST(PhysicalRun([&](InstructionsBuilder* builder) -> Maybe<void> {
          return builder->AccessBlobByCallback(
              tensor,
              [array_ptr, Copy](ep::Stream* stream,
                                const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object) {
                Copy(stream, eager_blob_object, array_ptr);
              },
              modifier);
        }));

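An AccessBlobByCallback instruction is built through InstructionsBuilder. Its arguments are the empty tensor created above through EmptyFunctor, the callback function pointer together with its arguments, and the modifier (the string "mut" indicates that the blob may be modified).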

AccessBlobByCallback

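As with OpCall, when InstructionsBuilder calls AccessBlobByCallback it also constructs the corresponding VM instruction policy, AccessBlobArgCbInstructionPolicy, and dispatches it to the VM, where it waits to be scheduled and executed: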

    template<typename T>
    Maybe<void> InstructionsBuilder::AccessBlobByCallback(
        const T tensor,
        const std::function<void(ep::Stream*, const std::shared_ptr<vm::EagerBlobObject>&)>& callback,
        const std::string& modifier) {
      const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object = JUST(tensor->eager_blob_object());
      Symbol<Device> device = JUST(GetDevice(tensor));
      ...
      Symbol<Stream> stream = JUST(GetDefaultStreamByDevice(device));
      JUST(SoftSyncStream({eager_blob_object}, stream));
      auto instruction = intrusive::make_shared<vm::Instruction>(
          // Never replace `stream` with producer_stream or last_used_stream.
          JUST(Singleton<VirtualMachine>::Get()->GetVmStream(stream)),
          std::make_shared<vm::AccessBlobArgCbInstructionPolicy>(eager_blob_object, callback,
                                                                 modifier));
      instruction_list_->EmplaceBack(std::move(instruction));
      return Maybe<void>::Ok();
    }

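When this AccessBlobArgCbInstructionPolicy instruction is actually executed, the callback is invoked in the instruction's Compute (https://github.com/Oneflow-In...) method to copy data between the tensor's blob and the numpy ndarray. At that point the copy is finished and the creation of the tensor by flow.tensor is complete.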
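The deferred-callback idea can be modelled as follows: the builder records an instruction that carries a closure over the source data, and the copy happens only when the instruction's compute step runs. This is a loose sketch under stated assumptions. `AccessBlobByCallbackSketch` and `physical_run_sketch` are hypothetical names, streams and devices are left out, and treating every modifier other than "mut" as read-only is an assumption made for illustration.

```python
class AccessBlobByCallbackSketch:
    """Hypothetical instruction: hands a blob to a user callback when computed."""

    def __init__(self, blob, callback, modifier):
        self.blob = blob
        self.callback = callback
        self.modifier = modifier

    def compute(self):
        if self.modifier == "mut":
            view = self.blob  # the callback may write into the blob
        else:
            view = memoryview(self.blob).toreadonly()  # writes raise TypeError
        self.callback(view)


def physical_run_sketch(build):
    """Let `build` record instructions, then execute them in order."""
    instructions = []
    build(instructions)
    for instruction in instructions:
        instruction.compute()


source = bytes([1, 2, 3, 4])  # plays the role of the array's bytes
blob = bytearray(4)           # plays the role of the empty tensor's storage


def copy_from_source(dst):
    # The closure captures `source`, much like the lambda captures array_ptr.
    dst[:] = source


physical_run_sketch(
    lambda instructions: instructions.append(
        AccessBlobByCallbackSketch(blob, copy_from_source, "mut")))
print(list(blob))  # [1, 2, 3, 4]
```

Recording the instruction does not touch the blob. The bytes arrive only when the recorded instruction is computed.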

(This article is published with permission. Original: https://segmentfault.com/a/11...)

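References

OneFlow source code: https://github.com/Oneflow-In...
OneFlow Source Code Analysis: Op, Kernel, and Interpreter
OneFlow Source Code Analysis: Execution of Operator Instructions in the Virtual Machine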
