Author | 郑建华
Updated by | 赵露阳
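Tensors and ops are the most basic building blocks of a neural network model: ops are the nodes of the model, and tensors are the edges that connect them. Building a tensor, however, is not as simple as constructing an object. At a minimum, the following must be considered:
- Support for both node-local local tensors and distributed global tensors;
- Support for the eager and lazy execution modes;
- Support for different data types, including float, double, int, and so on;
- Support for different devices.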
1 Ways to create a tensor
As in PyTorch, OneFlow offers two main ways to create a tensor: Tensor and tensor. Both ultimately create OneFlow's internal C++ Tensor object, which corresponds to the flow.Tensor type in the Python layer.
1.1 Tensor
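The Python-layer Tensor is introduced in tensor.py (https://github.com/Oneflow-In...). It is a Tensor type object registered through the Python C API, and it is defined and returned in MakeTensorType (https://github.com/Oneflow-In...).
Inside MakeTensorType, the Tensor object is created mainly through PyTensorObject_init: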
    static int PyTensorObject_init(PyObject* self, PyObject* args, PyObject* kwargs) {
      HANDLE_ERRORS
      auto* temp = functional::_legacy_tensor_ctor(NULL, args, kwargs);
      if (PyErr_Occurred()) { throw py::error_already_set(); }
      auto* _self = (PyTensorObject*)self;
      _self->data = PyTensor_Unpack(temp);
      _self->data->set_pyobject(self);
      // reset temp data to prevent clearing the pyobject
      // when the temp is deallocated
      ((PyTensorObject*)temp)->data.reset();
      Py_XDECREF(temp);
      return 0;
      END_HANDLE_ERRORS_RET(-1)
    }
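The functional::_legacy_tensor_ctor function creates OneFlow's internal C++ Tensor object, oneflow::one::Tensor, which is then bound to the Python Tensor type as its data member. MakeTensorType also registers many C++ methods for Tensor through PyMethodDef (https://github.com/Oneflow-In...), for example: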
    ods[] = {
        {"storage_offset", PyTensorObject_storage_offset, METH_NOARGS, NULL},
        {"stride", PyTensorObject_stride, METH_NOARGS, NULL},
        {"is_contiguous", PyTensorObject_is_contiguous, METH_NOARGS, NULL},
        {"contiguous", PyTensorObject_contiguous, METH_NOARGS, NULL},
        {"contiguous_", PyTensorObject_contiguous_, METH_NOARGS, NULL},
        {"pin_memory", PyTensorObject_pin_memory, METH_NOARGS, NULL},
        {"is_pinned", PyTensorObject_is_pinned, METH_NOARGS, NULL},
        {"requires_grad_", (PyCFunction)PyTensorObject_requires_grad_, METH_VARARGS | METH_KEYWORDS, NULL},
        {"retain_grad", PyTensorObject_retain_grad, METH_NOARGS, NULL},
        {"detach", PyTensorObject_detach, METH_NOARGS, NULL},
        {"clone", PyTensorObject_clone, METH_NOARGS, NULL},
        {"zero_", PyTensorObject_zero_, METH_NOARGS, NULL},
        {"register_hook", PyTensorObject_register_hook, METH_O, NULL},
        {"_register_post_grad_accumulation_hook", PyTensorObject__register_post_grad_accumulation_hook, METH_O, NULL},
        {"global_id", PyTensorObject_global_id, METH_NOARGS, NULL},
        {"check_meta_consistency", PyTensorObject_check_meta_consistency, METH_NOARGS, NULL},
        {"to_numpy", PyTensorObject_to_numpy, METH_NOARGS, NULL},
        {"type", (PyCFunction)PyTensorObject_type, METH_VARARGS | METH_KEYWORDS, NULL},
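In addition, the Python layer uses RegisterMethods (https://github.com/Oneflow-In...) to register some Tensor methods and properties that are implemented in Python (such as tensor.numpy). When the OneFlow package is initialized, RegisterMethod4Class (https://github.com/Oneflow-In...) performs the registration of these Python methods and properties. The call flow of RegisterMethod4Class is as follows:
[Figure: call flow of RegisterMethod4Class]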
Compared with the Python implementations, Tensor methods and properties implemented in C++ generally perform better.
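The Python-side registration described above can be pictured with a small, self-contained sketch. It does not use OneFlow: `FakeTensor`, `_numpy`, `_ndim`, and `register_methods` are hypothetical names standing in for the real Tensor type and RegisterMethods, and the sketch only illustrates the general idea of attaching Python functions and properties to an existing class when a package is initialized.

```python
class FakeTensor:
    """Hypothetical stand-in for a type whose core is implemented in C++."""

    def __init__(self, data):
        self._data = list(data)


def _numpy(self):
    # Python-implemented method; this toy version returns a plain list copy.
    return list(self._data)


def _ndim(self):
    # Python-implemented property; the toy tensor is always one-dimensional.
    return 1


def register_methods(cls):
    # Attach Python-implemented methods and properties to an existing class.
    cls.numpy = _numpy
    cls.ndim = property(_ndim)


# Done once, for example while the package is being imported.
register_methods(FakeTensor)

t = FakeTensor([1, 2, 3])
print(t.numpy(), t.ndim)
```

After registration, callers cannot tell from the call site whether a method came from the original class definition or was attached later.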
1.2 The tensor function
Tensor is a type, whereas tensor is a function. The flow.tensor function is defined in `oneflow/api/python/functional/tensor_api.yaml`:
    - name: "tensor"
      signature: [
          "Tensor (PyObject* data, *, DataType dtype=None, Device device=None, Bool requires_grad=False, Bool pin_memory=False) => TensorWithData",
          "Tensor (PyObject* data, *, DataType dtype=None, Placement placement, SbpList sbp, Bool requires_grad=False) => GlobalTensorWithData",
        ]
      bind_python: True
Its C++ implementation lives in tensor_api.yaml.pybind.cpp, a file that is generated automatically at build time.
The function signatures show that flow.tensor() has two overloads:
- TensorWithData
- GlobalTensorWithData
They construct a local tensor and a global tensor, respectively. Like Tensor above, flow.tensor also returns OneFlow's internal oneflow::one::Tensor object (bound to the Python Tensor object).
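How one Python-level name can resolve to two functors can be sketched as follows. This is a simplified illustration, not the generated pybind code: `select_overload` is a hypothetical helper, and the rule it applies (pick the global overload when both placement and sbp are given) is an assumption read off the two signatures above.

```python
def select_overload(**kwargs):
    """Return the name of the functor a call would resolve to (sketch only)."""
    has_placement = kwargs.get("placement") is not None
    has_sbp = kwargs.get("sbp") is not None

    if has_placement and has_sbp:
        # The global signature has no `device` parameter.
        if kwargs.get("device") is not None:
            raise TypeError("device is not part of the global signature")
        return "GlobalTensorWithData"

    if has_placement or has_sbp:
        # In the global signature both parameters lack default values.
        raise TypeError("placement and sbp must be given together")

    return "TensorWithData"


print(select_overload(dtype="float32"))
print(select_overload(placement="some-placement", sbp=["some-sbp"]))
```

The placeholder strings passed for placement and sbp are not real OneFlow objects; they only mark which keyword arguments are present.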
1.3 Two ways to build a tensor manually
As in PyTorch, there are two common ways to create a tensor in OneFlow:
- flow.Tensor
- flow.tensor
Example:
    import oneflow
    import numpy as np

    oneflow.tensor([[1., -1.], [1., -1.]])
    # tensor([[ 1., -1.],
    #         [ 1., -1.]], dtype=oneflow.float32)
    oneflow.tensor(np.array([[1, 2, 3], [4, 5, 6]]))
    # tensor([[ 1, 2, 3],
    #         [ 4, 5, 6]], dtype=oneflow.int64)
    oneflow.Tensor([[1,2,3],[4,5,6]])
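In most cases (the eager mode that resembles PyTorch), an ordinary tensor (local tensor) can be created by specifying parameters such as device, dtype, and shape.
In a few cases (such as OneFlow's own eager global and lazy modes) a global tensor is needed. It can be created directly by specifying sbp and placement, or an ordinary tensor can be converted into a global tensor with tensor.to_global. See: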
- oneflow.tensor (https://oneflow.readthedocs.i...)
- global tensor (https://docs.oneflow.org/mast...)
2 OneFlow's tensor type system
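The internal OneFlow C++ Tensor object introduced above is defined in oneflow/core/framework/tensor.h and is an abstract Tensor type.
Among its subclasses, LocalTensor is an ordinary tensor seen from a single device (similar to PyTorch's Tensor), while GlobalTensor is OneFlow's own tensor seen from a global view (typically used in eager global mode or lazy mode). Tensor uses the Bridge pattern: each Tensor subclass holds a TensorImpl field that is responsible for the actual implementation behind the abstract Tensor.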
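The Bridge arrangement can be illustrated with a minimal Python sketch. The class names mirror the ones mentioned above, but every field and method here (`shape`, `is_local`, the constructor arguments, the placement and sbp strings) is a simplified assumption for illustration and not the actual interface in tensor.h.

```python
class TensorImpl:
    """Implementation side of the bridge (hypothetical, simplified)."""

    def shape(self):
        raise NotImplementedError


class LocalTensorImpl(TensorImpl):
    def __init__(self, shape, device):
        self._shape = tuple(shape)
        self.device = device

    def shape(self):
        return self._shape


class GlobalTensorImpl(TensorImpl):
    def __init__(self, logical_shape, placement, sbp):
        self._logical_shape = tuple(logical_shape)
        self.placement = placement
        self.sbp = sbp

    def shape(self):
        # In this sketch a global tensor reports its logical shape.
        return self._logical_shape


class Tensor:
    """Abstraction side: holds an impl and forwards queries to it."""

    def __init__(self, impl):
        self._impl = impl

    @property
    def shape(self):
        return self._impl.shape()

    @property
    def is_local(self):
        raise NotImplementedError


class LocalTensor(Tensor):
    @property
    def is_local(self):
        return True


class GlobalTensor(Tensor):
    @property
    def is_local(self):
        return False


local = LocalTensor(LocalTensorImpl((2, 3), device="cpu"))
glob = GlobalTensor(GlobalTensorImpl((4, 3), placement="two-devices", sbp="split(0)"))
print(local.shape, local.is_local)
print(glob.shape, glob.is_local)
```

Keeping the implementation behind a separate object lets the same Tensor-facing interface be backed by different storage and execution strategies.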
3 Constructing a local tensor
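Let's take flow.tensor([[1,2,3],[4,5,6]]) as an example and walk through how the tensor is constructed. The main flow is outlined below.
In this example the tensor is created with flow.tensor (and is an ordinary local tensor), so the TensorWithData method defined in `oneflow/api/python/functional/tensor_api.yaml` is used. Its implementation is TensorWithDataFunctor in `oneflow/api/python/functional/tensor_api.cpp`: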
    class TensorWithDataFunctor {
     public:
      Maybe<Tensor> operator()(PyObject* data, const Optional<Symbol<DType>>& dtype,
                               const Optional<Symbol<Device>>& device, const bool requires_grad,
                               const bool pin_memory) const {
        ...
        if (PyTensor_Check(data)) {
          // Throw warnings like pytorch.
          auto ret = PyErr_WarnEx(
              PyExc_UserWarning,
              "To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() "
              "or sourceTensor.clone().detach().requires_grad_(True), rather than "
              "oneflow.tensor(sourceTensor).",
              1);
          if (ret != 0) { return Error::RuntimeError(); }
          const auto& other = PyTensor_Unpack(data);
          return MakeTensorFromOtherTensor(other, dtype, device, requires_grad, pin_memory);
        } else {
          // Make tensor from python sequence or numpy array.
          return MakeLocalTensorFromData(data, dtype, device, requires_grad, pin_memory);
        }
      }
    };
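Because the data passed in here is a Python list object, MakeLocalTensorFromData is ultimately called, and most of the tensor-creation logic lives in this function. It makes heavy use of Python and NumPy interfaces to check the data type of the PyObject and to obtain the Shape (https://github.com/Oneflow-In...) and DataType (https://github.com/Oneflow-In...). If the user does not specify a device, the CPU device is used by default (https://github.com/Oneflow-In...).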
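The kind of inspection described above can be sketched in pure Python. This is not the logic of MakeLocalTensorFromData, which relies on Python and NumPy C interfaces; `infer_shape`, `infer_dtype`, and `describe` are hypothetical helpers, and the dtype rule (float32 if any leaf is a float, otherwise int64) is an assumption chosen to match the example outputs shown earlier.

```python
def infer_shape(data):
    """Infer the shape of a rectangular nested list or tuple."""
    if not isinstance(data, (list, tuple)):
        return ()  # a scalar leaf has an empty shape
    if len(data) == 0:
        return (0,)
    sub_shapes = [infer_shape(item) for item in data]
    if any(s != sub_shapes[0] for s in sub_shapes):
        raise ValueError("ragged nested sequence")
    return (len(data),) + sub_shapes[0]


def infer_dtype(data):
    """Assumed rule: float32 if any leaf is a float, otherwise int64."""
    if isinstance(data, (list, tuple)):
        kinds = {infer_dtype(item) for item in data}
        return "float32" if "float32" in kinds else "int64"
    return "float32" if isinstance(data, float) else "int64"


def describe(data, dtype=None, device=None):
    """Collect the metadata needed before any memory is allocated."""
    return {
        "shape": infer_shape(data),
        "dtype": dtype or infer_dtype(data),
        "device": device or "cpu",  # default device when none is given
    }


print(describe([[1, 2, 3], [4, 5, 6]]))
```

Only metadata is produced at this stage; allocating memory and copying the values are separate steps.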
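After that, the function mainly calls EmptyFunctor (https://github.com/Oneflow-In...) and SwitchCopyLocalTensorFromUntypedArray (https://github.com/Oneflow-In...). The former allocates memory for the tensor and the latter copies the data; both steps are carried out through virtual machine instructions. EmptyFunctor goes through the ordinary OpCall instruction, while CopyLocalTensorFromUntypedArray goes through the `AccessBlobByCallback/SyncAccessBlobByCallback` instruction, depending on whether a synchronous copy is required.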
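Why go through virtual machine instructions? Whether the task is allocating memory or copying data, the operations differ across devices such as CPU and CUDA. As the earlier discussion of Op/Kernel showed, in OneFlow the execution of all dynamic-graph and static-graph tasks, op/kernel execution in eager mode, the allocation and release of host and device memory, and devices and streams are all managed uniformly by the virtual machine.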
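3.1 Allocating memory: EmptyFunctor
Operations such as matmul and relu (when inplace=false) also create output tensors during execution. The earlier discussion of relu focused on the computation logic of the op and kernel and left out the tensor-related parts.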
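Here only an empty tensor object needs to be constructed, with no other computation, so this is an Empty operation. The kernel for the Empty op, EmptyKernel (https://github.com/Oneflow-In...), has no substantive computation logic. It merely creates an empty tensor from the shape, dtype, and device information, and the actual data is later copied from memory into this empty tensor, which completes the creation of the tensor.
Like other functors, EmptyFunctor is eventually dispatched to the corresponding interpreter to be interpreted and executed. Because this is a local tensor in eager mode, EmptyFunctor ends up in the eager local interpreter and is handled by the NaiveInterpret (https://github.com/Oneflow-In...) method. The flow is as follows:
- First, an EagerLocalTensorImpl (https://github.com/Oneflow-In...) object is constructed to hold the tensor result. This is only a shell, though; no storage has been allocated for the tensor's data yet.
- Next, EagerBlobObject (https://github.com/Oneflow-In...) and TensorStorage (https://github.com/Oneflow-In...) are initialized, at which point the tensor's main fields are essentially in place.
- Then an OpCall instruction is constructed and submitted to the virtual machine via PhysicalRun (https://github.com/Oneflow-In...), where it waits to be scheduled and executed by the vm.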
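The instruction policy for OpCall eventually lands in oneflow/core/vm/op_call_instruction_policy.cpp. In its Prepare method, AllocateOutputBlobsMemory performs the actual memory allocation for the TensorStorage; in its Compute method, the actual kernel (the one for the empty op) is launched.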
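The split between building a shell, allocating in Prepare, and launching the kernel in Compute can be sketched with a toy model. Nothing here is OneFlow code: `Storage`, `EmptyOpCall`, and `ToyVM` are hypothetical classes, and the fixed 4-byte element size is an assumption made only to keep the example small.

```python
from collections import deque


class Storage:
    """Shell for tensor data; no memory is attached at construction time."""

    def __init__(self):
        self.buffer = None


class EmptyOpCall:
    """Hypothetical two-phase instruction modeled on the description above."""

    def __init__(self, storage, num_elements):
        self.storage = storage
        self.num_elements = num_elements
        self.kernel_launched = False

    def prepare(self):
        # Counterpart of allocating the output blob memory
        # (assumes 4-byte elements).
        self.storage.buffer = bytearray(self.num_elements * 4)

    def compute(self):
        # The empty kernel has nothing to calculate; it only requires
        # that the output memory already exists.
        assert self.storage.buffer is not None
        self.kernel_launched = True


class ToyVM:
    """Runs submitted instructions in order: prepare first, then compute."""

    def __init__(self):
        self._pending = deque()

    def submit(self, instruction):
        self._pending.append(instruction)

    def run(self):
        while self._pending:
            instruction = self._pending.popleft()
            instruction.prepare()
            instruction.compute()


storage = Storage()
vm = ToyVM()
vm.submit(EmptyOpCall(storage, num_elements=6))
print(storage.buffer is None)  # still a shell until the VM runs
vm.run()
print(len(storage.buffer))
```

Submitting an instruction and executing it are separate events, which is why the storage stays unallocated until the toy VM actually runs.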
3.2 Copying data: SwitchCopyLocalTensorFromUntypedArray
SwitchCopyLocalTensorFromUntypedArray is actually the function name produced by expanding the MAKE_SWITCH_ENTRY (https://github.com/Oneflow-In...) macro. The expanded code is shown below; it ends up calling CopyLocalTensorFromUntypedArray (https://github.com/Oneflow-In...).
    template<typename... Args>
    static Maybe<void> SwitchCopyLocalTensorFromUntypedArray(
        const std::tuple<DataType>& switch_tuple, Args&& ... args) {
      static const std::map<std::tuple<DataType>, std::function<Maybe<void>(Args && ...)>>
          case_handlers {
              {SwitchCase(DataType::kFloat),
               [](Args&&... args) {
                 return CopyLocalTensorFromUntypedArray<float>(std::forward<Args>(args)...);
               }},
              // ...
          };
      return case_handlers.at(switch_tuple)(std::forward<Args>(args)...);
    };
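The expanded macro is essentially a lookup table from a runtime data type to a type-specific handler. A rough Python analogue, assuming hypothetical handler names and string dtype tags in place of the DataType enum, might look like this:

```python
import struct


def copy_as_float32(buffer, values):
    # Write the values as little-endian 4-byte floats.
    struct.pack_into(f"<{len(values)}f", buffer, 0, *values)


def copy_as_int64(buffer, values):
    # Write the values as little-endian 8-byte signed integers.
    struct.pack_into(f"<{len(values)}q", buffer, 0, *values)


# Keyed by a one-element tuple, mirroring std::tuple<DataType>.
CASE_HANDLERS = {
    ("float32",): copy_as_float32,
    ("int64",): copy_as_int64,
}


def switch_copy(switch_tuple, buffer, values):
    # Look up the handler for the runtime dtype and forward the arguments.
    return CASE_HANDLERS[switch_tuple](buffer, values)


buf = bytearray(12)
switch_copy(("float32",), buf, [1.0, 2.0, 3.0])
print(struct.unpack("<3f", buf))
```

A table of this kind keeps the runtime branching in one place, so each handler can be written for exactly one element type.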
The CopyLocalTensorFromUntypedArray method is as follows:
    template<typename T>
    Maybe<void> CopyLocalTensorFromUntypedArray(const std::shared_ptr<Tensor>& tensor,
                                                PyObject* array) {
      return CopyBetweenLocalTensorAndNumpy<T>(tensor, array, CopyFromNumpyArray, "mut",
                                               /*block_host_until_done=*/false);
    }
Internally it calls the CopyBetweenLocalTensorAndNumpy method.
CopyBetweenLocalTensorAndNumpy
As the name suggests, this method copies data between numpy and a tensor. Its third argument, CopyFromNumpyArray, is a callback function that mainly uses SyncAutoMemcpy to copy memory between the array and the tensor (blob):
    void CopyFromNumpyArray(ep::Stream* stream,
                            const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object,
                            const NumPyArrayPtr& array_ptr) {
      SyncAutoMemcpy(stream, eager_blob_object->mut_dptr(), array_ptr.data(),
                     eager_blob_object->ByteSizeOfBlobBody(), eager_blob_object->mem_case(),
                     memory::MakeHostMemCase());
    }
Continuing with the CopyBetweenLocalTensorAndNumpy (https://github.com/Oneflow-In...) method, the key part is:
    JUST(PhysicalRun([&](InstructionsBuilder* builder) -> Maybe<void> {
      return builder->AccessBlobByCallback(
          tensor,
          [array_ptr, Copy](ep::Stream* stream,
                            const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object) {
            Copy(stream, eager_blob_object, array_ptr);
          },
          modifier);
    }));
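An AccessBlobByCallback instruction is built through InstructionsBuilder. Its arguments are the empty tensor created above by EmptyFunctor, the callback function pointer together with its arguments, and the modifier (the string "mut" indicates that the blob may be modified).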
AccessBlobByCallback
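As with OpCall, when InstructionsBuilder calls AccessBlobByCallback it constructs the corresponding vm instruction policy, AccessBlobArgCbInstructionPolicy, and dispatches it to the vm, where it waits to be scheduled and executed: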
    template<typename T>
    Maybe<void> InstructionsBuilder::AccessBlobByCallback(
        const T tensor,
        const std::function<void(ep::Stream*, const std::shared_ptr<vm::EagerBlobObject>&)>& callback,
        const std::string& modifier) {
      const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object = JUST(tensor->eager_blob_object());
      Symbol<Device> device = JUST(GetDevice(tensor));
      ...
      Symbol<Stream> stream = JUST(GetDefaultStreamByDevice(device));
      JUST(SoftSyncStream({eager_blob_object}, stream));
      auto instruction = intrusive::make_shared<vm::Instruction>(
          // Never replace `stream` with producer_stream or last_used_stream.
          JUST(Singleton<VirtualMachine>::Get()->GetVmStream(stream)),
          std::make_shared<vm::AccessBlobArgCbInstructionPolicy>(eager_blob_object, callback,
                                                                 modifier));
      instruction_list_->EmplaceBack(std::move(instruction));
      return Maybe<void>::Ok();
    }
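When this AccessBlobArgCbInstructionPolicy instruction is actually executed, the callback is invoked in the instruction's Compute (https://github.com/Oneflow-In...) method to copy data between the tensor's blob and the numpy ndarray. At that point the copy is finished and the creation of the tensor by flow.tensor is complete.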
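The callback-based copy can be mimicked with a short, self-contained sketch. It is a toy model under stated assumptions: `AccessBlobInstruction`, `CallbackVM`, `copy_from_array`, and `copy_into_tensor` are hypothetical names, a plain Python list plays the role of the blob, and the stream is just a label.

```python
from collections import deque


class AccessBlobInstruction:
    """Holds a blob and a callback; the callback runs only in compute()."""

    def __init__(self, blob, callback, modifier):
        self.blob = blob
        self.callback = callback
        self.modifier = modifier  # "mut" marks the blob as writable

    def compute(self, stream):
        self.callback(stream, self.blob)


class CallbackVM:
    def __init__(self):
        self._queue = deque()

    def physical_run(self, build):
        # The build function appends instructions; nothing executes yet.
        build(self._queue)

    def schedule(self, stream="cpu-stream"):
        while self._queue:
            self._queue.popleft().compute(stream)


def copy_from_array(stream, blob, array):
    # Stand-in for the memcpy step: overwrite the blob contents in place.
    blob[:] = array


def copy_into_tensor(vm, blob, array):
    def build(instruction_list):
        # The closure captures `array`, like the C++ lambda captures array_ptr.
        callback = lambda stream, b: copy_from_array(stream, b, array)
        instruction_list.append(AccessBlobInstruction(blob, callback, "mut"))

    vm.physical_run(build)


vm = CallbackVM()
blob = [0, 0, 0]
copy_into_tensor(vm, blob, [1, 2, 3])
print(blob)  # unchanged: the instruction is only queued
vm.schedule()
print(blob)
```

Wrapping the copy in a callback lets the scheduler decide when and on which stream it runs, while the caller only describes what should happen to the blob.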
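(Published with permission. Original: https://segmentfault.com/a/11...)
References
- OneFlow source code: https://github.com/Oneflow-In...
- OneFlow source code analysis: Op, Kernel, and the interpreter
- OneFlow source code analysis: execution of operator instructions in the virtual machine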