ByteDance: Python 3 CPython optimization, implementing parallel interpreters

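This article describes a parallel optimization of the CPython interpreter that lets it support truly parallel execution across multiple interpreters.

Author: Xie Junyi, ByteDance Client Infrastructure

In our business scenarios we run algorithm packages on CPython. Because of how CPython is implemented, a single process cannot use multiple CPU cores to execute algorithm packages at the same time. We therefore decided to optimize CPython, with the goal of giving it highly complete support for parallelism and substantially improving the execution efficiency of Python algorithm packages inside a single process.

In 2020 we completed the parallel-execution rework of CPython. It is currently the industry's first parallel implementation of CPython 3 that is both highly complete and compatible with the Python C API.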

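Performance
- Single-threaded performance degrades by 7.7%.
- Multithreading has essentially no lock contention; each additional thread reduces execution time by 44%.
- Parallel execution substantially reduces total execution time.
Passes CPython's unit tests
Fully rolled out in production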

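CPython is the official Python interpreter implementation. In CPython, the GIL (global interpreter lock) protects access to Python objects and so prevents multiple threads from executing Python bytecode at the same time. The GIL prevents race conditions and ensures thread safety. Because of the GIL, CPython cannot truly execute Python bytecode in parallel. Although the GIL limits Python's parallelism, the CPython code base was not written with parallel execution in mind and is full of shared variables of every kind, so removing it is too complex a change, and upstream has never removed the GIL.

In the 20 years that Python has been open source, the GIL has kept Python from running in parallel. There are currently two mainstream technical routes to Python parallelism, but neither has produced a highly complete solution (high performance, compatible with all open-source features, stable API). The main reasons are:

Removing the GIL directly: the interpreter then needs many fine-grained locks, which hurts single-threaded performance and makes it about twice as slow.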

Back in the days of Python 1.5, Greg Stein actually implemented a comprehensive patch set (the "free threading" patches) that removed the GIL and replaced it with fine-grained locking. Unfortunately, even on Windows (where locks are very efficient) this ran ordinary Python code about twice as slow as the interpreter using the GIL. On Linux the performance loss was even worse because pthread locks aren't as efficient.

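Isolating interpreter state: the interpreter's internal implementation is full of global state, so the rework is tedious and the workload is large.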

It has been suggested that the GIL should be a per-interpreter-state lock rather than truly global; interpreters then wouldn’t be able to share objects. Unfortunately, this isn’t likely to happen either. It would be a tremendous amount of work, because many object implementations currently have global state. For example, small integers and short strings are cached; these caches would have to be moved to the interpreter state. Other object types have their own free list; these free lists would have to be moved to the interpreter state. And so on.

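There is an open-source project pursuing this approach, multi-core-python, but it has been shelved. At present it can only run demos of very simple arithmetic. It has not solved the parallel-execution problems of types and many modules, so it cannot be used in real scenarios.

To achieve the best execution performance, we followed multi-core-python and built a highly complete parallel implementation on CPython 3.10.

Global interpreter state is converted so that each interpreter structure holds its own runtime state (its own GIL and its own execution state).
Parallelism is supported with isolated interpreter state, and parallel performance is unaffected by the number of interpreters (there is essentially no lock contention between interpreters).
The Python interpreter state is obtained through the thread's thread-specific data.

Under this new architecture, Python interpreters are isolated from each other, do not share a GIL, and can execute in parallel. This makes full use of the multi-core performance of modern CPUs and greatly reduces the execution time of business algorithm code.

Interpreter execution uses many shared variables, which generally exist as global variables. When multiple interpreters run, they read and write these shared variables at the same time, which is not thread-safe.

The main shared variables inside CPython are listed under "shared variables still to be handled in 3.10". There are roughly 1000 … to deal with, which is a very large amount of work.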

free lists
- MemoryError
- asynchronous generator
- context
- dict
- float
- frame
- list
- slice
singletons
- small integer ([-5; 256] range)
- empty bytes string singleton
- empty Unicode string singleton
- empty tuple singleton
- single byte character (b'\x00' to b'\xFF')
- single Unicode character (U+0000-U+00FF range)
cache
- slice cache
- method cache
- bigint cache
- …
interned strings
PyUnicode_FromId static strings
….

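How can each interpreter get its own copy of these variables?

CPython is implemented in C. In C, we would normally pass an interpreter_state struct pointer as a parameter to hold the member variables that belong to one interpreter. This is also the best-performing approach. But it would require changing the signature of every function that uses interpreter_state, which is nearly impossible from an engineering standpoint.

So we take a different approach: we store interpreter_state in thread-specific data. While an interpreter is executing, it obtains interpreter_state through a thread-specific key. The execution state is then available through the thread-specific API, and no function signature has to change.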

static inline PyInterpreterState* _PyInterpreterState_GET(void) {
    PyThreadState *tstate = _PyThreadState_GET();
#ifdef Py_DEBUG
    _Py_EnsureTstateNotNULL(tstate);
#endif
    return tstate->interp;
}
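
The idea can be illustrated outside of C. The following is a minimal Python sketch, not the CPython implementation: `threading.local` plays the role of the thread-specific key, and `InterpreterState`, `current_interpreter`, and `cached_small_int` are hypothetical names invented for this illustration. It shows how a deep helper can reach per-interpreter state without an extra parameter.

```python
import threading

# Hypothetical stand-in for the thread-specific slot that holds interpreter state.
_tls = threading.local()


class InterpreterState:
    """Toy per-interpreter state: each instance owns its own small-int cache."""

    def __init__(self, name):
        self.name = name
        self.small_ints = {}


def current_interpreter():
    """Analogue of _PyInterpreterState_GET(): takes no parameter, reads the thread slot."""
    return _tls.interp


def cached_small_int(value):
    # The helper keeps its signature; it finds the owning interpreter by itself.
    cache = current_interpreter().small_ints
    return cache.setdefault(value, value)


def run_in_thread(name, results):
    # Bind this thread to its own interpreter state, then use the helper.
    _tls.interp = InterpreterState(name)
    cached_small_int(7)
    results[name] = current_interpreter()


results = {}
threads = [threading.Thread(target=run_in_thread, args=(n, results)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread ends up with a distinct state object and a distinct cache, even though `cached_small_int` never received an interpreter argument.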

Shared variables become owned by each interpreter: we move all shared variables into interpreter_state.

    /* Small integers are preallocated in this array so that they
       can be shared.
       The integers that are preallocated are those in the range
       -_PY_NSMALLNEGINTS (inclusive) to _PY_NSMALLPOSINTS (not inclusive).
    */
    PyLongObject* small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
    struct _Py_bytes_state bytes;
    struct _Py_unicode_state unicode;
    struct _Py_float_state float_state;
    /* Using a cache is very effective since typically only a single slice is
       created and then deleted again. */
    PySliceObject *slice_cache;

    struct _Py_tuple_state tuple;
    struct _Py_list_state list;
    struct _Py_dict_state dict_state;
    struct _Py_frame_state frame;
    struct _Py_async_gen_state async_gen;
    struct _Py_context_state context;
    struct _Py_exc_state exc_state;

    struct ast_state ast;
    struct type_cache type_cache;
#ifndef PY_NO_SHORT_FLOAT_REPR
    struct _PyDtoa_Bigint *dtoa_freelist[_PyDtoa_Kmax + 1];
#endif

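They are accessed quickly through _PyInterpreterState_GET. For example:

/* Get Bigint freelist from interpreter */
static Bigint **
get_freelist(void) {
    PyInterpreterState *interp = _PyInterpreterState_GET();
    return interp->dtoa_freelist;
}

Note that turning a global variable into thread-specific data has a performance cost, but as long as the number of calls to this API is kept under control, the cost is acceptable. Building on the changes already present in CPython 3.10, we resolved all kinds of shared-variable problems (see "shared variables still to be handled in 3.10").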

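CPython 3.x currently exposes PyType_xxx type variables in its API. Third-party extension code references these global type variables as &PyType_xxx. Isolating types into sub-interpreters would inevitably cause incompatibility. This is also why the upstream work has stalled: the problem cannot be fixed in Python 3 with a reasonable change, and would have to wait until Python 4 changes the API.

We solved this problem quickly in a different way.

Types being shared variables causes the following problems:

The reference count of a type object is modified frequently, which is not thread-safe.
The member variables of a type object are modified, which is not thread-safe.

The fix:

Make type objects immortal.
Add locks at the unsafe places that are used infrequently.
In high-frequency scenarios, make the member variables that are used immortal objects.
- For Python's descriptor mechanism, the descriptors generated in actual use for a type's property, function, classmethod, staticmethod, and doc are also made immortal objects.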
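
To make the term concrete, here is a toy Python model of an immortal object, given as a sketch under stated assumptions rather than the actual change: `ToyObject`, `incref`, `decref`, and the sentinel `IMMORTAL_REFCNT` are all hypothetical. The point it shows is that reference-count operations on an immortal object perform no write, so there is nothing for concurrently running interpreters to race on.

```python
IMMORTAL_REFCNT = 1 << 30  # hypothetical sentinel, not CPython's actual value


class ToyObject:
    """Toy object with an explicit reference count."""

    def __init__(self, name, immortal=False):
        self.name = name
        self.refcnt = IMMORTAL_REFCNT if immortal else 1
        self.freed = False


def incref(obj):
    if obj.refcnt == IMMORTAL_REFCNT:
        return  # no write happens, so there is no data race on the count
    obj.refcnt += 1


def decref(obj):
    if obj.refcnt == IMMORTAL_REFCNT:
        return  # an immortal object is never freed
    obj.refcnt -= 1
    if obj.refcnt == 0:
        obj.freed = True


shared_type = ToyObject("int_type", immortal=True)
instance = ToyObject("some_instance")

for _ in range(1000):
    incref(shared_type)
    decref(shared_type)
decref(shared_type)  # even an unbalanced decref leaves it alive

incref(instance)
decref(instance)
decref(instance)  # last reference dropped, the ordinary object is freed
```

The trade-off is that the immortal object is never reclaimed.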

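As a result, types and these member variables are leaked. However, because CPython has a module cache, this causes no problem as long as the cache is not cleared.

We replaced the pymalloc memory pool with mimalloc. This improves performance by 1%-2% and also means pymalloc needs no additional handling.

In the latest upstream master code, the subinterpreter module only provides interp_run_string, which executes a code_string. For size and security reasons, we have already removed Python's ability to execute a code_string dynamically. We added two extra capabilities to the subinterpreter module:

interp_call_file: executes a Python pyc file
interp_call_function: executes an arbitrary function

In Python, code runs in the main interpreter by default. We can also create a sub interpreter to execute code: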

interp = _xxsubinterpreters.create()
result = _xxsubinterpreters.interp_call_function(*args, **kwargs)

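Note that we create the sub interpreter in the main interpreter, then execute in the sub interpreter, and finally return the result to the main interpreter. This looks simple, but a lot happens:

The main interpreter passes the arguments to the sub interpreter
The thread switches to the sub interpreter's interpreter_state, then fetches and converts the arguments
The sub interpreter interprets and executes the code
The return value is obtained and the thread switches back to the main interpreter
The return value is converted
Exceptions are handled

There are two complicated parts here:

Switching the interpreter state
Passing data between interpreters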

interp = _xxsubinterpreters.create()
result = _xxsubinterpreters.interp_call_function(*args, **kwargs)

We can break this down as:

# Running in thread 11:
# main interpreter:
# the thread-specific interpreter state is currently the main interpreter's
do some things ...
create subinterpreter ...
interp_call_function ...
# thread-specific interpreter state is set to the sub interpreter state
# sub interpreter:
do some things ...
call function ...
get result ...
# thread-specific interpreter state is set back to the main interpreter state
get return result ...
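
The save, switch, and restore pattern in the trace above can be sketched in Python. This is only an analogy under stated assumptions: `threading.local` stands in for the thread-specific slot, interpreter states are represented by plain strings, and `switched_to` and `current` are hypothetical helpers, not part of any CPython API.

```python
import contextlib
import threading

_tls = threading.local()  # hypothetical thread-specific slot


def current():
    """Return the interpreter this thread is bound to ("main" by default)."""
    return getattr(_tls, "interp", "main")


@contextlib.contextmanager
def switched_to(interp_name):
    saved = current()            # remember the caller's interpreter
    _tls.interp = interp_name    # bind the thread to the sub interpreter
    try:
        yield
    finally:
        _tls.interp = saved      # switch back, even if the call raised


trace = [current()]
with switched_to("sub-1"):
    trace.append(current())      # code here runs "inside" the sub interpreter
trace.append(current())

try:
    with switched_to("sub-2"):
        raise RuntimeError("failure inside the sub interpreter")
except RuntimeError:
    pass
trace.append(current())          # the main binding is restored after the error
```

Restoring in a `finally` clause mirrors the requirement that the thread always returns to the caller's interpreter state, including on the exception path.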

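Because interpreter execution state is isolated, a Python object created in the main interpreter cannot be used in a sub interpreter. We need to:

Extract the key data of the PyObject in the main interpreter
Store it in a block of memory
Recreate the PyObject from that data in the sub interpreter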
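
The three steps above can be sketched in Python as follows. This is an illustration only: the real mechanism works on C-level cross-interpreter data, whereas `to_shared` and `from_shared` are hypothetical helpers that reduce a few simple types to a type tag plus raw bytes and rebuild a fresh object from them.

```python
def to_shared(obj):
    """Reduce an object to (type tag, raw bytes); no object reference survives."""
    if obj is None:
        return ("none", b"")
    if isinstance(obj, bool):
        raise TypeError("bool is not supported in this sketch")
    if isinstance(obj, int):
        length = (obj.bit_length() + 8) // 8  # leaves room for the sign bit
        return ("int", obj.to_bytes(length, "little", signed=True))
    if isinstance(obj, str):
        return ("str", obj.encode("utf-8"))
    if isinstance(obj, bytes):
        return ("bytes", bytes(obj))
    raise TypeError(f"cannot share {type(obj).__name__}")


def from_shared(shared):
    """Rebuild a new object on the receiving side from the raw data."""
    tag, raw = shared
    if tag == "none":
        return None
    if tag == "int":
        return int.from_bytes(raw, "little", signed=True)
    if tag == "str":
        return raw.decode("utf-8")
    if tag == "bytes":
        return bytes(raw)
    raise ValueError(f"unknown tag {tag!r}")


# "Main interpreter" side: extract the data.
payload = [to_shared(value) for value in ("numpy", "dot", 42)]
# "Sub interpreter" side: recreate equivalent objects from the data alone.
rebuilt = [from_shared(item) for item in payload]
```

Types that cannot be reduced this way are rejected with `TypeError` in this sketch, which reflects the general constraint that only data, not live objects, may cross the boundary.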

For the implementation of interpreter state switching and data passing, see the following example …

static PyObject *
_call_function_in_interpreter(PyObject *self, PyInterpreterState *interp, _sharedns *args_shared, _sharedns *kwargs_shared)
{
    PyObject *result = NULL;
    PyObject *exctype = NULL;
    PyObject *excval = NULL;
    PyObject *tb = NULL;
    _sharedns *result_shread = _sharedns_new(1);

#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS
    // Switch to interpreter.
    PyThreadState *new_tstate = PyInterpreterState_ThreadHead(interp);
    PyThreadState *save1 = PyEval_SaveThread();

    (void)PyThreadState_Swap(new_tstate);
#else
    // Switch to interpreter.
    PyThreadState *save_tstate = NULL;
    if (interp != PyInterpreterState_Get()) {
        // XXX Using the "head" thread isn't strictly correct.
        PyThreadState *tstate = PyInterpreterState_ThreadHead(interp);
        // XXX Possible GILState issues?
        save_tstate = PyThreadState_Swap(tstate);
    }
#endif
    
    PyObject *module_name = _PyCrossInterpreterData_NewObject(&args_shared->items[0].data);
    PyObject *function_name = _PyCrossInterpreterData_NewObject(&args_shared->items[1].data);

    ...
    
    PyObject *module = PyImport_ImportModule(PyUnicode_AsUTF8(module_name));
    PyObject *function = PyObject_GetAttr(module, function_name);
    
    result = PyObject_Call(function, args, kwargs);

    ...

#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS
    // Switch back.
    PyEval_RestoreThread(save1);
#else
    // Switch back.
    if (save_tstate != NULL) {
        PyThreadState_Swap(save_tstate);
    }
#endif

    if (result) {
        result = _PyCrossInterpreterData_NewObject(&result_shread->items[0].data);
        _sharedns_free(result_shread);
    }
    
    return result;
}

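We have implemented the internal isolated execution environment, but this API is fairly low-level. It needs to be wrapped in higher-level abstractions to make sub-interpreter parallelism easier to use.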

interp = _xxsubinterpreters.create()
result = _xxsubinterpreters.interp_call_function(*args, **kwargs)

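Here we followed the thread pool, process pool, and futures implementations in Python's concurrent library and implemented our own subinterpreter pool. The high-level interface for asynchronous execution with callbacks is exposed through the concurrent.futures module.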

executer = concurrent.futures.SubInterpreterPoolExecutor(max_workers)
future = executer.submit(_xxsubinterpreters.call_function, module_name, func_name, *args, **kwargs)
future.context = context
future.add_done_callback(executeDoneCallBack)

Internally it is implemented like this: inherit from the Executor base class provided by concurrent

class SubInterpreterPoolExecutor(_base.Executor):

When SubInterpreterPool is initialized it creates the threads, and each thread creates one sub interpreter

interp = _xxsubinterpreters.create()
t = threading.Thread(name=thread_name, target=_worker,
                     args=(interp, 
                           weakref.ref(self, weakref_cb),
                           self._work_queue,
                           self._initializer,
                           self._initargs))

The worker thread receives the arguments and executes them with interp

result = self.fn(self.interp, *self.args, **self.kwargs)
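
The fragments above can be assembled into a runnable sketch. It is not the internal implementation: `FakeInterpreter` is a hypothetical stand-in for a sub-interpreter handle (it simply calls the function in the worker thread), and `PerWorkerInterpreterExecutor` is an assumed name. What the sketch does show accurately is the shape of the design: an `Executor` subclass whose worker threads each own one interpreter and pull work from a shared queue.

```python
import concurrent.futures
import itertools
import queue
import threading


class FakeInterpreter:
    """Hypothetical stand-in for a sub-interpreter handle."""

    _ids = itertools.count(1)

    def __init__(self):
        self.id = next(self._ids)

    def call(self, fn, *args, **kwargs):
        # A real sub interpreter would switch state and convert arguments here.
        return fn(*args, **kwargs)


class PerWorkerInterpreterExecutor(concurrent.futures.Executor):
    def __init__(self, max_workers=2):
        self._work_queue = queue.SimpleQueue()
        self._threads = []
        for _ in range(max_workers):
            interp = FakeInterpreter()  # one interpreter per worker thread
            t = threading.Thread(target=self._worker, args=(interp,), daemon=True)
            t.start()
            self._threads.append(t)

    def _worker(self, interp):
        while True:
            item = self._work_queue.get()
            if item is None:  # shutdown sentinel
                return
            future, fn, args, kwargs = item
            if not future.set_running_or_notify_cancel():
                continue  # the future was cancelled before it started
            try:
                future.set_result(interp.call(fn, *args, **kwargs))
            except BaseException as exc:
                future.set_exception(exc)

    def submit(self, fn, /, *args, **kwargs):
        future = concurrent.futures.Future()
        self._work_queue.put((future, fn, args, kwargs))
        return future

    def shutdown(self, wait=True, *, cancel_futures=False):
        for _ in self._threads:
            self._work_queue.put(None)  # one sentinel per worker
        if wait:
            for t in self._threads:
                t.join()


with PerWorkerInterpreterExecutor(max_workers=2) as pool:
    futures = [pool.submit(pow, 2, n) for n in range(5)]
    powers = [f.result() for f in futures]
    failed = pool.submit(int, "not a number")
    failure = failed.exception()
```

Because the class only overrides `submit` and `shutdown`, callers can use it like any other `concurrent.futures` executor, including `add_done_callback` on the returned futures.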

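The changes made for sub interpreters are large, and they carry two risks:

The code may have compatibility problems: third-party C/C++ extensions that keep global state are not thread-safe.
A very small number of Python modules cannot be used in a sub interpreter, for example process.

We want to unify the external interface so that users need not care about these details. The calling method is switched automatically, choosing between the main interpreter (good compatibility, stable) and a sub interpreter (supports parallelism, better performance).

We provide both C and Python implementations so that business teams can use them in a range of scenarios. Here is a simplified version of the Python implementation.

bddispatch.py abstracts the calling method, provides a unified execution interface, and handles exceptions and return results uniformly.

bddispatch.py: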

import traceback


def executeFunc(module_name, func_name, context=None, use_main_interp=True, *args, **kwargs):
    print("submit call ", module_name, ".", func_name)
    if use_main_interp:
        # Run synchronously in the main interpreter (best compatibility).
        result = None
        exception = None
        try:
            m = __import__(module_name)
            f = getattr(m, func_name)
            result = f(*args, **kwargs)
        except Exception:
            exception = traceback.format_exc()
        singletonExecutorCallback(result, exception, context)
    else:
        # Hand the call to the sub-interpreter pool (parallel execution).
        future = singletonExecutor.submit(_xxsubinterpreters.call_function, module_name, func_name, *args, **kwargs)
        future.context = context
        future.add_done_callback(executeDoneCallBack)


def executeDoneCallBack(future):
    # Check the exception first: future.result() re-raises it if one was set.
    e = future.exception()
    r = future.result() if e is None else None
    singletonExecutorCallback(r, e, future.context)

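In performance-critical scenarios, having the main interpreter call a sub interpreter to run the task, as described above, adds overhead. For these cases we provide some C APIs that let users who embed CPython directly bind to a specific interpreter and execute on it.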

class GILGuard {
public:
    GILGuard() {
        inter_ = BDPythonVMDispatchGetInterperter();
        if (inter_ == PyInterpreterState_Main()) {
            printf("Ensure on main interpreter: %p\n", (void *)inter_);
        } else {
            printf("Ensure on sub interpreter: %p\n", (void *)inter_);
        }
        gil_ = PyGILState_EnsureWithInterpreterState(inter_);
    }

    ~GILGuard() {
        if (inter_ == PyInterpreterState_Main()) {
            printf("Release on main interpreter: %p\n", (void *)inter_);
        } else {
            printf("Release on sub interpreter: %p\n", (void *)inter_);
        }
        PyGILState_Release(gil_);
    }

private:
    PyInterpreterState *inter_;
    PyGILState_STATE gil_;
};

// This binds to an interpreter automatically and executes on it directly
- (void)testNumpy {
    GILGuard gil_guard;
    BDPythonVMRun(....);
}

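The ByteDance Client Infrastructure team is a global R&D team for front-end foundational technology (with R&D teams in Beijing, Shanghai, Hangzhou, Shenzhen, Guangzhou, Singapore, and Mountain View in the United States). It is responsible for building front-end infrastructure for all of ByteDance and for improving the performance, stability, and engineering efficiency of the company's full product line. The products it supports include, but are not limited to, Douyin, Toutiao, Xigua Video, Feishu, and Fanqie Novel, and the team does in-depth work across mobile, Web, Desktop, and other platforms.

The team is currently hiring interns for Python interpreter optimization. The work mainly involves optimizing the CPython interpreter, optimizing a CPython JIT (developed in-house), and optimizing commonly used third-party CPython libraries. Feel free to get in touch via WeChat: beyourselfyii, or email: xiejunyi.arch@bytedance.com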

