A bug triggered by Process aliases in Erlang/OTP 24

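We recently upgraded to OTP 24, and a component that provides location-transparent calls to entities started timing out whenever the called entity process was exiting. After digging into it, the cause turned out to be related to an improvement in Erlang/OTP 24.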
https://www.erlang.org/blog/m…

When calling an entity, we do not call a pid directly. The request is forwarded through the node's router:

sequenceDiagram
    client->>+router: Request
    router->>+entity: Request
    entity-->>-router: Reply
    router-->>-client: Reply

We later made an optimization: the called entity replies directly to the client.

sequenceDiagram
    client->>+router: Request
    router->>+entity: Request
    entity-->>-client: Reply
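
The direct-reply flow above can be sketched roughly as follows. This is a minimal illustration, not the component's actual code: the module name, the `forwarded` message shape, and the use of the `'$gen_call'` label (what `gen_server:call/3` sends) are all assumptions. The router passes the client's `From` through untouched, and the entity answers with `gen_server:reply/2`.

```erlang
-module(direct_reply_sketch).
-export([start_entity/0, start_router/1, demo/0]).

%% Hypothetical entity: receives the client's From, forwarded by the
%% router, and replies to the client directly.
start_entity() ->
    spawn(fun entity_loop/0).

entity_loop() ->
    receive
        {forwarded, From, Request} ->
            %% gen_server:reply/2 delivers {Tag, Reply} to the caller in From.
            gen_server:reply(From, {echo, Request}),
            entity_loop()
    end.

%% Hypothetical router: forwards the request and never replies itself.
start_router(Entity) ->
    spawn(fun() -> router_loop(Entity) end).

router_loop(Entity) ->
    receive
        {'$gen_call', From, Request} ->
            Entity ! {forwarded, From, Request},
            router_loop(Entity)
    end.

%% The client calls the router, but the reply comes from the entity.
demo() ->
    Router = start_router(start_entity()),
    gen_server:call(Router, ping, 1000).
```

Because the router only passes `From` along, the reply skips one hop. The cost is that nothing answers the client if the entity dies before handling the request.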

If the called entity is in the middle of exiting, the long-lived router has to tell the client that the entity is going down. Otherwise the client times out.

sequenceDiagram
    client->>+router: Request
    entity->>+entity: terminate start
    Note over entity: the message never gets a chance to be handled
    router->>+entity: Request
    entity->>+entity: terminate end
    client->>+client: timeout

So the router monitors the entity. When the entity exits abnormally, the router answers the client on its behalf. This is exactly where the problem shows up.

First, the call code in Erlang/OTP 23:
gen.erl:202

do_call(Process, Label, Request, Timeout) when is_atom(Process) =:= false ->
    Mref = erlang:monitor(process, Process),

    %% OTP-21:
    %% Auto-connect is asynchronous. But we still use 'noconnect' to make sure
    %% we send on the monitored connection, and not trigger a new auto-connect.
    %%
    erlang:send(Process, {Label, {self(), Mref}, Request}, [noconnect]),

    receive
        {Mref, Reply} ->
            erlang:demonitor(Mref, [flush]),
            {ok, Reply};
        {'DOWN', Mref, _, _, noconnection} ->
            Node = get_node(Process),
            exit({nodedown, Node});
        {'DOWN', Mref, _, _, Reason} ->
            exit(Reason)
    after Timeout ->
            erlang:demonitor(Mref, [flush]),
            exit(timeout)
    end.

On OTP 23, our code runs as follows.

sequenceDiagram
    client->>+router: {Label, {client_pid, Mref}, Request}
    entity->>+entity: terminate start
    router->>+entity: Process.monitor ret Mref2
    router->>+entity: {{client_pid, Mref}, {router_pid, Mref2}, Request}
    entity->>+entity: terminate end
    entity->>+router: {:DOWN, Mref2, down_type, entity_pid, reason}
    router->>+client: {:DOWN, Mref, down_type, router_pid, reason}

If the entity happens to be exiting when the message arrives, the router can still pass that event on to the client.
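
The router side of this flow might look something like the sketch below. The module name, function names, and the `Pending` map are hypothetical. Only the message shapes are taken from the diagram. It assumes the tag inside the client's `From` is the bare monitor reference the client matches on, which holds for the OTP 23 `do_call` shown above.

```erlang
-module(router_down_sketch).
-export([forward/4, handle_down/2]).

%% Forward a client request to Entity and remember who is waiting.
%% Pending maps the router's own monitor reference (Mref2) to the
%% client's From = {ClientPid, Mref}.
forward(Entity, From, Request, Pending) ->
    Mref2 = erlang:monitor(process, Entity),
    Entity ! {From, {self(), Mref2}, Request},
    Pending#{Mref2 => From}.

%% When the entity goes down, send the waiting client a 'DOWN' message
%% carrying the client's own reference, so its receive in do_call matches.
%% Assumption: the tag in From is a bare reference (true on OTP 23).
handle_down({'DOWN', Mref2, Type, _EntityPid, Reason}, Pending) ->
    case maps:take(Mref2, Pending) of
        {{ClientPid, Mref}, Rest} ->
            ClientPid ! {'DOWN', Mref, Type, self(), Reason},
            Rest;
        error ->
            Pending
    end.
```
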
In OTP 24, however, the cross-node call looks like this:
gen.erl:223

do_call(Process, Label, Request, Timeout) when is_atom(Process) =:= false ->
    Mref = erlang:monitor(process, Process, [{alias,demonitor}]),

    Tag = [alias | Mref],

    %% OTP-24:
    %% Using alias to prevent responses after 'noconnection' and timeouts.
    %% We however still may call nodes responding via process identifier, so
    %% we still use 'noconnect' on send in order to try to send on the
    %% monitored connection, and not trigger a new auto-connect.
    %%
    erlang:send(Process, {Label, {self(), Tag}, Request}, [noconnect]),

    receive
        {[alias | Mref], Reply} ->
            erlang:demonitor(Mref, [flush]),
            {ok, Reply};
        {'DOWN', Mref, _, _, noconnection} ->
            Node = get_node(Process),
            exit({nodedown, Node});
        {'DOWN', Mref, _, _, Reason} ->
            exit(Reason)
    after Timeout ->
            erlang:demonitor(Mref, [flush]),
            receive
                {[alias | Mref], Reply} ->
                    {ok, Reply}
            after 0 ->
                    exit(timeout)
            end
    end.

So on OTP 24 the flow becomes:

sequenceDiagram
    client->>+router: {Label, {client_pid, [alias | Mref]}, Request}
    entity->>+entity: terminate start
    router->>+entity: Process.monitor ret Mref2
    router->>+entity: {{client_pid, [alias | Mref]}, {router_pid, Mref2}, Request}
    entity->>+entity: terminate end
    entity->>+router: {:DOWN, Mref2, down_type, entity_pid, reason}
    Note over client: client expects Mref, not [alias | Mref]
    router->>+client: {:DOWN, [alias | Mref], down_type, router_pid, reason}
    client->>+client: timeout

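With that, the fix is straightforward: when the router relays the 'DOWN' message to the client, it must use the bare Mref rather than the [alias | Mref] tag.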
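
One possible shape for such a fix is sketched below. It is not necessarily the patch that was actually applied, and the module and function names are hypothetical. It covers only the two tag forms shown in this post: a bare reference (OTP 23) and `[alias | Mref]` (OTP 24).

```erlang
-module(router_tag_sketch).
-export([monitor_ref/1, notify_down/3]).

%% Extract the monitor reference the client matches on in its
%% {'DOWN', Mref, _, _, _} clause, whichever tag form the client sent.
monitor_ref([alias | Mref]) when is_reference(Mref) -> Mref;  % OTP 24 tag
monitor_ref(Mref) when is_reference(Mref) -> Mref.            % OTP 23 tag

%% Relay an entity's exit to the waiting client.
%% From is the {ClientPid, Tag} pair taken from the original request.
notify_down({ClientPid, Tag}, Type, Reason) ->
    ClientPid ! {'DOWN', monitor_ref(Tag), Type, self(), Reason},
    ok.
```

Normal replies still use the full tag unchanged. Only the relayed 'DOWN' message needs the unwrapped reference.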

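Erlang's Process aliases are a great change. They solve the problem of a reply to a remote call arriving only after the call has timed out: a reply received after the timeout is now dropped. That makes catching the timeout safe as well.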
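
The drop-after-timeout behaviour can be observed with a small experiment using `erlang:alias/0` and `erlang:unalias/1`, available from OTP 24. The module name, message shapes, and timings below are made up for illustration. The slow server answers only after the client has given up and deactivated its alias.

```erlang
-module(alias_demo).
-export([late_reply/0]).

late_reply() ->
    Alias = erlang:alias(),
    %% Hypothetical slow server: replies to the alias after 100 ms.
    Server = spawn(fun() ->
                       receive
                           {req, ReplyTo} ->
                               timer:sleep(100),
                               ReplyTo ! {ReplyTo, late_reply}
                       end
                   end),
    Server ! {req, Alias},
    receive
        {Alias, Reply} ->
            {ok, Reply}
    after 10 ->
            %% Give up: deactivate the alias, so messages sent to it
            %% from now on are dropped instead of reaching the mailbox.
            erlang:unalias(Alias),
            receive
                {Alias, _Late} -> got_late_reply
            after 200 ->
                    dropped
            end
    end.
```

With these timings the call should return `dropped`, which is why a caller that catches a timeout no longer has to worry about a stale reply showing up in its mailbox later.
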
Recommended reading, another blog post I wrote earlier:
"On Erlang timeouts" (谈谈 erlang 的 timeout).
