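Preface
After recently upgrading to OTP 24, a component of ours that provides location-transparent calls to entities started timing out whenever the process of the called entity was in the middle of exiting. After digging in, I found that this is related to an improvement made in Erlang OTP 24.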
https://www.erlang.org/blog/m...
Cause
Implementation overview
When a client calls an entity, it does not call a pid directly. The call is forwarded by the node's router:
    sequenceDiagram
        client->>+router: Request
        router->>+entity: Request
        entity-->>-router: Reply
        router-->>-client: Reply
We later made an optimization: the called entity replies directly to the client.
    sequenceDiagram
        client->>+router: Request
        router->>+entity: Request
        entity-->>-client: Reply
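A minimal sketch of this direct-reply pattern, assuming both the router and the entity are gen_server processes. The module name `direct_reply_sketch`, the `{routed, From, Request}` message, and the `entity` / `{router, Entity}` states are made up for illustration and are not taken from the real component. The router hands the caller's `From` to the entity and returns `noreply`; the entity then answers with `gen_server:reply/2`.

```erlang
-module(direct_reply_sketch).
-behaviour(gen_server).

-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% One module plays both roles; the state says which one
%% (hypothetical layout: `entity` or `{router, EntityPid}`).
init(State) ->
    {ok, State}.

%% Router role: pass the caller's From on to the entity and do not reply.
handle_call(Request, From, {router, Entity} = State) ->
    gen_server:cast(Entity, {routed, From, Request}),
    {noreply, State};
handle_call(_Request, _From, entity) ->
    {reply, {error, use_router}, entity}.

%% Entity role: reply straight to the original caller.
handle_cast({routed, From, Request}, entity) ->
    gen_server:reply(From, {handled, Request}),
    {noreply, entity};
handle_cast(_Msg, State) ->
    {noreply, State}.

demo() ->
    {ok, Entity} = gen_server:start(?MODULE, entity, []),
    {ok, Router} = gen_server:start(?MODULE, {router, Entity}, []),
    %% The client only knows the router, yet the reply comes from the entity.
    Reply = gen_server:call(Router, ping, 1000),
    ok = gen_server:stop(Router),
    ok = gen_server:stop(Entity),
    Reply.
```

Calling `direct_reply_sketch:demo()` returns `{handled, ping}`. Note that the client's monitor points at the router, not at the entity, which is why an exiting entity goes unnoticed by the client unless the router steps in.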
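When the called entity is in the middle of exiting, the long-lived router has to answer the client and tell it that the entity is exiting. Otherwise the client times out.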
    sequenceDiagram
        client->>+router: Request
        entity->>+entity: terminate start
        Note over entity: the message never gets a chance to be handled
        router->>+entity: Request
        entity->>+entity: terminate end
        client->>+client: timeout
So the router monitors the entity. When the entity exits abnormally, the router answers the client on its behalf. This is exactly where the problem shows up.
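The monitoring step could look roughly like the sketch below. It is not the real router: the module name, the `{route, ...}` and `{routed, ...}` messages, and the `Pending` map are assumptions for illustration, and cleanup after a successful reply is left out. The demo uses a plain reference as the call tag, which is what an OTP 23 caller sends.

```erlang
-module(router_monitor_sketch).
-export([demo/0]).

%% Hypothetical router loop. Pending maps the router's own monitor
%% reference to the From ({ClientPid, Tag}) of the forwarded request.
router_loop(Pending) ->
    receive
        {route, Entity, From, Request} ->
            %% Monitor first, then forward, so an exit is never missed.
            MRef = erlang:monitor(process, Entity),
            Entity ! {routed, From, Request},
            router_loop(Pending#{MRef => From});
        {'DOWN', MRef, process, _Entity, Reason} ->
            case maps:take(MRef, Pending) of
                {{ClientPid, Tag}, Rest} ->
                    %% Tell the client, reusing the tag of its call.
                    ClientPid ! {'DOWN', Tag, process, self(), Reason},
                    router_loop(Rest);
                error ->
                    router_loop(Pending)
            end
    end.

demo() ->
    Router = spawn(fun() -> router_loop(#{}) end),
    %% An entity that exits as soon as a request reaches it.
    Entity = spawn(fun() ->
                       receive {routed, _From, _Req} -> exit(shutting_down) end
                   end),
    Tag = make_ref(),  % plain reference tag, OTP 23 style
    Router ! {route, Entity, {self(), Tag}, ping},
    Result = receive
                 {'DOWN', Tag, process, Router, Reason} -> {entity_down, Reason}
             after 1000 ->
                 timeout
             end,
    exit(Router, kill),
    Result.
```

`router_monitor_sketch:demo()` returns `{entity_down, shutting_down}`: the client learns about the exit immediately instead of waiting for its timeout.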
Root cause
First, look at the call code in Erlang OTP 23:
gen.erl:202
    do_call(Process, Label, Request, Timeout) when is_atom(Process) =:= false ->
        Mref = erlang:monitor(process, Process),

        %% OTP-21:
        %% Auto-connect is asynchronous. But we still use 'noconnect' to make sure
        %% we send on the monitored connection, and not trigger a new auto-connect.
        %%
        erlang:send(Process, {Label, {self(), Mref}, Request}, [noconnect]),

        receive
            {Mref, Reply} ->
                erlang:demonitor(Mref, [flush]),
                {ok, Reply};
            {'DOWN', Mref, _, _, noconnection} ->
                Node = get_node(Process),
                exit({nodedown, Node});
            {'DOWN', Mref, _, _, Reason} ->
                exit(Reason)
        after Timeout ->
            erlang:demonitor(Mref, [flush]),
            exit(timeout)
        end.
In OTP 23, our code runs as follows.
    sequenceDiagram
        client->>+router: {Label, {client_pid, Mref}, Request}
        entity->>+entity: terminate start
        router->>+entity: Process.monitor ret Mref2
        router->>+entity: {{client_pid, Mref}, {router_pid, Mref2}, Request}
        entity->>+entity: terminate end
        entity->>+router: {:DOWN, Mref2, down_type, entity_pid, reason}
        router->>+client: {:DOWN, Mref, down_type, router_pid, reason}
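So even when the message arrives just as the entity is exiting, the client is still notified of the event.
In OTP 24, however, a cross-node call looks like this: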
gen.erl:223
    do_call(Process, Label, Request, Timeout) when is_atom(Process) =:= false ->
        Mref = erlang:monitor(process, Process, [{alias,demonitor}]),

        Tag = [alias | Mref],

        %% OTP-24:
        %% Using alias to prevent responses after 'noconnection' and timeouts.
        %% We however still may call nodes responding via process identifier, so
        %% we still use 'noconnect' on send in order to try to send on the
        %% monitored connection, and not trigger a new auto-connect.
        %%
        erlang:send(Process, {Label, {self(), Tag}, Request}, [noconnect]),

        receive
            {[alias | Mref], Reply} ->
                erlang:demonitor(Mref, [flush]),
                {ok, Reply};
            {'DOWN', Mref, _, _, noconnection} ->
                Node = get_node(Process),
                exit({nodedown, Node});
            {'DOWN', Mref, _, _, Reason} ->
                exit(Reason)
        after Timeout ->
            erlang:demonitor(Mref, [flush]),
            receive
                {[alias | Mref], Reply} ->
                    {ok, Reply}
            after 0 ->
                exit(timeout)
            end
        end.
So in OTP 24, the flow turns into this:
    sequenceDiagram
        client->>+router: {Label, {client_pid, [alias | Mref]}, Request}
        entity->>+entity: terminate start
        router->>+entity: Process.monitor ret Mref2
        router->>+entity: {{client_pid, [alias | Mref]}, {router_pid, Mref2}, Request}
        entity->>+entity: terminate end
        entity->>+router: {:DOWN, Mref2, down_type, entity_pid, reason}
        Note over client: client expects Mref, not [alias | Mref]
        router->>+client: {:DOWN, [alias | Mref], down_type, router_pid, reason}
        client->>+client: timeout
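With that, the fix is self-evident: the 'DOWN' message the router relays to the client has to carry the bare Mref rather than the whole call tag.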
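One plausible shape for such a fix, as a sketch rather than the actual patch: before relaying the 'DOWN' message, the router unwraps the call tag so the client sees the bare monitor reference. The module and function names are hypothetical, and the `[alias | Mref]` layout is an internal detail of gen.erl taken from the snippet above, so it may change in later releases.

```erlang
-module(call_tag_sketch).
-export([monitor_ref/1, relay_down/3]).

%% OTP 24 gen:do_call tags the request with [alias | Mref];
%% OTP 23 and earlier use the bare monitor reference.
monitor_ref([alias | MRef]) when is_reference(MRef) -> MRef;
monitor_ref(MRef) when is_reference(MRef) -> MRef.

%% Relay an entity exit to the waiting client. From is the {Pid, Tag}
%% pair taken from the original call message.
relay_down({ClientPid, Tag}, RouterPid, Reason) ->
    ClientPid ! {'DOWN', monitor_ref(Tag), process, RouterPid, Reason},
    ok.
```

An alternative that avoids depending on the tag layout would be to answer with `gen_server:reply(From, {error, Reason})`, at the cost of the client seeing a return value instead of an exit.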
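Summary
Erlang's process aliases are a great change. They solve the problem of a reply arriving after a remote call has already timed out: a reply received after the timeout is now dropped. That makes it safe to catch the timeout as well.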
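The dropping behaviour can be seen in isolation with a small sketch. It needs OTP 24 or later, and the module name is made up. It uses `erlang:alias/0` and `erlang:unalias/1` directly, whereas gen.erl gets the same effect from a monitor created with `{alias, demonitor}`.

```erlang
-module(alias_sketch).
-export([demo/0]).

demo() ->
    Alias = erlang:alias(),
    %% While the alias is active, messages sent to it reach this process.
    Alias ! {Alias, on_time},
    First = receive {Alias, R1} -> R1 after 100 -> nothing end,
    %% Deactivate the alias, as the demonitor on timeout does in gen.erl.
    erlang:unalias(Alias),
    %% A "late reply" sent to the dead alias is silently dropped.
    Alias ! {Alias, too_late},
    Second = receive {Alias, R2} -> R2 after 100 -> nothing end,
    {First, Second}.
```

`alias_sketch:demo()` returns `{on_time, nothing}`: the second message never lands in the mailbox, so nothing stale is left behind after a timeout.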
Recommended reading, another blog post I wrote earlier:
On Erlang timeouts.