Problem
This was a production incident. Under a fairly low QPS (2,000 database accesses per second), with 100 worker processes and max_overflow set to 100, one service node suddenly dropped to 1,500 database accesses per second. Request latency rose from a few ms to several hundred ms, then gradually recovered.
Cause
We gradually narrowed the problem down to the checkout path of the poolboy worker pool used for mongodb:
checkout
handle_call({checkout, CRef, Block}, {FromPid, _} = From, State) ->
    #state{supervisor = Sup,
           workers = Workers,
           monitors = Monitors,
           overflow = Overflow,
           max_overflow = MaxOverflow} = State,
    case Workers of
        [Pid | Left] ->
            MRef = erlang:monitor(process, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{workers = Left}};
        [] when MaxOverflow > 0, Overflow < MaxOverflow ->
            {Pid, MRef} = new_worker(Sup, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{overflow = Overflow + 1}};
        [] when Block =:= false ->
            {reply, full, State};
        [] ->
            MRef = erlang:monitor(process, FromPid),
            Waiting = queue:in({From, CRef, MRef}, State#state.waiting),
            {noreply, State#state{waiting = Waiting}}
    end;
As the code shows, when max_overflow is non-zero, a momentary overload creates new workers on demand, and each new worker has to connect to mongodb, which takes 1-2 ms. Because new_worker/2 runs inside handle_call, that cost is paid by the pool's master process: at 1-2 ms per connection, the master can service at most roughly 500-1,000 checkouts per second while overflow workers are being created.
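For context, here is a sketch of what new_worker/2 does (paraphrased from poolboy's source; details vary by version): it synchronously starts a child under the pool supervisor, so the worker's entire init, including the mongodb connect, finishes before handle_call can return.

```erlang
%% Paraphrased sketch, not verbatim poolboy source. The key point is
%% that supervisor:start_child/2 blocks the calling (master) process
%% until the worker's init/1 -- and thus its mongodb connect -- returns.
new_worker(Sup, FromPid) ->
    {ok, Pid} = supervisor:start_child(Sup, []),
    Ref = erlang:monitor(process, FromPid),
    {Pid, Ref}.
```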
checkin
And on check-in, the overflow worker is destroyed again, so connections are created and torn down over and over, all serialized through the master process. Every request ends up blocked behind the master's connection setup and teardown, and QPS collapses.
handle_checkin(Pid, State) ->
    #state{supervisor = Sup,
           waiting = Waiting,
           monitors = Monitors,
           overflow = Overflow,
           strategy = Strategy} = State,
    case queue:out(Waiting) of
        {{value, {From, CRef, MRef}}, Left} ->
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            gen_server:reply(From, Pid),
            State#state{waiting = Left};
        {empty, Empty} when Overflow > 0 ->
            ok = dismiss_worker(Sup, Pid),
            State#state{waiting = Empty, overflow = Overflow - 1};
        {empty, Empty} ->
            Workers = case Strategy of
                lifo -> [Pid | State#state.workers];
                fifo -> State#state.workers ++ [Pid]
            end,
            State#state{workers = Workers, waiting = Empty, overflow = 0}
    end.
Conclusion
Avoid poolboy's max_overflow whenever creating or destroying a child process carries real cost: it is easy to block the poolboy master process, and the constant creation and destruction of workers turns a brief spike into an avalanche.
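If the load genuinely needs the extra capacity, provision it all at startup instead. A minimal sketch, assuming a poolboy + mongodb-erlang style setup (the pool name, size, and worker args here are illustrative):

```erlang
%% Fixed-size pool: all workers (and their mongodb connections) are
%% created once at startup; max_overflow = 0 means the master never
%% creates or destroys workers on the request path.
PoolArgs = [{name, {local, mongo_pool}},   %% illustrative name
            {worker_module, mc_worker},
            {size, 200},
            {max_overflow, 0}],
WorkerArgs = [{database, <<"test">>}],     %% illustrative args
{ok, Pool} = poolboy:start_link(PoolArgs, WorkerArgs).
```

The trade-off is idle memory for steady latency: workers that would have been overflow now sit in the free list instead of being churned.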
Looking back, every bug seems obvious once found, yet tracking this one down took real effort. The monitoring data can't be published on a personal blog, so I have omitted much of the deduction; I hope the conclusion itself is useful.