Plain-language walkthrough
Offline learning
Essentially, every moment in time and every spatial location is discretized into spatiotemporal grid cells, and for each cell the expected revenue from that cell until the end of the day is computed from the dispatch records (including drivers who participated in dispatching but received no order).
Key question: how is the expected revenue computed?
Dynamic-programming idea: suppose the time horizon is the interval [0, T). First compute the expected revenue of every grid cell at time T - 1 (at that point the future revenue is 0, so only the immediate revenue remains), which amounts to taking the mean of the observed immediate revenue; then compute the expected revenue of every grid cell at time T - 2; and so on, sweeping backwards in time.
In this way, the expected revenue of every spatiotemporal cell up to the end of the day can be computed.
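A minimal sketch of this backward sweep, assuming historical transition records of the form (slot, cell, reward, next slot, next cell), where idle movements carry zero reward; the names (`offline_learning`, `records`, `num_slots`, `gamma`) are illustrative, not taken from the paper's code:

```python
from collections import defaultdict

# Minimal sketch of the backward (dynamic-programming) value update.
# Each record is (t, cell, reward, t_next, cell_next); idle moves have reward 0.
def offline_learning(records, num_slots, gamma=0.9):
    V = defaultdict(float)              # V[(t, cell)] -> expected revenue-to-go
    by_slot = defaultdict(list)
    for t, cell, reward, t_next, cell_next in records:
        by_slot[t].append((cell, reward, t_next, cell_next))

    # Sweep the time slots backwards: values at later slots are already final.
    for t in range(num_slots - 1, -1, -1):
        sums, counts = defaultdict(float), defaultdict(int)
        for cell, reward, t_next, cell_next in by_slot[t]:
            future = V[(t_next, cell_next)] if t_next < num_slots else 0.0
            sums[cell] += reward + gamma ** (t_next - t) * future
            counts[cell] += 1
        for cell in sums:
            V[(t, cell)] = sums[cell] / counts[cell]   # empirical mean
    return V
```

Each cell's value is simply the empirical mean of "immediate reward plus discounted value of wherever the driver ends up", which is why the sweep at T - 1 reduces to a plain average of the immediate revenue.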
Key point: why is the value function obtained this way reasonable?
The resultant value function captures spatiotemporal patterns of both the demand side and the supply side. To make it clearer, as a special case, when using no discount and an episode-length of a day, the state-value function in fact corresponds to the expected revenue that this driver will earn on average from the current time until the end of the day.
Online planning
The matching score between an order and a driver is described by a formula with the following behavior (a hedged reconstruction is sketched after this list):
The higher the order price, the higher the matching score.
The higher the value of the driver's current location, the lower the matching score.
The higher the value of the order's destination (the driver's future location), the higher the matching score.
The pickup distance enters implicitly: a larger pickup distance means a longer expected completion time, hence a smaller discount factor and a lower matching score.
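The paper's exact notation is not reproduced in these notes; the following TD-advantage-style edge weight is a hedged reconstruction that exhibits all four properties above (the symbols are my own labels):

$$
A_{ij} \;=\; R_j \;+\; \gamma^{\Delta t_{ij}}\, V(s'_j) \;-\; V(s_i)
$$

Here $R_j$ is the price of order $j$, $V(s_i)$ is the value of driver $i$'s current spatiotemporal cell, $V(s'_j)$ is the value of the order's destination cell at the estimated completion time, and $\Delta t_{ij}$ covers both the pickup and the trip duration, so a longer pickup distance shrinks the discount factor $\gamma^{\Delta t_{ij}}$.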
The KM (Kuhn-Munkres) algorithm is then used to solve the resulting bipartite matching.
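A minimal sketch of the online planning step under the reconstruction above. SciPy's `linear_sum_assignment` is a Hungarian/KM-style solver that minimizes total cost, so the weights are negated; the driver and order attributes (`pickup_eta`, `trip_duration`, `price`, `end_slot`, `dest_cell`, `slot`, `cell`) are hypothetical placeholders:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of one dispatch round: score every driver-order pair with the
# reconstructed edge weight, then solve the assignment problem.
def dispatch(drivers, orders, V, gamma=0.9):
    A = np.zeros((len(drivers), len(orders)))
    for i, d in enumerate(drivers):
        for j, o in enumerate(orders):
            dt = o.pickup_eta(d) + o.trip_duration        # pickup + trip, in slots
            A[i, j] = (o.price
                       + gamma ** dt * V.get((o.end_slot, o.dest_cell), 0.0)
                       - V.get((d.slot, d.cell), 0.0))
    # linear_sum_assignment minimizes cost, so negate to maximize total score.
    rows, cols = linear_sum_assignment(-A)
    return [(drivers[i], orders[j]) for i, j in zip(rows, cols)]
```

A practical variant would likely also mask pairs whose score falls below some threshold, so that a driver can stay unassigned instead of taking a clearly unprofitable order.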
Evaluation
A/B test design
we adopted a customized A/B testing design that splits traffic according to large time slices (three or six hours). For example, a three-hour split sets the first three hours in Day 1 to run variant A and the next three hours for variant B. The order is then reversed for Day 2. Such experiments will last for two weeks to eliminate the daily difference. We select large time slices to observe long-term impacts generated by order dispatch approaches.
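To make the split concrete, here is an illustrative rendering of that schedule, assuming the A/B alternation continues through each day and the starting variant flips on alternate days (the function name and defaults are mine, not the paper's):

```python
# Sketch of the time-sliced A/B schedule: consecutive slices alternate
# variants within a day, and the starting variant flips on the next day.
def ab_schedule(days=14, slice_hours=3):
    schedule = {}
    for day in range(days):
        order = ("A", "B") if day % 2 == 0 else ("B", "A")
        for s in range(24 // slice_hours):
            schedule[(day, s)] = order[s % 2]
    return schedule
```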
Observed gains
the performance improvement brought by the MDP method is consistent in all cities, with gains in global GMV and completion rate ranging from 0.5% to 5%. Consistent to the previous discoveries, the MDP method achieved its best performance gain in cities with high order-driver ratios. Meanwhile, the averaged dispatch time was nearly identical to the baseline method, indicating little sacrifice in user experience.
Value function visualization
Framing this as reinforcement learning
The spatiotemporal cell is defined as the state; dispatching or not dispatching is defined as the action; and the expected revenue of a state is defined as the state-value function.
The goal of reinforcement learning is to find the optimal policy, which is equivalent to finding the optimal value function. What makes the dispatch setting unusual is that each driver is modeled as an agent, yet the decisions are made by the platform, so the drivers effectively have no policy of their own; put differently, through the dispatch mechanism the drivers' policies are unified into maximizing the platform's expected revenue. Under the reinforcement-learning framework, the offline learning and the online planning can therefore be viewed as the two steps of policy iteration: learning updates the value function (policy evaluation), and planning performs the policy update (policy improvement). On closer inspection, though, this framing still feels somewhat forced.
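Under that (admittedly loose) reading, the two phases line up as the two halves of policy iteration. A schematic sketch, reusing the illustrative `offline_learning` and `dispatch` functions above; in the actual system the evaluation step runs offline on historical logs while the improvement step runs online at every dispatch window:

```python
# Schematic only: policy evaluation = offline value learning from logs,
# policy improvement = online matching against the learned values.
def policy_iteration_view(historical_records, num_slots, dispatch_windows):
    V = offline_learning(historical_records, num_slots)    # policy evaluation
    for drivers, orders in dispatch_windows:                # policy improvement
        yield dispatch(drivers, orders, V)
```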