On deep learning: a MindSpore pitfall with Cosine error on Ascend

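Over the past two days I ran into a classic issue. Why classic? Because it is a typical case that may show a serious mismatch in understanding between framework developers and algorithm engineers. Here is the issue first; interested readers can go and read the whole thread.

[AICC] The CosineDecayLR cosine learning-rate implementation force-casts to float32 for its computation (currently the source can only be tested with fp32), which produces a negative learning rate and hurts the model's final convergence!!!

The teacher (maybe a student?) who wrote the model used three exclamation marks to express their displeasure. Let me briefly describe the problem: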

P.Cos()(Tensor(math.pi, mstype.float32))

result: -1.000004

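The Cos operator, which computes the cosine function, returns a result with an error. Normally cos(pi) = -1.0, but the result computed with MindSpore carries an extra -4e-6. In general, if a network uses the Cos operator, the impact is small. But... this problem occurs in CosineDecayLR, which uses the cosine function to adjust the learning rate dynamically, and there it causes trouble: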

P.Cos()(Tensor(math.pi, mstype.float32)) + 1.0

result: -4.053116e-06

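Now the impact is very large. It is basic knowledge that the learning rate must not be negative, otherwise the gradient update goes in the opposite direction. In addition, for models such as BERT the learning rate is typically on the order of 1e-5, so an error of this size seriously affects gradient descent.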
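To make the sign flip concrete, here is a minimal pure-Python sketch of the schedule formula used by the CosineDecayLR code in this post, lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(pi * step / decay_steps)). The names `cosine_decay_lr` and `ascend_like_cos` are hypothetical, and the faulty cosine is only a stand-in that reproduces the printed value at pi (34 float32 ULPs below -1.0, inferred from the -4.053116e-06 output above), not the real Ascend kernel.

```python
import math

# Stand-in for the Ascend Cos result at pi: -(1 + 34 * 2**-23), about -1.000004.
# Assumption: inferred from the printed -4.053116e-06, not from the kernel itself.
FAULTY_COS_PI = -(1.0 + 34 * 2.0 ** -23)


def ascend_like_cos(x):
    """Hypothetical cosine that is only wrong at x == pi."""
    return FAULTY_COS_PI if x == math.pi else math.cos(x)


def cosine_decay_lr(min_lr, max_lr, step, decay_steps, cos_fn=math.cos):
    """Cosine decay from max_lr (step 0) to min_lr (step >= decay_steps)."""
    p = min(step, decay_steps)
    delta = 0.5 * (max_lr - min_lr)
    return min_lr + delta * (1.0 + cos_fn(math.pi * (p / decay_steps)))


exact_lr = cosine_decay_lr(0.0, 2e-5, step=1000, decay_steps=1000)
faulty_lr = cosine_decay_lr(0.0, 2e-5, step=1000, decay_steps=1000,
                            cos_fn=ascend_like_cos)
print(exact_lr)   # end of schedule with an exact cosine
print(faulty_lr)  # end of schedule with the faulty cosine
```

With min_lr = 0 the faulty schedule ends below zero. The magnitude scales with 0.5 * (max_lr - min_lr), but it is the sign that reverses the update direction.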

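Operator accuracy within tolerance = functionally correct?
Since the problem is serious, how should it be solved? Let me briefly summarize the exchange between the issue creator and the expert.

Issue creator: CosineDecayLR produces a negative learning rate, which hurts the model's final convergence!!! The error is -4.053116e-06.
Expert: An error of 4 parts per million is within the reasonable computational error range for an operator.
Issue creator: But once my learning rate goes negative, my model's gradient update direction is reversed, and the loss goes from stable to gradually rising......
This is why I call the case typical. From the hardware chip to the driver enablement layer, and then to the operator library and the framework, each layer looks at things from a different angle. For Ascend (or rather CANN), the operator's accuracy error is within a reasonable range, so the operator can pass acceptance and be released. At that point, assuming it is correct and handing it to the upper-layer wrapper (that is, MindSpore) is not a problem in itself either.

However!!! If the people developing and testing a deep learning framework lack sufficient background knowledge (really, common knowledge), problems like this appear.

Clearly, an operator meeting its accuracy tolerance is by no means equivalent to functional correctness. An API like CosineDecayLR should properly verify the problems that boundary conditions can trigger. To complain once more, and once more again: AI framework developers need a foundation in deep learning!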
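The gap between the two viewpoints can be sketched as two different checks on the same value. The tolerance `rtol=1e-4` and the helper names are assumptions for illustration, not the actual CANN or MindSpore acceptance criteria; the Cos value is the one printed earlier in this post.

```python
cos_pi_reported = -1.000004  # value printed by P.Cos() on Ascend in this post


def within_tolerance(actual, expected, rtol=1e-4):
    """Operator-level view: relative error below an assumed threshold."""
    return abs(actual - expected) <= rtol * abs(expected)


def lr_in_bounds(cos_value, min_lr, max_lr):
    """API-level view: the resulting learning rate must stay in [min_lr, max_lr]."""
    lr = min_lr + 0.5 * (max_lr - min_lr) * (1.0 + cos_value)
    return min_lr <= lr <= max_lr


operator_ok = within_tolerance(cos_pi_reported, -1.0)
schedule_ok = lr_in_bounds(cos_pi_reported, min_lr=0.0, max_lr=2e-5)
print(operator_ok, schedule_ok)  # True False
```

The same number passes the operator-level check and fails the schedule-level one, which is the kind of boundary test an API such as CosineDecayLR could carry.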

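How GPU and Ascend handle sine/cosine error
Back to the problem itself: since the cosine function has an error, the sine function deserves a look too. I then ran the same thing on GPU and found something interesting.

Ascend: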

P.Cos()(Tensor(math.pi, mstype.float32))

result: -1.000004

P.Sin()(Tensor(math.pi, mstype.float32))

result: 0.0

GPU:

P.Cos()(Tensor(math.pi, mstype.float32))

result: -1.0

P.Sin()(Tensor(math.pi, mstype.float32))

result: -8.7423e-08

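This result is quite thought-provoking: on Ascend, Sin shows no accuracy error, and the GPU is exactly the opposite. To confirm that this is not a MindSpore problem, I also ran it with PyTorch: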

torch.cos(torch.tensor(math.pi))

result: -1.

torch.sin(torch.tensor(math.pi))

result: -8.7423e-08

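This confirms that PyTorch shows the same error, but the GPU presumably applies some handling to Cos. Considering that Cos generally has more use cases (building networks, learning rates, even weight initialization), that handling is understandable. On Ascend it is Sin that is error-free, exactly the opposite of the GPU, and I do not know what consideration is behind that. But from the standpoint of using MindSpore across platforms, the same CosineDecayLR code will undoubtedly behave very differently here.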
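One way to read these numbers, offered as a side calculation rather than a statement about how either backend is implemented: float32 cannot represent pi exactly, and the nearest float32 value is slightly larger than pi. Evaluating sine and cosine at that rounded input in double precision and rounding the result back to float32 can be done on the host. The helper `to_float32` is a hypothetical name; it uses only the standard `struct` module.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


pi32 = to_float32(math.pi)            # the input the operators actually receive
sin_ref = to_float32(math.sin(pi32))  # reference sine of the rounded input
cos_ref = to_float32(math.cos(pi32))  # reference cosine of the rounded input

print(pi32 - math.pi)  # rounding error of the input, about 8.74e-08
print(sin_ref)         # about -8.74e-08, roughly -(pi32 - pi)
print(cos_ref)         # rounds to -1.0 in float32
```

Under this reading, -8.7423e-08 for Sin and -1.0 for Cos are what a correctly rounded float32 computation would give for this input, so the GPU may not need any special handling for Cos, and the 0.0 and -1.000004 on Ascend are the values that deviate. This is an interpretation of the printed outputs, not a confirmed description of either kernel.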

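CosineDecayLR fixes (workarounds)
Option 1
Following the suggestion from @用什么名字没那么重要, clipping the value directly is the more appropriate approach, and the error problem no longer appears.

The code is as follows: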

import math

import mindspore.ops as P
import mindspore.common.dtype as mstype
from mindspore.nn.learning_rate_schedule import LearningRateSchedule


class CosineDecayLR(LearningRateSchedule):

    def __init__(self, min_lr, max_lr, decay_steps):
        super(CosineDecayLR, self).__init__()
        if not isinstance(min_lr, float):
            raise TypeError("For 'CosineDecayLR', the argument 'min_lr' must be type of float, "
                            "but got 'min_lr' type: {}.".format(type(min_lr)))
        if min_lr >= max_lr:
            raise ValueError("For 'CosineDecayLR', the 'max_lr' should be greater than the 'min_lr', "
                             "but got 'max_lr' value: {}, 'min_lr' value: {}.".format(max_lr, min_lr))
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.decay_steps = decay_steps
        self.math_pi = math.pi
        self.delta = 0.5 * (max_lr - min_lr)
        self.cos = P.Cos()
        self.min = P.Minimum()
        self.max = P.Maximum()
        self.cast = P.Cast()

    def construct(self, global_step):
        # Cap the step at decay_steps, then compute in float32.
        p = self.cast(self.min(global_step, self.decay_steps), mstype.float32)
        # Clamp (1 + cos) at 0.0 so operator error cannot push the rate below min_lr.
        return self.min_lr + self.delta * self.max((1.0 + self.cos(self.math_pi * (p / self.decay_steps))), 0.0)
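The effect of the clamp can be checked without a device. The sketch below mirrors the `construct` arithmetic in plain Python, with the built-in `max` standing in for `P.Maximum`. The names `clipped_cosine_decay_lr` and `reported_cos` are hypothetical, and the cosine is injected so that the value printed on Ascend can be replayed.

```python
import math


def clipped_cosine_decay_lr(min_lr, max_lr, step, decay_steps, cos_fn=math.cos):
    """Cosine decay with the (1 + cos) factor clamped at zero."""
    p = min(step, decay_steps)
    delta = 0.5 * (max_lr - min_lr)
    factor = max(1.0 + cos_fn(math.pi * (p / decay_steps)), 0.0)
    return min_lr + delta * factor


def reported_cos(_x):
    """Replay the value printed on Ascend for cos(pi), whatever the input."""
    return -1.000004


end_lr = clipped_cosine_decay_lr(1e-6, 2e-5, step=500, decay_steps=500,
                                 cos_fn=reported_cos)
start_lr = clipped_cosine_decay_lr(1e-6, 2e-5, step=0, decay_steps=500)
print(end_lr, start_lr)
```

The clamp only removes the negative excursion at the end of the schedule; it does not change the operator's error at other inputs.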

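Option 2
With the analysis above, fixing or working around the problem from the front end is fairly simple: since the Sin operator does not show the error, just use Sin in place of Cos: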

cos(a) = sin(a + pi/2)
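A quick numeric sanity check of the identity, run in double precision on the host; it says nothing about the float32 error of either device operator. The sampling grid and the name `max_identity_gap` are arbitrary choices for this sketch.

```python
import math


def max_identity_gap(num_points=1001):
    """Largest |cos(a) - sin(a + pi/2)| for a sampled uniformly on [0, pi]."""
    gap = 0.0
    for i in range(num_points):
        a = math.pi * i / (num_points - 1)
        gap = max(gap, abs(math.cos(a) - math.sin(a + math.pi / 2)))
    return gap


print(max_identity_gap())  # tiny, on the order of double rounding error
```

In the schedule, a = pi * p / decay_steps lies in [0, pi], so the Sin argument becomes pi * (p / decay_steps + 0.5).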

The formula is simple too; just modify the CosineDecayLR source directly.

import math

import mindspore.ops as P
import mindspore.common.dtype as mstype
from mindspore import context
from mindspore.nn.learning_rate_schedule import LearningRateSchedule


class CosineDecayLR(LearningRateSchedule):

    def __init__(self, min_lr, max_lr, decay_steps):
        super(CosineDecayLR, self).__init__()
        if not isinstance(min_lr, float):
            raise TypeError("For 'CosineDecayLR', the argument 'min_lr' must be type of float, "
                            "but got 'min_lr' type: {}.".format(type(min_lr)))
        if min_lr >= max_lr:
            raise ValueError("For 'CosineDecayLR', the 'max_lr' should be greater than the 'min_lr', "
                             "but got 'max_lr' value: {}, 'min_lr' value: {}.".format(max_lr, min_lr))
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.decay_steps = decay_steps
        self.math_pi = math.pi
        self.delta = 0.5 * (max_lr - min_lr)
        self.cos = P.Cos()
        self.sin = P.Sin()
        self.min = P.Minimum()
        self.cast = P.Cast()
        # Choose the code path once, based on the configured device target.
        self.is_ascend = context.get_context("device_target") == "Ascend"

    def construct(self, global_step):
        # Cap the step at decay_steps, then compute in float32.
        p = self.cast(self.min(global_step, self.decay_steps), mstype.float32)
        if self.is_ascend:
            # cos(a) = sin(a + pi/2): use Sin on Ascend.
            return self.min_lr + self.delta * (1.0 + self.sin(self.math_pi * (p / self.decay_steps + 0.5)))
        return self.min_lr + self.delta * (1.0 + self.cos(self.math_pi * (p / self.decay_steps)))

Measured in practice:

P.Cos()(Tensor(math.pi, mstype.float32))

result: -1.000004

P.Sin()(Tensor(math.pi * (1 + 0.5), mstype.float32))

result: -0.9999996

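There is still an error, but the cos(pi) + 1.0 < 0.0 situation no longer occurs, so the learning rate does not go negative and the gradient update is not reversed. The accuracy problem remains, though, and the serious mismatch in understanding between framework developers and algorithm engineers deserves more attention.

That's all.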