Vanishing and Exploding Gradients in RNNs

$$\begin{equation}S_t = f(UX_t+WS_{t-1})\tag{1}\end{equation}$$
$$\begin{equation}O_t=g(VS_t)\tag{2}\end{equation}$$
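Here $f$ and $g$ are activation functions, and $U$, $W$, $V$ are the parameters of the RNN.
Suppose the loss at time $T$ is $L_T$. During backpropagation, the gradient with respect to $W$ that reaches time step $t$ is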
$$\begin{equation}[\frac{\partial L_T}{\partial W}]_t^T=\frac{\partial L_T}{\partial O_T}\frac{\partial O_T}{\partial S_T}\left(\prod_{k=t}^{T-1}\frac{\partial S_{k+1}}{\partial S_k}\right)\frac{\partial S_t}{\partial W}\tag{3}\end{equation}$$

Taking the partial derivative of $S_k$ with respect to $S_{k-1}$ (see a matrix calculus reference for the differentiation rules used here) gives
$$\begin{equation}\frac{\partial S_k}{\partial S_{k-1}}=\frac{\partial f(UX_k+WS_{k-1})}{\partial (UX_k+WS_{k-1})}\frac{\partial (UX_k+WS_{k-1})}{\partial S_{k-1}}=\operatorname{diag}(f'(UX_k+WS_{k-1}))W\tag{4}\end{equation}$$
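
Equation (4) can be checked numerically. The sketch below assumes $f=\tanh$ and uses made-up 2x2 weights; `step`, `analytic_jacobian` and `numeric_jacobian` are hypothetical helper names. It compares $\operatorname{diag}(f'(z))W$ against central finite differences of the recurrence.

```python
import math

def step(U, W, x, s_prev):
    """Equation (1) with f = tanh (an assumed choice of activation)."""
    n = len(W)
    z = [sum(U[i][j] * x[j] for j in range(len(x)))
         + sum(W[i][j] * s_prev[j] for j in range(n)) for i in range(n)]
    return [math.tanh(v) for v in z]

def analytic_jacobian(U, W, x, s_prev):
    """diag(f'(z)) W from equation (4); tanh'(z) = 1 - tanh(z)^2."""
    s = step(U, W, x, s_prev)
    n = len(W)
    return [[(1 - s[i] ** 2) * W[i][j] for j in range(n)] for i in range(n)]

def numeric_jacobian(U, W, x, s_prev, eps=1e-6):
    """Central finite differences of step() with respect to s_prev."""
    n = len(W)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(s_prev), list(s_prev)
        plus[j] += eps
        minus[j] -= eps
        s_plus, s_minus = step(U, W, x, plus), step(U, W, x, minus)
        for i in range(n):
            J[i][j] = (s_plus[i] - s_minus[i]) / (2 * eps)
    return J

# Illustrative values only.
U = [[0.5, -0.3], [0.1, 0.8]]
W = [[0.2, 0.4], [-0.6, 0.1]]
x, s_prev = [1.0, -1.0], [0.3, -0.2]
J_analytic = analytic_jacobian(U, W, x, s_prev)
J_numeric = numeric_jacobian(U, W, x, s_prev)
max_error = max(abs(a - b)
                for row_a, row_b in zip(J_analytic, J_numeric)
                for a, b in zip(row_a, row_b))
print("largest difference between the two Jacobians:", max_error)
```

Row $i$ of the analytic Jacobian is row $i$ of $W$ scaled by $f'(z_i)$, which is what left-multiplying by a diagonal matrix does.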

Two activation functions are commonly used in RNNs: sigmoid and tanh. With sigmoid, $f(z)=\frac{1}{1+e^{-z}}$, the derivative is $f'(z)=f(z)(1-f(z))$, which lies in $(0, 0.25]$. With tanh, $f(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$, the derivative is $f'(z)=1-f(z)^2$, which lies in $(0, 1]$. In addition, within one RNN layer the same $W$ is shared across all time steps.
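
The two derivative ranges are easy to confirm by scanning a grid of inputs. This is a small sketch; the grid from -10 to 10 is an arbitrary choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    """Derivative of sigmoid: f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

grid = [i / 100 for i in range(-1000, 1001)]   # -10.00, -9.99, ..., 10.00
max_d_sigmoid = max(d_sigmoid(z) for z in grid)
max_d_tanh = max(d_tanh(z) for z in grid)
print("max sigmoid'(z) on the grid:", max_d_sigmoid)   # attained at z = 0
print("max tanh'(z) on the grid:", max_d_tanh)         # attained at z = 0
```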
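Consequently, with sigmoid or tanh as the activation function, $f'(z)$ is almost always between 0 and 1. If the values of $W$ are also between 0 and 1 (more precisely, if the norm of $\operatorname{diag}(f'(z))W$ stays below 1), equation (3) multiplies $f'(z)$ and $W$ together many times and the result tends to 0, which causes the vanishing-gradient problem. If the values of $W$ are very large, the repeated multiplication instead causes the exploding-gradient problem.
With relu as the activation function, $f(z)=\max(0,z)$, the derivative is exactly 1 for $z>0$, which alleviates vanishing gradients to some extent; if the values of $W$ are very large, however, gradients can still explode. Moreover, for $z<0$ the derivative is exactly 0, so some neurons can never be activated (a small learning rate partly mitigates this).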
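
The effect of the repeated product is easiest to see in the scalar case, where each step contributes one factor $f'(z)\,w$. The sketch below makes a simplifying assumption that the pre-activation $z$ is the same at every step; the weights 0.9 and 1.5 and the step count 50 are arbitrary.

```python
import math

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def jacobian_product(w, steps, activation_grad, z=0.5):
    """Product of `steps` scalar factors f'(z) * w, mimicking equations (3) and (4).

    Assumption: the same pre-activation z at every step, to keep the sketch simple.
    """
    product = 1.0
    for _ in range(steps):
        product *= activation_grad(z) * w
    return product

vanishing = jacobian_product(w=0.9, steps=50, activation_grad=tanh_grad)
exploding = jacobian_product(w=1.5, steps=50, activation_grad=relu_grad)
print(f"tanh, w = 0.9, 50 steps: {vanishing:.2e}")
print(f"relu, w = 1.5, 50 steps: {exploding:.2e}")
```

With tanh every factor is below 1 and the product collapses toward 0. With relu the activation factor is 1, so the product is just $w^{50}$ and grows without bound once $w>1$.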

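Vanishing gradients mean different things in a DNN and in an RNN. In a DNN, the gradient shrinks as it propagates backward toward the lower layers, so the parameters of the lower layers are barely updated while the higher layers still are. In an RNN, the problem is that the gradient from time step $T$ cannot reach time step $t$ (with $t<T$ and $t$ far from $T$), so the update at time step $t$ is influenced only by nearby time steps $T'$ ($t\leq T'$, with $T'$ close to $t$). The parameters of the RNN can still be updated, because all time steps of a layer share the same parameters, so no parameter goes without updates; what is lost is the ability to learn long-term dependencies.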

As shown in the figure, when updating the parameters at time step 1, equation (3) says the gradient to consider is
$$\begin{equation}\sum_{T=1}^{6}[\frac{\partial L_T}{\partial W}]_1^T\tag{5}\end{equation}$$
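However, after backpropagating through many time steps, the gradients $[\frac{\partial L_6}{\partial W}]_1^6$ and $[\frac{\partial L_5}{\partial W}]_1^5$ may vanish, so the update of $W$ at time step 1 only takes into account the nearby time steps 1, 2, 3 and 4. This is why long-range dependencies cannot be learned. It differs from vanishing gradients in a DNN because the parameter $W$ of the RNN can still be updated: an RNN layer shares its parameters across time steps, whereas different layers of an MLP or CNN have different parameters. In the end, the gradient used to update $W$ is the sum of the gradients for $W$ from all time steps.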
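
A scalar RNN makes the terms of equation (5) easy to inspect. The sketch below assumes $f=\tanh$, a loss $L_T=S_T$, an initial state of 0.5, and made-up inputs and weights. `contribution` is a hypothetical helper that multiplies the per-step factors $f'(\cdot)\,w$ from $T$ back to $t$ and then the direct dependence of $S_t$ on $w$.

```python
import math

def forward(w, u, xs, s0=0.5):
    """Scalar RNN s_k = tanh(u * x_k + w * s_{k-1}); returns [s_0, s_1, ..., s_n]."""
    states = [s0]
    for x in xs:
        states.append(math.tanh(u * x + w * states[-1]))
    return states

def contribution(w, states, t, T):
    """Gradient of L_T = s_T with respect to w that flows through time step t."""
    grad = 1.0                                   # dL_T/ds_T under the assumed loss
    for k in range(T, t, -1):                    # ds_k/ds_{k-1} for k = T, ..., t+1
        grad *= (1 - states[k] ** 2) * w
    return grad * (1 - states[t] ** 2) * states[t - 1]   # direct ds_t/dw

xs = [1.0, 0.5, -0.3, 0.8, -0.6, 0.2]            # six time steps, illustrative inputs
w, u = 0.5, 1.0
states = forward(w, u, xs)

# The six terms of equation (5): what each loss L_T sends back to time step 1.
from_step_1 = [contribution(w, states, 1, T) for T in range(1, 7)]
# The full gradient of L_6 with respect to w sums the contributions of all steps.
total_for_L6 = sum(contribution(w, states, t, 6) for t in range(1, 7))

for T, g in enumerate(from_step_1, start=1):
    print(f"T={T}: contribution reaching step 1 = {g:+.2e}")
print(f"dL_6/dw summed over all time steps = {total_for_L6:+.4f}")
```

The contributions reaching step 1 shrink quickly as $T$ grows, while the total gradient for $w$ stays well away from zero because the nearby time steps dominate the sum.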

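Building on the original RNN, LSTM and GRU were proposed; by introducing gating mechanisms they alleviate the vanishing-gradient problem to some extent. The purpose of the gates is to turn the repeated multiplication of activation-function derivatives into addition.
Take LSTM as an example: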

$$\begin{equation}c^{(t)}=f^{(t)}\odot c^{(t-1)}+i^{(t)}\odot\tilde{c}^{(t)}\tag{6}\end{equation}$$
$$\begin{equation}h^{(t)}=o^{(t)}\odot \tanh(c^{(t)})\tag{7}\end{equation}$$
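Suppose the loss at time $T$ is $L_T$, and consider the derivative of $L_T$ with respect to $c^{(t)}$. From equations (6) and (7) there are two paths, $L_T\to c^{(t+1)}\to c^{(t)}$ and $L_T\to h^{(t)}\to c^{(t)}$, that is,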
$$\begin{equation}\begin{aligned}\frac{\partial L_T}{\partial c^{(t)}}&=\frac{\partial L_T}{\partial c^{(t+1)}}\frac{\partial c^{(t+1)}}{\partial c^{(t)}}+\frac{\partial L_T}{\partial h^{(t)}}\frac{\partial h^{(t)}}{\partial c^{(t)}}\\&=\frac{\partial L_T}{\partial c^{(t+1)}}\odot f^{(t+1)}+\frac{\partial L_T}{\partial h^{(t)}}\odot o^{(t)}\odot (1-\tanh^2(c^{(t)}))\end{aligned}\tag{8}\end{equation}$$

$$\begin{equation}[\frac{\partial L_T}{\partial W_f}]_t=\frac{\partial L_T}{\partial c^{(t)}}\frac{\partial c^{(t)}}{\partial W_f}\tag{9}\end{equation}$$
$$\begin{equation}[\frac{\partial L_T}{\partial W_i}]_t=\frac{\partial L_T}{\partial c^{(t)}}\frac{\partial c^{(t)}}{\partial W_i}\tag{10}\end{equation}$$
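Note that when $f^{(t+1)}$ is 1, the gradient at time $t+1$ is passed on to the previous time step $t$ intact, even if the second term is very small. In that case the gradient along this path does not vanish even for very long sequences. When $f^{(t+1)}$ is 0, the cell state at time $t$ has no effect on the cell state at time $t+1$, and correspondingly the gradient at time $t+1$ is not passed back to time $t$ during backpropagation. The forget gate therefore controls how strongly the gradient decays as it propagates.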
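
For a single unit, the first term of equation (8) applied repeatedly gives a product of forget-gate activations along the direct cell-state path. The sketch below looks only at that path and ignores the others; the gate values are made up.

```python
def cell_path_gradient(forget_gates):
    """Gradient factor along the direct c^(T) -> ... -> c^(t) path of equation (8).

    For one unit this is the product of the forget-gate activations
    f^(t+1), ..., f^(T); all other gradient paths are ignored in this sketch.
    """
    factor = 1.0
    for f in forget_gates:
        factor *= f
    return factor

open_gate = cell_path_gradient([1.0] * 100)                       # gate fully open
leaky_gate = cell_path_gradient([0.9] * 100)                      # gate slightly closed
closed_gate = cell_path_gradient([1.0] * 50 + [0.0] + [1.0] * 49)  # closed once
print("always 1.0:", open_gate)
print("always 0.9:", leaky_gate)
print("closed at one step:", closed_gate)
```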
Stacked LSTMs generally use only 2 to 3 layers.

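Gradients in an LSTM flow along many paths. On the path $c^{(t+1)}\to c^{(t)}$ there are only element-wise multiplications and additions, so the gradient flow is the most stable. On other paths, such as $c^{(t+1)}\to i^{(t+1)}\to h^{(t)}\to c^{(t)}$, the gradient flow resembles that of a plain RNN: the same weight matrices and activation derivatives are multiplied repeatedly, so the gradient can still explode or vanish. However, as equations (8) to (10) show, the gradient of the loss at time $T$ with respect to $W_f$ and $W_i$ at time $t$ is made up of several gradient flows, and has a form like
$$(a_1+a_2)(b_1+b_2+b_3)(c_1+c_2)\cdots$$
In other words, during backpropagation the gradient takes the form of a product of sums, which can be read as: total long-range gradient = sum of the long-range gradients over all paths. Even if the gradient vanishes on the other long-range paths, the total long-range gradient does not vanish as long as one long-range path keeps its gradient (normal gradient + vanished gradient = normal gradient). LSTM therefore alleviates vanishing gradients to some extent. Exploding gradients can still occur, since normal gradient + exploding gradient = exploding gradient. But because the gradient paths in an LSTM are quite tortuous and pass through many more activation functions than in a plain RNN (whose derivatives are at most 1), gradient explosion is much less frequent in LSTM. In practice, exploding gradients are usually handled with gradient clipping.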
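
One common form of gradient clipping rescales all gradients together when their joint L2 norm exceeds a threshold. The text does not say which variant is meant, so the sketch below is only one possibility; `clip_by_global_norm` here is a plain-Python helper written for this example, and the gradient values are made up.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm:
        return [list(vec) for vec in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm

grads = [[3.0, 4.0], [12.0]]          # joint norm is sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print("norm before:", norm_before)
print("norm after:", norm_after)
```

Because every component is scaled by the same factor, the direction of the update is preserved and only its length is capped. TensorFlow ships a function of the same name, `tf.clip_by_global_norm`, built on the same idea.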

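To address vanishing and exploding gradients, IndRNN makes the neurons within a layer independent of one another by slightly modifying equation (1):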

$$\begin{equation}h^{(t)}=\sigma(Wx^{(t)} + u\odot h^{(t-1)}+b)\tag{11}\end{equation}$$
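Here the activation function $\sigma$ is relu. In IndRNN, when the hidden state $h^{(t-1)}$ of the previous time step $t-1$ is used to compute the hidden state $h^{(t)}$ of the current time step $t$, it is no longer multiplied by a recurrent weight matrix (written $U$ in the IndRNN notation, $W$ in equation (1)). Instead, it is combined with a weight vector $u$ through a Hadamard (element-wise) product. This makes the neurons of one RNN cell layer independent of each other: among the dimensions of $h^{(t-1)}$, the $k$-th dimension of $h^{(t)}$ depends only on the $k$-th.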
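
The independence of the neurons can be seen directly in a small sketch of equation (11). All weights and inputs below are made up; `indrnn_step` is a hypothetical helper that processes a single example.

```python
def indrnn_step(W, u, b, x, h_prev):
    """One IndRNN step, equation (11), with relu as the activation."""
    h = []
    for k in range(len(u)):
        pre = sum(W[k][j] * x[j] for j in range(len(x))) + u[k] * h_prev[k] + b[k]
        h.append(max(0.0, pre))       # relu
    return h

W = [[0.5, 0.2], [0.3, 0.7], [0.1, 0.4]]   # input weights, 3 units and 2 inputs
u = [0.9, 1.1, 0.5]                        # one recurrent weight per unit
b = [0.0, 0.1, 0.2]
x = [1.0, 2.0]

base = indrnn_step(W, u, b, x, h_prev=[1.0, 1.0, 1.0])
perturbed = indrnn_step(W, u, b, x, h_prev=[1.0, 5.0, 1.0])   # change unit 1 only
print("base:     ", base)
print("perturbed:", perturbed)
```

Changing the previous state of unit 1 changes only unit 1 of the new state, because each unit reads its own previous value through the scalar $u_k$.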

Applying the same idea of decoupling neurons to LSTM leads to IndyLSTM:

$$\begin{equation}f^{(t)} =\sigma_g(W_fx^{(t)}+u_f\odot h^{(t-1)}+b_f)\tag{12}\end{equation}$$
$$\begin{equation}i^{(t)}=\sigma_g(W_i x^{(t)}+u_i\odot h^{(t-1)}+b_i)\tag{13}\end{equation}$$
$$\begin{equation}o^{(t)}=\sigma_g(W_o x^{(t)}+u_o\odot h^{(t-1)}+b_o)\tag{14}\end{equation}$$
$$\begin{equation}\tilde{c}^{(t)}=\sigma_c(W_c x^{(t)}+u_c\odot h^{(t-1)}+b_c)\tag{15}\end{equation}$$
$$\begin{equation}c^{(t)}=f^{(t)}\odot c^{(t-1)}+i^{(t)}\odot \tilde{c}^{(t)}\tag{16}\end{equation}$$
$$\begin{equation}h^{(t)}=o^{(t)}\odot \sigma_h(c^{(t)})\tag{17}\end{equation}$$
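Decoupling the neurons keeps the gradient flow along the various paths fairly stable during backpropagation, which effectively alleviates both vanishing and exploding gradients.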

In TensorFlow 1.x, LSTM is available as tensorflow.nn.rnn_cell.LSTMCell and IndyLSTM as tensorflow.contrib.rnn.IndyLSTMCell (the contrib module was removed in TensorFlow 2.0).

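The input gate $i$ has weight parameters $W_i$ and $U_i$, the forget gate $f$ has $W_f$ and $U_f$, the output gate $o$ has $W_o$ and $U_o$, and the candidate cell $\tilde{c}$ has $W_c$ and $U_c$. The combined weight parameter _kernel therefore has shape [input_depth + h_depth, 4 * self._num_units].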

    # From the build method of LSTMCell: one fused weight matrix for all four gates.
    self._kernel = self.add_variable(
        _WEIGHTS_VARIABLE_NAME,
        shape=[input_depth + h_depth, 4 * self._num_units],
        initializer=self._initializer,
        partitioner=maybe_partitioner)

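In the call method, the inputs of the current time step are concatenated with the hidden state $h$ of the previous time step and multiplied by the weight matrix; the result is then split into the input gate, the candidate cell, the forget gate and the output gate.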

    # From the call method of LSTMCell.
    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    lstm_matrix = math_ops.matmul(
        array_ops.concat([inputs, m_prev], 1), self._kernel)
    lstm_matrix = nn_ops.bias_add(lstm_matrix, self._bias)

    i, j, f, o = array_ops.split(
        value=lstm_matrix, num_or_size_splits=4, axis=1)
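
The concatenate, multiply, split pattern can be imitated in plain Python. The sketch below handles a single example without a batch dimension, which the TensorFlow code does not; `fused_gates` is a hypothetical helper and the kernel values are chosen only to make the split easy to follow.

```python
def fused_gates(inputs, m_prev, kernel, bias):
    """Concatenate, multiply by the fused kernel, add the bias, split into four parts.

    Works on a single example (no batch dimension), unlike the TensorFlow code.
    Returns the pre-activation values for i, j, f and o.
    """
    concat = inputs + m_prev                      # length input_depth + h_depth
    width = len(kernel[0])                        # 4 * num_units
    matrix = [sum(concat[r] * kernel[r][c] for r in range(len(concat))) + bias[c]
              for c in range(width)]
    n = width // 4
    i, j, f, o = (matrix[k * n:(k + 1) * n] for k in range(4))
    return i, j, f, o

# input_depth = 1 and h_depth = num_units = 2, so the kernel has shape [3, 8].
kernel = [[float(c) for c in range(8)], [10.0] * 8, [100.0] * 8]
bias = [0.0] * 8
i, j, f, o = fused_gates([1.0], [1.0, 0.0], kernel, bias)
print("i:", i, "j:", j, "f:", f, "o:", o)
```

A single matrix product computes all four gate pre-activations at once, and the column order of the kernel decides which slice belongs to which gate.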

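As equations (12) to (15) show, once the neurons are decoupled, $W$ is still a weight matrix applied to the input, but the weight matrix $U$ applied to $h$ becomes a weight vector $u$. _kernel_w has shape [input_depth, 4 * self._num_units], while the weight vector _kernel_u has shape [1, 4 * self._num_units].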

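    # From the build method of IndyLSTMCell: input kernel and recurrent weight vector.
    self._kernel_w = self.add_variable(
        "%s_w" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
        shape=[input_depth, 4 * self._num_units],
        initializer=self._kernel_initializer)
    self._kernel_u = self.add_variable(
        "%s_u" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
        shape=[1, 4 * self._num_units],
        # (the remaining arguments are cut off in this excerpt)
    )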
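
The two sets of shapes imply quite different weight counts. The sketch below simply multiplies out the shapes quoted in the text, assuming h_depth equals num_units (no projection layer) and leaving biases out; the sizes 128 and 256 are arbitrary.

```python
def lstm_weight_count(input_depth, num_units):
    """Weights in the fused LSTM kernel of shape [input_depth + h_depth, 4 * num_units].

    Assumes h_depth == num_units (no projection layer); biases are not counted.
    """
    return (input_depth + num_units) * 4 * num_units

def indylstm_weight_count(input_depth, num_units):
    """Weights in _kernel_w [input_depth, 4 * num_units] plus _kernel_u [1, 4 * num_units]."""
    return input_depth * 4 * num_units + 4 * num_units

lstm_total = lstm_weight_count(input_depth=128, num_units=256)
indy_total = indylstm_weight_count(input_depth=128, num_units=256)
print("LSTM weights:    ", lstm_total)
print("IndyLSTM weights:", indy_total)
```

The recurrent part shrinks from 4 * num_units * num_units weights to 4 * num_units, since each unit keeps one recurrent weight per gate instead of a full row.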

gen_array_ops.tile(h, [1, 4]) * self._kernel_u computes $u_i\odot h$, $u_c\odot h$, $u_f\odot h$ and $u_o\odot h$ in one step.

    # From the call method of IndyLSTMCell.
    gate_inputs = math_ops.matmul(inputs, self._kernel_w)
    gate_inputs += gen_array_ops.tile(h, [1, 4]) * self._kernel_u
    gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    i, j, f, o = array_ops.split(
        value=gate_inputs, num_or_size_splits=4, axis=1)
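
The tiling trick can be reproduced with plain lists. The sketch below handles one example without a batch dimension; the gate vectors are made up and are laid out in the i, j, f, o order of the split.

```python
def tiled_recurrent_term(h, kernel_u):
    """Mimic tile(h, [1, 4]) * kernel_u for one example (no batch dimension)."""
    tiled = h * 4                                  # list repetition: h, h, h, h
    return [a * b for a, b in zip(tiled, kernel_u)]

h = [2.0, 3.0]
u_i, u_c, u_f, u_o = [1.0, 1.0], [0.5, 0.5], [2.0, 0.0], [-1.0, 1.0]
kernel_u = u_i + u_c + u_f + u_o                   # concatenated, length 4 * num_units

term = tiled_recurrent_term(h, kernel_u)
per_gate = [[a * b for a, b in zip(u, h)] for u in (u_i, u_c, u_f, u_o)]
print("tiled product:", term)
print("per gate:     ", per_gate)
```

Repeating $h$ four times lines it up with the four concatenated weight vectors, so one element-wise product yields all four Hadamard products.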

