Natural language processing: Transformer-XL

Absolute positional encoding in the vanilla Transformer:
$$\begin{equation}PE(pos,2i)=\sin(pos/10000^{\frac{2i}{d_{model}}})\tag{1}\end{equation}$$
$$\begin{equation}PE(pos,2i+1)=\cos(pos/10000^{\frac{2i}{d_{model}}})\tag{2}\end{equation}$$

def positional_embedding(pos_seq, inv_freq, bsz=None):
    sinusoid_inp = tf.einsum('i,j->ij', pos_seq, inv_freq)
    pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], -1)
    if bsz is not None:
        return tf.tile(pos_emb[:, None, :], [1, bsz, 1])
    else:
        return pos_emb[:, None, :]

where pos_seq and inv_freq are, respectively,

pos_seq = tf.range(klen - 1, -1, -1.0) 
inv_freq = 1 / (10000 ** (tf.range(0, d_model, 2.0) / d_model))

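The implementation of the position embedding differs slightly from the formulas. sinusoid_inp has shape [len_pos, d_model//2], where len_pos is the sequence length and d_model is the embedding dimension. As a result, in the position embedding that is actually produced, the first d_model//2 dimensions follow Eq. (1) and the last d_model//2 dimensions follow Eq. (2), rather than the two being interleaved.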
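
The layout difference can be checked with a small NumPy sketch. This is a port of the TensorFlow function above for illustration only; `positional_embedding_np` and the toy sizes are assumptions, not part of the original code.

```python
import numpy as np

def positional_embedding_np(pos_seq, inv_freq):
    # Outer product: [len_pos] x [d_model // 2] -> [len_pos, d_model // 2]
    sinusoid_inp = np.einsum('i,j->ij', pos_seq, inv_freq)
    # sin half first, cos half second (concatenated, not interleaved)
    return np.concatenate([np.sin(sinusoid_inp), np.cos(sinusoid_inp)], axis=-1)

klen, d_model = 5, 8                                   # toy sizes (assumed)
pos_seq = np.arange(klen - 1, -1, -1.0)                # [4., 3., 2., 1., 0.]
inv_freq = 1 / (10000 ** (np.arange(0, d_model, 2.0) / d_model))
pos_emb = positional_embedding_np(pos_seq, inv_freq)   # shape (5, 8)

# The interleaved layout of Eqs. (1) and (2), rebuilt from the two halves
interleaved = np.zeros_like(pos_emb)
interleaved[:, 0::2] = pos_emb[:, :d_model // 2]       # PE(pos, 2i)   = sin
interleaved[:, 1::2] = pos_emb[:, d_model // 2:]       # PE(pos, 2i+1) = cos
print(pos_emb.shape)
```

Both layouts contain the same values; only the order of the columns differs, which does not matter once a learned projection is applied on top.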

$$\begin{equation}h_{t+1}=f(h_t,E_{s_{t+1}}+U_{1:L})\tag{3}\end{equation}$$
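The hidden state of the segment at time $t+1$ depends on the hidden state $h_t$ of the previous segment, together with the word embedding $E_{s_{t+1}}$ of the current input sequence $s_{t+1}$ and the absolute positional encoding $U_{1:L}$. This clearly causes a problem: the positional encoding $U_{1:L}$ is the same for every segment, so for the inputs $x_{t,j}$ and $x_{t+1,j}$ ($j=1,\cdots,L$) the model cannot tell their position embeddings apart.
To solve this problem, Transformer-XL uses relative positional encoding.
In the vanilla Transformer, scaled dot-product attention is computed as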
$$\begin{equation}Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V\tag{4}\end{equation}$$
With absolute positional encoding, the attention score between query $q_i$ and key $k_j$ is
$$\begin{equation}\begin{aligned}A_{i,j}^{abs}&=(E^T_{x_i}+U^T_i)W^T_q((E^T_{x_j}+U^T_j)W_k^T)^T\\&=(E^T_{x_i}+U^T_i)W^T_qW_k(E_{x_j}+U_j)\\&=\underbrace{E^T_{x_i}W^T_qW_kE_{x_j}}_{(a)}+\underbrace{E^T_{x_i}W^T_qW_kU_j}_{(b)}\\&+\underbrace{U^T_iW^T_qW_kE_{x_j}}_{(c)}+\underbrace{U^T_iW^T_qW_kU_j}_{(d)}\end{aligned}\tag{5}\end{equation}$$
where $U_i$ and $U_j$ are absolute positional encodings. Eq. (5) is then modified to introduce relative positional encoding:
$$\begin{equation}\begin{aligned}A_{i,j}^{rel}&=\underbrace{E^T_{x_i}W^T_qW_{k,E}E_{x_j}}_{(a)}+\underbrace{E^T_{x_i}W^T_qW_{k,R}R_{i-j}}_{(b)}\\&+\underbrace{u^TW_{k,E}E_{x_j}}_{(c)}+\underbrace{v^TW_{k,R}R_{i-j}}_{(d)}\end{aligned}\tag{6}\end{equation}$$
There are three main changes:

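The absolute positional encoding $U_j$ in Eq. (5) is replaced by the relative positional encoding $R_{i-j}$, so the attention score depends on the relative position $i-j$. $R$ is a sinusoid encoding matrix whose parameters are not trained;
$U_i^TW_q^T$ in terms $(c)$ and $(d)$ is replaced by $u\in \mathbb{R}^d$ and $v\in \mathbb{R}^d$, respectively. The paper's explanation is: "In this case, since the query vector is the same for all query positions, it suggests that the attentive bias towards different words should remain the same regardless of the query position." That is, $U_i^TW_q^T\rightarrow R_{i-i}^TW^T_q=R^T_0W_q^T$, which is the same whatever the query position is. $u$ and $v$ are updated during training. (It is not entirely clear to me why two different vectors $u$ and $v$ are used.)
The original weight matrix $W_k$ is split into $W_{k,E}$ and $W_{k,R}$, which produce the content-based key vectors and the position-based key vectors, respectively.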
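
As a sanity check on Eq. (6), the following NumPy sketch evaluates the four terms one by one and compares them with the grouped form $(q+u)^Tk_E + (q+v)^Tk_R$, which is how terms (a)+(c) and (b)+(d) can be computed together. All sizes, the random matrices, and the helper names `score_rel` and `score_grouped` are assumptions made for illustration; `R` is random here rather than sinusoidal because only the algebra is being checked.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, M = 4, 3, 2                      # toy sizes: model dim, segment length, memory length
K = M + L                              # number of key positions (memory + segment)

E_q = rng.normal(size=(L, d))          # embeddings of the L query tokens
E_k = rng.normal(size=(K, d))          # embeddings of the K key tokens
R = rng.normal(size=(K, d))            # R[k] stands in for R_k (relative distance k)
W_q, W_kE, W_kR = (rng.normal(size=(d, d)) for _ in range(3))
u, v = rng.normal(size=d), rng.normal(size=d)

def score_rel(i, j):
    """Eq. (6), terms (a)-(d), for query row i and key row j (0-based rows)."""
    dist = (i + M) - j                 # query i sits at key position M + i
    q = W_q @ E_q[i]
    a = q @ (W_kE @ E_k[j])
    b = q @ (W_kR @ R[dist])
    c = u @ (W_kE @ E_k[j])
    d_term = v @ (W_kR @ R[dist])
    return a + b + c + d_term

def score_grouped(i, j):
    """Same score with (a)+(c) and (b)+(d) grouped."""
    dist = (i + M) - j
    q = W_q @ E_q[i]
    ac = (q + u) @ (W_kE @ E_k[j])
    bd = (q + v) @ (W_kR @ R[dist])
    return ac + bd

max_diff = max(abs(score_rel(i, j) - score_grouped(i, j))
               for i in range(L) for j in range(M + i + 1))
print(max_diff)                        # should be at floating-point noise level
```

The grouping is what makes the biases cheap: $u$ and $v$ are simply added to the query before the two matrix products.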

As Eq. (6) shows, $A_{i,j}^{rel}$ requires $W_{k,R}R_{i-j}$. Note that the relative distance $i-j$ can only be an integer from $0$ to $M+L-1$, where $M$ is the memory length and $L$ is the sequence length of one segment (segment length).
Suppose the current segment has length $L$ with input $(x_1, \cdots,x_L)$, and the memory has length $M$ with memory sequence $(x_{-(M-1)},\cdots,x_{-1}, x_{0})$. Input $x_1$ can use the history $x_{-(M-1)},\cdots,x_{-1}, x_{0},x_{1}$; input $x_2$ can use $x_{-(M-1)},\cdots,x_{-1}, x_{0},x_{1},x_{2}$; and input $x_L$ can use $x_{-(M-1)},\cdots,x_{-1}, x_{0},x_{1},x_{2},\cdots,x_{L}$. The valid pairs are therefore $\{(i,j)|i=1,\cdots,L,\ j=-(M-1),\cdots,i\}$. Computing $W_{k,R}R_{i-j}$ from term (b) of Eq. (6) for all of them and stacking the results as rows gives the matrix
$$\begin{equation}Q:=\begin{bmatrix}R_{M+L-1}^T \\ R_{M+L-2}^T \\ \vdots \\ R_1^T \\ R_0^T\end{bmatrix}W_{k,R}^T=\begin{bmatrix}[W_{k,R}R_{M+L-1}]^T \\ [W_{k,R}R_{M+L-2}]^T \\ \vdots \\ [W_{k,R}R_1]^T \\ [W_{k,R}R_0]^T\end{bmatrix}\in \mathbb{R}^{(M+L)\times d}\tag{7}\end{equation}$$
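Let $E_{x_i}^TW_q^T=q_{i-1}^T$ in term (b) of Eq. (6), and write $Q_k=W_{k,R}R_{M+L-1-k}$ for the $k$-th row of $Q$ (0-indexed, taken as a column vector). Then, over all valid $(i,j)$, term (b) can be written in matrix form: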

$$\begin{equation}\begin{aligned}B&=\begin{bmatrix}q_0^TW_{k,R}R_M & \cdots & q_0^TW_{k,R}R_0 & 0 & \cdots & 0 \\ q_1^TW_{k,R}R_{M+1} & \cdots & q_1^TW_{k,R}R_1 & q_1^TW_{k,R}R_0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{L-1}^TW_{k,R}R_{M+L-1} & \cdots & q_{L-1}^TW_{k,R}R_{L-1} & q_{L-1}^TW_{k,R}R_{L-2} & \cdots & q_{L-1}^TW_{k,R}R_{0}\end{bmatrix}\\ &=\begin{bmatrix}q_0^TQ_{L-1} & \cdots & q_0^TQ_{M+L-1} & 0 & \cdots & 0 \\ q_1^TQ_{L-2} & \cdots & q_1^TQ_{M+L-2} & q_1^TQ_{M+L-1} & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{L-1}^TQ_0 & \cdots & q_{L-1}^TQ_M & q_{L-1}^TQ_{M+1} & \cdots & q_{L-1}^TQ_{M+L-1}\end{bmatrix}\end{aligned}\tag{8}\end{equation}$$
Define $\tilde{Q}$ as the following matrix:
$$\begin{equation}\tilde{Q}=qQ^T=\begin{bmatrix} q_{0}^TQ_0 & \cdots & q_{0}^TQ_M & q_{0}^TQ_{M+1} & \cdots & q_{0}^TQ_{M+L-1} \\ q_{1}^TQ_0 & \cdots & q_{1}^TQ_M & q_{1}^TQ_{M+1} & \cdots & q_{1}^TQ_{M+L-1} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{L-1}^TQ_0 & \cdots & q_{L-1}^TQ_M & q_{L-1}^TQ_{M+1} & \cdots & q_{L-1}^TQ_{M+L-1}\end{bmatrix}\tag{9}\end{equation}$$
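It can be seen that shifting the $i$-th row of $\tilde{Q}$ ($i=1,\cdots,L$) left by $L-i$ positions, with the vacated entries on the right set to 0, gives $B$. So, to compute term (b) of Eq. (6), one can first compute $qQ^T$ to obtain $\tilde{Q}$ and then apply the left shift to obtain $B$.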
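
The relation between $B$ and $\tilde{Q}$ can be verified numerically. The sketch below builds $B$ entry by entry from Eq. (8) and again by shifting the rows of $\tilde{Q}$; the toy sizes and the array `WR` (whose row $k$ stands in for $W_{k,R}R_k$) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, d = 3, 2, 4                          # toy sizes (assumed)
q = rng.normal(size=(L, d))                # rows q_0 .. q_{L-1}
WR = rng.normal(size=(M + L, d))           # WR[k] stands in for W_{k,R} R_k
Q = WR[::-1]                               # Eq. (7): row k of Q is W_{k,R} R_{M+L-1-k}

# B from Eq. (8): entry (r, c) uses distance (M + r) - c, and is 0 when that is negative
B = np.zeros((L, M + L))
for r in range(L):
    for c in range(M + r + 1):
        B[r, c] = q[r] @ WR[M + r - c]

# The same matrix from Q_tilde = q Q^T, shifting 0-based row r left by L - 1 - r
Q_tilde = q @ Q.T
B_shift = np.zeros_like(B)
for r in range(L):
    s = L - 1 - r
    B_shift[r, :M + L - s] = Q_tilde[r, s:]

print(np.allclose(B, B_shift))
```

With 0-based rows the shift is $L-1-r$, which is the same as $L-i$ for the 1-based row index $i$ used in the text.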

Implementing the left shift

def rel_shift(x):
  x_size = tf.shape(x)
  x = tf.pad(x, [[0, 0], [1, 0], [0, 0], [0, 0]])
  x = tf.reshape(x, [x_size[1] + 1, x_size[0], x_size[2], x_size[3]])
  x = tf.slice(x, [1, 0, 0, 0], [-1, -1, -1, -1])
  x = tf.reshape(x, x_size)
  return x
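
To see why the pad / reshape / slice sequence amounts to a left shift, here is a 2-D NumPy port with the batch and head axes dropped. `rel_shift_np` is a hypothetical name and the sizes are assumptions. Only the entries that survive the attention mask (column $c \le M + r$ in 0-based row $r$) are compared, since the remaining entries hold wrapped-around values that are masked out later.

```python
import numpy as np

def rel_shift_np(x):
    qlen, klen = x.shape
    x = np.pad(x, [(0, 0), (1, 0)])        # prepend a zero column: [qlen, klen + 1]
    x = x.reshape(klen + 1, qlen)          # reinterpret the same row-major buffer
    x = x[1:]                              # drop the first row: [klen, qlen]
    return x.reshape(qlen, klen)           # back to the original shape

L, M = 3, 2                                # toy segment and memory lengths (assumed)
x = np.arange(L * (M + L), dtype=float).reshape(L, M + L)
shifted = rel_shift_np(x)

# Row r should equal row r of x shifted left by L - 1 - r on the unmasked entries
ok = all(shifted[r, c] == x[r, c + (L - 1 - r)]
         for r in range(L) for c in range(M + r + 1))
print(ok)
```

The trick avoids an explicit per-row loop: padding one column makes each row one element longer, so reinterpreting the buffer with the two axis sizes swapped offsets every row by one more position than the previous one.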

The article 【代码解析】Transformer-XL 之 Relative Positional Encodings has very good illustrations of this shift.

def rel_multihead_attn(w, r, r_w_bias, r_r_bias, attn_mask, mems, d_model,
                       n_head, d_head, dropout, dropatt, is_training,
                       kernel_initializer, scope='rel_attn'):
    scale = 1 / (d_head ** 0.5) # scale the dot products so they do not grow too large
    with tf.variable_scope(scope):
        qlen = tf.shape(w)[0]
        rlen = tf.shape(r)[0] # =k_len
        bsz = tf.shape(w)[1]

        # [m_len, bsz, emb_dim] ++ [q_len, bsz, emb_dim] -> [k_len = m_len + q_len, bsz, emb_dim]
        cat = tf.concat([mems, w],
                        0) if mems is not None and mems.shape.ndims > 1 else w
        # [k_len, bsz, 3*n_head*d_head]
        w_heads = tf.layers.dense(cat, 3 * n_head * d_head, use_bias=False,
                                  kernel_initializer=kernel_initializer, name='qkv')
        # r_head_k: [rlen, bsz, n_head*d_head]
        r_head_k = tf.layers.dense(r, n_head * d_head, use_bias=False,
                                   kernel_initializer=kernel_initializer, name='r')

        w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, -1)
        w_head_q = w_head_q[-qlen:] # only the last qlen positions (the current segment) act as queries

        klen = tf.shape(w_head_k)[0]

        w_head_q = tf.reshape(w_head_q, [qlen, bsz, n_head, d_head])
        w_head_k = tf.reshape(w_head_k, [klen, bsz, n_head, d_head])
        w_head_v = tf.reshape(w_head_v, [klen, bsz, n_head, d_head])

        r_head_k = tf.reshape(r_head_k, [rlen, n_head, d_head])

        rw_head_q = w_head_q + r_w_bias # E_x W_q + u  [qlen, bsz, n_head, d_head]
        rr_head_q = w_head_q + r_r_bias # E_x W_q + v

        # AC [qlen, klen, bsz, n_head]
        AC = tf.einsum('ibnd,jbnd->ijbn', rw_head_q, w_head_k) # term a + term c
        # [qlen, rlen, bsz, n_head]
        BD = tf.einsum('ibnd,jnd->ijbn', rr_head_q, r_head_k) # term b + term d
        BD = rel_shift(BD)

        attn_score = (AC + BD) * scale
        attn_mask_t = attn_mask[:, :, None, None] # the attention mask keeps each query from attending to positions after it
        attn_score = attn_score * (1 - attn_mask_t) - 1e30 * attn_mask_t

        attn_prob = tf.nn.softmax(attn_score, 1) # [qlen, klen, bsz, n_head]
        attn_prob = tf.layers.dropout(attn_prob, dropatt, training=is_training)

        # [qlen, klen, bsz, n_head] [klen, bsz, n_head, d_head] -> [qlen, bsz, n_head, d_head]
        attn_vec = tf.einsum('ijbn,jbnd->ibnd', attn_prob, w_head_v)
        size_t = tf.shape(attn_vec)
        # concatenate the heads
        attn_vec = tf.reshape(attn_vec, [size_t[0], size_t[1], n_head * d_head])

        # project back to the input dimension so the residual can be added
        attn_out = tf.layers.dense(attn_vec, d_model, use_bias=False,
                                   kernel_initializer=kernel_initializer, name='o')
        attn_out = tf.layers.dropout(attn_out, dropout, training=is_training)

        output = tf.contrib.layers.layer_norm(attn_out + w, begin_norm_axis=-1)
    return output

To ensure that the hidden state at position $j$ ($j=1,\cdots,q\_len$) of the segment at time $t+1$ does not use information from positions after $j$, the attention scores have to be masked:

attn_score = (AC + BD) * scale
attn_mask_t = attn_mask[:, :, None, None] # attention mask
attn_score = attn_score * (1 - attn_mask_t) - 1e30 * attn_mask_t

where attn_mask is computed as follows:

def _create_mask(qlen, mlen, same_length=False):
  attn_mask = tf.ones([qlen, qlen])
  mask_u = tf.matrix_band_part(attn_mask, 0, -1)
  mask_dia = tf.matrix_band_part(attn_mask, 0, 0)
  attn_mask_pad = tf.zeros([qlen, mlen])
  ret = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)
  if same_length:
    mask_l = tf.matrix_band_part(attn_mask, -1, 0)
    ret = tf.concat([ret[:, :qlen] + mask_l - mask_dia, ret[:, qlen:]], 1)
  return ret

The attention mask takes one of two forms:

If the mask is as in figure (a), then position $j$ ($j=1,\cdots,q\_len$) of the segment at time $t+1$ uses, when computing its attention scores, the $(m\_len+j)$ positions up to and including itself. This corresponds to same_length being false in the code;
If the mask is as in figure (b), then position $j$ ($j=1,\cdots,q\_len$) of the segment at time $t+1$ uses only the $(m\_len+1)$ positions up to and including itself, the same number for every $j$. This corresponds to same_length being true in the code.

Note that when same_length is true, the mask is built with ret=tf.concat([ret[:, :qlen] + mask_l - mask_dia, ret[:, qlen:]], 1) so that both the case m_len>=q_len (figure (b)) and the case m_len<q_len (figure (c)) are handled.
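
The two mask forms can be reproduced in NumPy. This is a sketch that mirrors `_create_mask`, using `np.triu`, `np.tril` and `np.eye` in place of `tf.matrix_band_part`; the name `create_mask_np` and the toy sizes are assumptions. It also applies the mask to all-zero scores the same way the attention code does, to show that masked positions get zero probability.

```python
import numpy as np

def create_mask_np(qlen, mlen, same_length=False):
    ones = np.ones((qlen, qlen))
    mask_u = np.triu(ones)                 # like tf.matrix_band_part(ones, 0, -1)
    mask_dia = np.eye(qlen)                # like tf.matrix_band_part(ones, 0, 0)
    ret = np.concatenate([np.zeros((qlen, mlen)), mask_u - mask_dia], axis=1)
    if same_length:
        mask_l = np.tril(ones)             # like tf.matrix_band_part(ones, -1, 0)
        ret = np.concatenate([ret[:, :qlen] + mask_l - mask_dia, ret[:, qlen:]], axis=1)
    return ret                             # 1 = masked, 0 = visible

qlen, mlen = 3, 2                          # toy sizes with m_len < q_len (assumed)
mask_a = create_mask_np(qlen, mlen)
mask_b = create_mask_np(qlen, mlen, same_length=True)
visible_a = (mask_a == 0).sum(axis=1)      # m_len + j visible positions in row j
visible_b = (mask_b == 0).sum(axis=1)      # m_len + 1 visible positions in every row

# Apply mask_a to all-zero scores, then softmax over the key axis
scores = np.zeros((qlen, mlen + qlen))
masked = scores * (1 - mask_a) - 1e30 * mask_a
prob = np.exp(masked - masked.max(axis=1, keepdims=True))
prob /= prob.sum(axis=1, keepdims=True)
print(visible_a, visible_b)
```

Subtracting `1e30` rather than multiplying by zero matters because the mask is applied before the softmax: a score of 0 would still receive a positive probability.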

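The traditional approach represents every word with a vector of the same length. When the vocabulary is large, this requires a lot of storage. In addition, words differ in importance and in how rich their meaning is. For some simple words, a fairly short vector may be enough to represent their semantics well, whereas words that carry richer meaning or that are more important (for example high-frequency words, for which polysemy is more pronounced) can be represented with longer vectors.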

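In a large corpus, a fairly small number of high-frequency words usually covers most sentences. During training, therefore, high-frequency words are updated often, while low-frequency words are updated only rarely. Based on this, when updating word embeddings we would like low-frequency words to use fewer resources and high-frequency words to get more (resources here means dimensions). The paper "Adaptive input representations for neural language modeling" builds on this idea and proposes the `adaptive representation` algorithm. Adaptive representation sorts the words of the vocabulary by frequency from high to low and splits them into several sets, so that sets with a smaller index contain more frequent words. Words within a set share the same dimension, and the dimension differs between sets: the larger the set index, the smaller the dimension. The $n$-th set has dimension $\frac{d}{k^n}$ ($k = 4$), where $n$ is the set index (starting from 0) and $d$ is the original dimension.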
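
A short arithmetic sketch of the dimension rule follows. The vocabulary size and cutoffs are hypothetical values chosen for illustration, not settings taken from the paper or the code, and the projection matrices are left out of the count.

```python
d, k = 1024, 4                   # original dimension and reduction factor
n_token = 100000                 # hypothetical vocabulary size
cutoffs = [20000, 60000]         # hypothetical set boundaries
cutoff_ends = [0] + cutoffs + [n_token]

# Dimension d / k^n for set n, and the number of words in each set
dims = [d // (k ** n) for n in range(len(cutoff_ends) - 1)]
sizes = [cutoff_ends[n + 1] - cutoff_ends[n] for n in range(len(dims))]

adaptive_params = sum(s * e for s, e in zip(sizes, dims))   # embedding tables only
full_params = n_token * d                                   # one table at full width
print(dims, sizes, adaptive_params, full_params)
```

Under these assumed numbers the three tables hold 33,280,000 parameters, against 102,400,000 for a single full-width table.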

Transformer-XL also uses the adaptive representation algorithm.

def mask_adaptive_embedding_lookup(x, n_token, d_embed, d_proj, cutoffs, initializer,
                                   proj_initializer, div_val=1,
                                   proj_same_dim=True,
                                   scope='adaptive_embed', **kwargs):
    emb_scale = d_proj ** 0.5
    with tf.variable_scope(scope):
        if div_val == 1:
          lookup_table = tf.get_variable('lookup_table', [n_token, d_embed],
                                         initializer=initializer)
          y = embedding_lookup(lookup_table, x, use_tpu=False)
          if d_proj != d_embed: # the hidden-state dimension differs from the embedding dimension
            proj_W = tf.get_variable('proj_W', [d_embed, d_proj],
                                     initializer=proj_initializer)
            y = tf.einsum('ibe,ed->ibd', y, proj_W)
          else:
            proj_W = None
          ret_params = [lookup_table, proj_W]
        else:
          tables, projs = [], []
          cutoff_ends = [0] + cutoffs + [n_token]
          x_size = tf.shape(x)
          y = tf.zeros([x_size[0], x_size[1], d_proj]) # [len, bsz, d_proj]
          for i in range(len(cutoff_ends) - 1):
            with tf.variable_scope('cutoff_{}'.format(i)):
              l_idx, r_idx = cutoff_ends[i], cutoff_ends[i + 1]
              mask = (x >= l_idx) & (x < r_idx) # element-wise AND
              cur_x = tf.boolean_mask(x, mask) - l_idx # re-index from 0 within this set
              cur_d_embed = d_embed // (div_val ** i)
              # note: each lookup_table lives in its own cutoff_{i} variable scope, so the tables are distinct
              lookup_table = tf.get_variable('lookup_table',
                                             [r_idx - l_idx, cur_d_embed],
                                             initializer=initializer)
              cur_y = embedding_lookup(lookup_table, cur_x, use_tpu=False) 
              if d_proj == cur_d_embed and not proj_same_dim:
                proj_W = None
              else:
                proj_W = tf.get_variable('proj_W', [cur_d_embed, d_proj],
                                         initializer=proj_initializer)
                cur_y = tf.einsum('id,de->ie', cur_y, proj_W)
              mask_idx = tf.to_int64(tf.where(mask)) # coordinates of the positions where mask is true
              y += tf.scatter_nd(mask_idx, cur_y, tf.to_int64(tf.shape(y)))
              tables.append(lookup_table)
              projs.append(proj_W)
          ret_params = [tables, projs]
        y *= emb_scale # [seq_len, bsz, d_proj]
    return y, ret_params

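Here cutoff_ends holds the split points between the sets: the indices in the $i$-th set are greater than or equal to cutoff_ends[i] and less than cutoff_ends[i+1]. The embedding dimension of the $i$-th set is `cur_d_embed = d_embed // (div_val ** i)`, and its embedding lookup table is `lookup_table = tf.get_variable('lookup_table', [r_idx - l_idx, cur_d_embed], initializer=initializer)`. The embeddings of the len words in a segment therefore have different lengths, so a projection layer is needed to map them to the same dimension: `proj_W = tf.get_variable('proj_W', [cur_d_embed, d_proj], initializer=proj_initializer)` followed by `cur_y = tf.einsum('id,de->ie', cur_y, proj_W)`.
Given an input x of shape [len, bsz], where len is the length of a segment and bsz is the batch size, the words belonging to the $i$-th set are found as follows. First, the element-wise operation `mask = (x >= l_idx) & (x < r_idx)` gives a mask of shape [len, bsz]; an element is true if the word at that position is in the $i$-th set and false otherwise. Next, `cur_x = tf.boolean_mask(x, mask) - l_idx` gives a one-dimensional sequence that is convenient to feed to the embedding lookup. With n sets in total, this yields n groups of embeddings of different sizes, which are combined with tf.scatter_nd, whose job is to "Scatter updates into a new tensor according to indices". A simplified example: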

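    import numpy as np
    import tensorflow as tf

    tf.enable_eager_execution()  # assumption: TensorFlow 1.x; the outputs below are eager tensors

    d_embed = 4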
    len = 3
    bsz = 2
    x = np.arange(1, len*bsz+1).reshape(len, bsz)
    l_idx = 2
    r_idx = 5
    mask = (x >= l_idx) & (x < r_idx)
    cur_x = tf.boolean_mask(x, mask) - l_idx
    lookup_table = tf.get_variable("lookup_table", shape=[r_idx-l_idx, d_embed], initializer=tf.random_normal_initializer)
    cur_y = tf.nn.embedding_lookup(lookup_table, cur_x)
    y_shape = tf.zeros(shape=[len, bsz, d_embed])
    y = tf.scatter_nd(indices=tf.where(mask), updates=cur_y, shape=tf.to_int64(tf.shape(y_shape)))
    print("x:", x)
    print("mask:", mask)
    print("cur_x:", cur_x)
    print("cur_y:", cur_y)
    print("y:", y)
Output:
x:  [[1 2]
 [3 4]
 [5 6]]
mask:  [[False  True]
 [True  True]
 [False False]]
cur_x:  tf.Tensor([0 1 2], shape=(3,), dtype=int32)
cur_y: tf.Tensor([[-2.7328973  -0.01377826 -0.78023756 -1.1186032]
 [-1.7653402  -0.8459847  -0.3368531  -0.27648798]
 [0.2573444   1.1644957   1.0869092   1.3614684]], shape=(3, 4), dtype=float32)
y:  tf.Tensor([[[ 0.          0.          0.          0.]
  [-2.7328973  -0.01377826 -0.78023756 -1.1186032]]
 [[-1.7653402  -0.8459847  -0.3368531  -0.27648798]
  [0.2573444   1.1644957   1.0869092   1.3614684]]
 [[0.          0.          0.          0.]
  [0.          0.          0.          0.]]], shape=(3, 2, 4), dtype=float32)
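
Putting the steps together, the whole lookup can be sketched in NumPy, where boolean-mask assignment plays the role of tf.scatter_nd. This is an illustrative sketch, not the original implementation: `adaptive_embedding_np`, the toy cutoffs and the random tables are assumptions, and every set is projected (as with proj_same_dim=True).

```python
import numpy as np

def adaptive_embedding_np(x, tables, projs, cutoff_ends, d_proj):
    """x: int array [len, bsz]; tables[i]: [set_size_i, d_i]; projs[i]: [d_i, d_proj]."""
    y = np.zeros(x.shape + (d_proj,))
    for i in range(len(cutoff_ends) - 1):
        l_idx, r_idx = cutoff_ends[i], cutoff_ends[i + 1]
        mask = (x >= l_idx) & (x < r_idx)      # tokens that fall in set i
        cur_x = x[mask] - l_idx                # 1-D ids, re-indexed from 0
        cur_y = tables[i][cur_x] @ projs[i]    # lookup, then project to d_proj
        y[mask] = cur_y                        # scatter back to the original positions
    return y * d_proj ** 0.5                   # same scaling as emb_scale

rng = np.random.default_rng(0)
cutoff_ends, d_embed, div_val, d_proj = [0, 4, 10], 8, 4, 8   # toy values (assumed)
tables = [rng.normal(size=(r - l, d_embed // div_val ** i))
          for i, (l, r) in enumerate(zip(cutoff_ends[:-1], cutoff_ends[1:]))]
projs = [rng.normal(size=(t.shape[1], d_proj)) for t in tables]

x = np.array([[0, 5], [9, 3], [4, 1]])         # token ids, shape [len=3, bsz=2]
y = adaptive_embedding_np(x, tables, projs, cutoff_ends, d_proj)
print(y.shape)
```

Here the first set stores 8-dimensional vectors and the second only 2-dimensional ones, yet every position of `y` ends up with `d_proj` dimensions after the projection.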

https://www.jianshu.com/p/c06…
https://zhuanlan.zhihu.com/p/…
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
