关于机器学习:条件随机场

条件随机场（conditional random field，CRF）是给定一组输出随机变量条件下另一组输入随机变量的条件概率分布模型，并且假如输入随机变量形成马尔可夫随机场（概率无向图模型）。
条件随机场定义：设$X$和$Y$是随机变量，$P(Y|X)$是在给定$X$的条件下$Y$的条件概率分布。若随机变量$Y$形成一个由无向图$G=(V,E)$示意的马尔可夫随机场，即式(1)
$$\begin{equation}P(Y_v|X,Y_w,w\neq v) = P(Y_v|X,Y_w,w\sim v)\tag{1}\end{equation}$$
对任意结点$v$成立，则条件概率分布$P(Y|X)$为条件随机场。式(1)中，$w \neq v$示意结点$v$以外的所有结点，$w\sim v$示意在图$G=(V,E)$中与结点$v$有边连贯的所有结点$w$，$Y_v$，$Y_u$和$Y_w$为结点$v$，$u$和$w$对应的随机变量。

线性链条件随机场定义：设$X=(X_1,X_2,\cdots,X_n)$，$Y=(Y_1,Y_2,\cdots,Y_n)$均为线性链示意的随机变量序列，若在给定随机变量序列$X$的条件下，随机变量序列$Y$的条件概率分布形成条件随机场，即满足式(2)，则称$P(Y|X)$为线性链条件随机场。
$$\begin{equation}P(Y_i|X,Y_1,\cdots,Y_{i-1},Y_{i+1},\cdots,Y_n)=P(Y_i|X,Y_{i-1},Y_{i+1})\tag{2}\end{equation}$$
其中，$i=1,\cdots,n$，当$i=1$或者$i=n$时，只思考单边。

团与最大团定义：无向图中任意两个结点之间均有边连贯的结点子集称为团(clique)。若$C$是无向图$G$的一个团，并且不能再退出任何一个$G$中的结点使其称为一个更大的团，则称$C$为最大团(maximul clique)。
将概率无向图的联结概率分布示意为其最大团上的随机变量的函数的乘积模式的操作，称为概率无向图模型的因子合成(factorization)。设无向图为$G$，$C$为$G$上的最大团，$Y_C$示意$C$对应的随机变量，则概率无向图模型的联结概率分布$P(Y)$能够写作所有最大团$C$上的函数$\Psi_C(Y_C)$的乘积模式，即
$$\begin{equation}P(Y)=\frac{1}{Z}\Pi_C \Psi_C(Y_C)\tag{3}\end{equation}$$
其中，$Z$为规范化因子，$Z=\sum_{Y}\Pi_C\Psi_C(Y_C)$，$\Psi_C(Y_C)$称为势函数(potential function)，要求是严格正的，通常定义为指数函数，即$\Psi_C(Y_C)=exp(-E(Y_C))$。
依据概率无向图的因子合成，得出线性链条件随机场$P(Y|X)$的因子合成式，当无向图为线性链模式是，最大团的大小为2，因而因子合成式中的因子是定义在相邻两个结点上的函数。

原始模式

设$P(Y|X)$为线性链条件随机场，则在随机变量$X$的取值为x的条件下，随机变量$Y$取值为$y$的条件概率如下，
$$\begin{equation}P(y|x)=\frac{1}{Z}\Pi_{t=1}^{T}\psi(y_{t-1},y_t,x)\tag{4}\end{equation}$$
其中$x=(x_1,\cdots,x_n)$为输出的观测序列，状态序列（输入序列）$y=(y_1,\cdots, y_T)$，$\psi(y_{t-1},y_t,x)\geq 0$为势函数。对势函数进行拆分，假设$\psi(y_{t-1},y_t,x)=\exp(\Delta_{y_{t-1},y_t,x} + \Delta_{y_t,x})$，令$\Delta_{y_{t-1},y_t,x}=\sum_{k}\lambda_k f_{k}^{transition}(y_{t-1},y_t,x,t)$，令$\Delta_{y_t,x}=\sum_{l}\mu_lf_l^{state}(y_t,x,t)$，则$P(y|x)$能够写为，
$$\begin{equation}P(y|x)=\frac{1}{Z(x)}\exp(\sum_{t,k}\lambda_k f_{k}^{transition}(y_{t-1},y_t,x,t)+\sum_{t,l}\mu_lf_l^{state}(y_t,x,t))\tag{5}\end{equation}$$
其中，$Z(x)$为规范化因子，$Z(x)=\sum_{y}\exp(\sum_{t,k}\lambda_k f_{k}^{transition}(y_{t-1},y_t,x,t)+\sum_{t,l}\mu_lf_l^{state}(y_t,x,t))$，$f_{k}^{transition}$为转移特征函数，$f_{l}^{state}$为状态特征函数。在传统的做法中，状态特征函数和转移特征函数是通过人工特色工程的形式设计的，它们的取值为1或0，即满足特色条件时取值为1，否则为0。

内积模式

对式(5)取对数，
$$\begin{equation}\begin{aligned}\log P(y|x)&=\sum_{t,k}\lambda_k f_{k}^{transition}(y_{t-1},y_t,x,t)+\sum_{t,l}\mu_lf_l^{state}(y_t,x,t)-\log Z(x)\\&=\sum_{t}(\sum_{i=1}^{K+L}w_i f_i(y_{t-1},y_t,x, t))-\log Z(x)\end{aligned}\tag{6}\end{equation}$$
其中，为了简便，对转移特色和状态特色用对立的符号示意，

$$\begin{equation}f_i(y_{t-1},y_t,x,t) = \begin{cases} f^{transition}_i(y_{t-1},y_t,x,t),i=1,\cdots,K \\ f^{state}_l(y_t,x,t), i=K+l;l=1,\cdots,L\end{cases}\tag{7}\end{equation} $$

$$\begin{equation}w_i = \begin{cases} \lambda_i,i=1,\cdots,K \\ \mu_l, i=K+l;l=1,\cdots,L\end{cases}\tag{8}\end{equation} $$

令$f_i(y,x)=\sum_{t=1}^{T}f_i(y_{t-1},y_t,x,t)$，$F(y,x)=(f_1(y,x),\cdots,f_{K+L}(y,x))^T$，$w=(w_1,\cdots,w_{K+L})^T$。
则条件随机场的对数模式$\log P(y|x)$能够写为$w$和$F$内积的模式
$$\begin{equation}\log P(y|x)=w\cdot F(y,x) - \log Z(x)\tag{9}\end{equation}$$

引入神经网络

式(4)的对数模式为，
$$\begin{equation}\log P(y|x)=\sum_{t=1}^T\log(\psi(y_{t-1},y_t,x_t)) - \log Z(x)\tag{10}\end{equation}$$
其中，$\psi(y_{t-1},y_t,x)$能够了解为依据$y_{t-1},y_t,x$提取的特色的加权和，这一部分能够应用神经网络代替人工特色工程。
引入神经网络示意转移特征函数$f^{transition}$和状态特征函数$f^{state}$，能够将条件随机场的对数模式简化为，
$$\begin{equation}\log P(y|x)=\sum_{t}f^{transition}(y_{t-1},y_t,x)+\sum_{t}f^{state}(y_t,x)-\log Z(x)\tag{11}\end{equation}$$
思考到以后深度学习模型中，RNN或者层叠CNN或是Bert等模型曾经可能使得$f^{state}$充沛地捕获各个y与输入x的分割，因而，无妨思考转移特征函数$f^{transition}$与x无关，那么能够进一步简化为，
$$\begin{equation}\log P(y|x)=\underbrace{\sum_{t}f^{transition}(y_{t-1},y_t)}_{(a)}+\underbrace{\sum_{t}f^{state}(y_t,x)}_{(b)}-\underbrace{\log Z(x)}_{(c)}\tag{12}\end{equation}$$
对上式取负，应用极大似然法，求解参数。
因而，损失函数的计算分为三局部，别离对应式中的(a)、(b)和(c)。

一元概率

def crf_unary_score(tag_indices, sequence_lengths, inputs):  """Computes the unary scores of tag sequences.  Args:    tag_indices: A [batch_size, max_seq_len] matrix of tag indices.    sequence_lengths: A [batch_size] vector of true sequence lengths.    inputs: A [batch_size, max_seq_len, num_tags] tensor of unary potentials.  Returns:    unary_scores: A [batch_size] vector of unary scores.  """  batch_size = array_ops.shape(inputs)[0]  max_seq_len = array_ops.shape(inputs)[1]  num_tags = array_ops.shape(inputs)[2]  flattened_inputs = array_ops.reshape(inputs, [-1])  offsets = array_ops.expand_dims(      math_ops.range(batch_size) * max_seq_len * num_tags, 1)  offsets += array_ops.expand_dims(math_ops.range(max_seq_len) * num_tags, 0)  # Use int32 or int64 based on tag_indices' dtype.  if tag_indices.dtype == dtypes.int64:    offsets = math_ops.cast(offsets, dtypes.int64)  flattened_tag_indices = array_ops.reshape(offsets + tag_indices, [-1])  unary_scores = array_ops.reshape(      array_ops.gather(flattened_inputs, flattened_tag_indices),      [batch_size, max_seq_len])  masks = array_ops.sequence_mask(sequence_lengths,                                  maxlen=array_ops.shape(tag_indices)[1],                                  dtype=dtypes.float32)  unary_scores = math_ops.reduce_sum(unary_scores * masks, 1)  return unary_scores

crf_unary_score计算一元概率，输入一个维度为[batch_size]的向量，向量中的每一个元素是一个sequence中所有的实在标签概率之和。对应式中的(b)项。输出inputs的维度为[batch_size, max_seq_len, num_tags]，个别为BiLSTM或者是Bert的输入，num_tags为标签数，inputsi[k]示意的是，以后batch中的第i个序列，其地位j对应的输入属于类别k的概率。flattened_inputs将input变为一维， tag_indices为实在的标签，维度为[batch_size, max_seq_len]，offsets的维度为[batch_size, max_seq_len], offsetsi = (batch_size i + max_seq_len j) * num_tags，因而，offsets + tag_indices中(i，j)地位的值示意的是，以后batch中的第i个序列，其地位j对应的输入属于类别tag_indicesi的概率位于flatten_inputs，拉平后失去 flattened_tag_indices，通过gather函数，失去所有的一元概率。最初，应用sequence_mask依据序列的真是长度，生成mask，失去所有无效的一元概率之和unary_scores。

二元概率

def crf_binary_score(tag_indices, sequence_lengths, transition_params):  """Computes the binary scores of tag sequences.  Args:    tag_indices: A [batch_size, max_seq_len] matrix of tag indices.    sequence_lengths: A [batch_size] vector of true sequence lengths.    transition_params: A [num_tags, num_tags] matrix of binary potentials.  Returns:    binary_scores: A [batch_size] vector of binary scores.  """  # Get shape information.  num_tags = transition_params.get_shape()[0]  num_transitions = array_ops.shape(tag_indices)[1] - 1  # Truncate by one on each side of the sequence to get the start and end  # indices of each transition.  start_tag_indices = array_ops.slice(tag_indices, [0, 0],                                      [-1, num_transitions])  end_tag_indices = array_ops.slice(tag_indices, [0, 1], [-1, num_transitions])  # Encode the indices in a flattened representation.  flattened_transition_indices = start_tag_indices * num_tags + end_tag_indices  flattened_transition_params = array_ops.reshape(transition_params, [-1])  # Get the binary scores based on the flattened representation.  binary_scores = array_ops.gather(flattened_transition_params,                                   flattened_transition_indices)  masks = array_ops.sequence_mask(sequence_lengths,                                  maxlen=array_ops.shape(tag_indices)[1],                                  dtype=dtypes.float32)  truncated_masks = array_ops.slice(masks, [0, 1], [-1, -1])  binary_scores = math_ops.reduce_sum(binary_scores * truncated_masks, 1)  return binary_scores

crf_binary_score计算二元概率，输入维度为[batch_size]，向量中的每个元素是一个sequence中所有的转移概率之和。对应式中的(a)项。序列的长度为num_transitions，则会产生num_transitions-1此转移，转移的起始下标start_tag_indices对应0，$\cdots$，num_transitions-2，完结下标end_tag_indices对应1，$\cdots$，num_transitions-1。应用与crf_unary_score相似的原理取出对应地位的转移概率，进行mask操作求和，返回binary_scores。

log normalization factor

  def _multi_seq_fn():    """Forward computation of alpha values."""    rest_of_input = array_ops.slice(inputs, [0, 1, 0], [-1, -1, -1])    # Compute the alpha values in the forward algorithm in order to get the    # partition function.    forward_cell = CrfForwardRnnCell(transition_params)    # Sequence length is not allowed to be less than zero.    sequence_lengths_less_one = math_ops.maximum(        constant_op.constant(0, dtype=sequence_lengths.dtype),        sequence_lengths - 1)    _, alphas = rnn.dynamic_rnn(        cell=forward_cell,        inputs=rest_of_input,        sequence_length=sequence_lengths_less_one,        initial_state=first_input,        dtype=dtypes.float32)    log_norm = math_ops.reduce_logsumexp(alphas, [1])    # Mask `log_norm` of the sequences with length <= zero.    log_norm = array_ops.where(math_ops.less_equal(sequence_lengths, 0),                               array_ops.zeros_like(log_norm),                               log_norm)    return log_norm

normalization factor $Z$的计算形式，
$$\begin{equation}Z=\sum_y \Pi_{t=1}^T\psi(y_{t-1},y_t,x)\tag{13}\end{equation}$$
应用前向算法计算$Z$，令
$$\begin{equation}\alpha_j(y_j=s)=\sum_{y1,\cdots,y_{j-1}}(\Pi_{t=1}^{j-1}\psi(y_{t-1},y_t,x))\psi(y_{j-1},y_j=s,x)\tag{14}\end{equation}$$
则有
$$\begin{equation}\begin{aligned}\alpha_{j+1}(y_{j+1}=s^{'})&=\sum_{y_1,\cdots,y_j}(\Pi_{t=1}^{j}\psi(y_{t-1},y_t,x))\psi(y_{j},y_{j+1}=s^{'},x)\\&=\sum_{s=1}^{N}(\sum_{y1,\cdots,y_{j-1}}(\Pi_{t=1}^{j-1}\psi(y_{t-1},y_t,x))\psi(y_{j-1},y_j=s,x))\psi(y_{j}=s,y_{j+1}=s^{'},x)\\&=\sum_{s=1}^N\alpha_j(y_j=s)\psi(y_{j}=s,y_{j+1}=s^{'},x)\\&=\sum_{s=1}^N\alpha_j(y_j=s)\exp(f^{transition}(y_j=s,y_{j+1}=s^{'})+\underbrace{f^{state}(y_{t+1},x)}_{与y_j无关})\\&=\sum_{s=1}^N\underbrace{\alpha_j(y_j=s)}_{(a)}\underbrace{\exp(f^{transition}(y_j=s,y_{j+1}=s^{'}))}_{(b)}\underbrace{\exp(f^{state}(y_{t+1},x))}_{(c)}\end{aligned}\tag{15}\end{equation}$$

class CrfForwardRnnCell(rnn_cell.RNNCell):  """Computes the alpha values in a linear-chain CRF.  See http://www.cs.columbia.edu/~mcollins/fb.pdf for reference.  """  def __init__(self, transition_params):    """Initialize the CrfForwardRnnCell.    Args:      transition_params: A [num_tags, num_tags] matrix of binary potentials.          This matrix is expanded into a [1, num_tags, num_tags] in preparation          for the broadcast summation occurring within the cell.    """    self._transition_params = array_ops.expand_dims(transition_params, 0)    self._num_tags = tensor_shape.dimension_value(transition_params.shape[0])  @property  def state_size(self):    return self._num_tags  @property  def output_size(self):    return self._num_tags  def __call__(self, inputs, state, scope=None):    """Build the CrfForwardRnnCell.    Args:      inputs: A [batch_size, num_tags] matrix of unary potentials.      state: A [batch_size, num_tags] matrix containing the previous alpha          values.      scope: Unused variable scope of this cell.    Returns:      new_alphas, new_alphas: A pair of [batch_size, num_tags] matrices          values containing the new alpha values.    """    state = array_ops.expand_dims(state, 2)    # This addition op broadcasts self._transitions_params along the zeroth    # dimension and state along the second dimension. This performs the    # multiplication of previous alpha values and the current binary potentials    # in log space.    transition_scores = state + self._transition_params    new_alphas = inputs + math_ops.reduce_logsumexp(transition_scores, [1])    # Both the state and the output of this RNN cell contain the alphas values.    # The output value is currently unused and simply satisfies the RNN API.    # This could be useful in the future if we need to compute marginal    # probabilities, which would require the accumulated alpha values at every    # time step.    return new_alphas, new_alphas

CrfForwardRnnCell(rnn_cell.RNNCell)基于RNN的架构，在工夫步t接管一个输出$x_t$，输入$o_t$，生成新的hidden state $h_t$。
其中，$x_t$为$f^{state}(y_t,x)$，$h_t$即为前向算法中的$\alpha_t=(\alpha_t(y_t=1),\cdots,\alpha_t(y_t=N))^T$，如图所示，在更新$\alpha$时，应用的是加法而非乘法，这是因为，所有的计算都是在log空间中计算的，对应式(15)中的$\log(a) + \log(b) = log(ab)$，所以transition_scores = state + self._transition_params计算失去的矩阵，其第i行第j列的元素是$\log(\alpha_{t-1}(y_{t-1}=i)\times \exp(f^{transition}(y_{t-1}=i,y_t=j)))$，math_ops.reduce_logsumexp(transition_scores, [1])先$\exp$再求和再取对数，失去$\log(\alpha_{t-1}^T F(y_{t-1},y_t))$，再加上inputs，理论是$\log(\sum ab)+ \log (c)$，即$\log((\sum ab)c)$，因而失去的新的$\alpha_{t+1}$（hidden state）后果也是在log空间中的，不便下一个工夫步的计算。
inputs的维度是[batch_size, num_tags]，state的维度为[batch_size,num_tags]，transition_params的维度为[num_tags,num_tags]。

预测

条件随机场的预测应用维特比算法

def viterbi_decode(score, transition_params):  """Decode the highest scoring sequence of tags outside of TensorFlow.  This should only be used at test time.  Args:    score: A [seq_len, num_tags] matrix of unary potentials.    transition_params: A [num_tags, num_tags] matrix of binary potentials.  Returns:    viterbi: A [seq_len] list of integers containing the highest scoring tag        indices.    viterbi_score: A float containing the score for the Viterbi sequence.  """  trellis = np.zeros_like(score)  backpointers = np.zeros_like(score, dtype=np.int32)  trellis[0] = score[0]  for t in range(1, score.shape[0]):    v = np.expand_dims(trellis[t - 1], 1) + transition_params    trellis[t] = score[t] + np.max(v, 0)    backpointers[t] = np.argmax(v, 0)  viterbi = [np.argmax(trellis[-1])]  for bp in reversed(backpointers[1:]):    viterbi.append(bp[viterbi[-1]])  viterbi.reverse()  viterbi_score = np.max(trellis[-1])  return viterbi, viterbi_score

其中，trellis记录非规范化概率，backpointers记录门路。

参考

http://www.cs.columbia.edu/~m...
LSTM-Based NeuroCRFs for Name Entity Recognition
统计学习办法