NLP Tutorial: How to Generate Couplets Automatically

jiezi

6 years ago

桃符早易朱红纸，杨柳轻摇翡翠群 ——FlyAI Couplets
Try the couplet demo: https://www.flyai.com/couplets

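The defining feature of a recurrent neural network is that it can take a sequence as input and produce a sequence as output, and the upper and lower lines of a couplet are both typical text sequences. So can a neural network write couplets? The answer is yes. This project trains on a couplet dataset collected from the web and uses a Seq2Seq network with an attention mechanism to produce a lower line for a given upper line.
Project workflow

Data processing
Seq2Seq + Attention model walkthrough
Model implementation
Training the neural network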

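Data processing
Building the word-vector dictionary and the bag-of-words dictionary
In the raw dataset, the characters of each couplet are separated by spaces, in the following format: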
室内崇兰映日, 林间修竹当风

翠岸青荷，琴曲潇潇情辗转, 寒山古月，风声瑟瑟意彷徨
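Because each Chinese character is treated as a single word, the raw data needs no word segmentation. Once the raw data is loaded, two dictionaries are built: one maps each character to its word vector, and the other maps each character to its index in the vocabulary (the bag-of-words dictionary). Word vectors are what the network receives as input, while the output side is a classification over vocabulary indices. Three special symbols are added to the vocabulary: '“', '”' and '~'. They mark the start of a sequence, the end of a sequence, and padding, and their indices are 1, 2 and 0 respectively.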
import json
import os

import numpy as np

# Base and DATA_PATH are defined elsewhere in the project and are not shown here.
class Processor(Base):  # Processor is the class that handles data processing
    def __init__(self):
        super(Processor, self).__init__()
        embedding_path = os.path.join(DATA_PATH, 'embedding.json')  # word-vector dictionary
        words_list_path = os.path.join(DATA_PATH, 'words.json')     # vocabulary list
        with open(embedding_path, encoding='utf-8') as f:
            self.vocab = json.loads(f.read())
        with open(words_list_path, encoding='utf-8') as f:
            word_list = json.loads(f.read())
        self.word2ix = {w: i for i, w in enumerate(word_list, start=3)}
        self.word2ix['“'] = 1  # start of sentence
        self.word2ix['”'] = 2  # end of sentence
        self.word2ix['~'] = 0  # padding
        self.ix2word = {i: w for w, i in self.word2ix.items()}
        self.max_sts_len = 40  # maximum sequence length
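To make the index layout concrete, here is a minimal sketch that builds the same two lookup tables from a tiny hand-written character list. The helper name `build_vocab` and the list `chars` are made up for illustration; the project itself loads its list from words.json.

```python
def build_vocab(chars):
    """Map characters to indices, reserving 0, 1, 2 for padding, start, end."""
    word2ix = {w: i for i, w in enumerate(chars, start=3)}
    word2ix['~'] = 0   # padding
    word2ix['“'] = 1   # start of sentence
    word2ix['”'] = 2   # end of sentence
    ix2word = {i: w for w, i in word2ix.items()}
    return word2ix, ix2word


# Hypothetical character list; the real vocabulary has thousands of entries.
chars = ['室', '内', '崇', '兰', '映', '日']
word2ix, ix2word = build_vocab(chars)
print(word2ix['室'], word2ix['日'])  # 3 8
print(len(word2ix))                  # 9
```

Starting the enumeration at 3 is what keeps ordinary characters from colliding with the three reserved indices.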
Encoding the upper line as word vectors
def input_x(self, upper):  # method of Processor; upper is the input upper line
    word_list = []
    review = ['“'] + upper.strip().split(' ') + ['”']  # prepend the start symbol, append the end symbol
    for word in review:
        embedding_vector = self.vocab.get(word)
        # Only characters found in the embedding dictionary are encoded
        if embedding_vector is not None and len(embedding_vector) == 200:
            word_list.append([float(x) for x in embedding_vector])  # convert str elements to float

    if len(word_list) >= self.max_sts_len:
        word_list = word_list[:self.max_sts_len]
        original_len = self.max_sts_len
    else:
        original_len = len(word_list)
        for _ in range(original_len, self.max_sts_len):
            word_list.append([0.0] * 200)  # zero padding; the word-vector dimension is 200
    word_list.append([original_len] * 200)  # the last row stores the true sentence length
    return np.stack(word_list)
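The layout produced by this encoding (padded rows followed by one extra row that carries the true length) is easy to misread, so here is a small sketch of the same idea with 2-dimensional toy vectors and plain lists. The function name `encode_upper` and the toy `vocab` are assumptions for illustration, and the start and end symbols are left out to keep it short.

```python
def encode_upper(chars, vocab, max_len, dim):
    """Return max_len + 1 rows: padded vectors plus one row holding the length."""
    rows = [vocab[c] for c in chars if c in vocab][:max_len]
    length = len(rows)
    rows += [[0.0] * dim for _ in range(max_len - length)]  # zero padding
    rows.append([float(length)] * dim)                      # length row
    return rows


# Toy 2-dimensional embeddings (the project uses 200 dimensions).
vocab = {'风': [0.1, 0.2], '月': [0.3, 0.4]}
rows = encode_upper(['风', '花', '月'], vocab, max_len=4, dim=2)
for row in rows:
    print(row)
# [0.1, 0.2]
# [0.3, 0.4]
# [0.0, 0.0]
# [0.0, 0.0]
# [2.0, 2.0]
```

Note that '花' is not in the toy dictionary, so it is silently dropped and the recorded length is 2 rather than 3. The project's encoder behaves the same way for characters without a word vector.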
Encoding the ground-truth lower line as vocabulary indices
def input_y(self, lower):  # method of Processor; lower is the ground-truth lower line
    word_list = [1]  # start symbol
    for word in lower:
        word_idx = self.word2ix.get(word)
        if word_idx is not None:
            word_list.append(word_idx)
    word_list.append(2)  # end symbol

    if len(word_list) >= self.max_sts_len:
        original_len = self.max_sts_len
        word_list = word_list[:self.max_sts_len]
    else:
        original_len = len(word_list)
        for _ in range(original_len, self.max_sts_len):
            word_list.append(0)  # pad with 0 up to the maximum length
    word_list.append(original_len)  # the last element stores the sentence length
    return word_list
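Going the other way, from an encoded target back to text, is useful when inspecting batches. The following is a minimal sketch under the layout described above (indices, then padding, then one trailing length element); `decode_lower` and the toy `ix2word` table are hypothetical names, not part of the project.

```python
def decode_lower(encoded, ix2word):
    """Invert the target encoding: drop the length slot, padding and markers."""
    *indices, length = encoded
    chars = []
    for ix in indices[:length]:
        if ix in (0, 1, 2):  # padding, start symbol, end symbol
            continue
        chars.append(ix2word[ix])
    return ''.join(chars)


ix2word = {3: '林', 4: '间', 5: '修', 6: '竹'}
encoded = [1, 3, 4, 5, 6, 2, 0, 0, 6]  # maximum length 8, true length 6
print(decode_lower(encoded, ix2word))  # 林间修竹
```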
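Seq2Seq + Attention model walkthrough
A Seq2Seq model can be thought of as a translator made up of an encoder and a decoder. Its structure is shown in the figure below:
[Figure: Seq2Seq encoder-decoder structure]
The Encoder and the Decoder are usually built from RNNs. To improve results, the RNN is usually an LSTM or a GRU; the RNN in the figure is an LSTM. The Encoder translates the input into an intermediate state C, and the Decoder translates that intermediate state into the output. The output at each time step is determined jointly by the Decoder's hidden state, the output at the previous time step, and the intermediate state C.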
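Attention mechanism
In early Seq2Seq models, the intermediate state C is determined only by the final hidden state. In other words, every word in the source input contributes to C in the same way regardless of which output position is being generated, which makes the output less sensitive to position. The attention mechanism was designed to make up for this weakness. With attention, the intermediate state C carries position information: each output position gets its own C, and the C for position i is given by:

$$C_i = \sum_{j=1}^{L_x} a_{ij} h_j$$

Here C_i is the intermediate state for output position i, L_x is the length of the input sequence, h_j is the Encoder's hidden output at position j, and a_ij is the weight between the i-th C and the j-th h. In this way each output position gets a different C, which amounts to paying "attention" to words at different source positions. There are many ways to compute the weights a_ij; this project produces them with a small neural network.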
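A small worked example may help with the weighted sum. The numbers below are made up, and the scores are normalised with a softmax (as the project's Attention module does) so that the weights for one output position sum to 1.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, hidden_states):
    """C_i = sum_j a_ij * h_j, computed one coordinate at a time."""
    dim = len(hidden_states[0])
    return [sum(a * h[k] for a, h in zip(weights, hidden_states))
            for k in range(dim)]


# Three encoder positions (L_x = 3) with hidden size 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, math.log(2.0)]  # raw scores for one output position i
a = softmax(scores)
c = context_vector(a, h)
print([round(w, 2) for w in a])  # [0.25, 0.25, 0.5]
print([round(x, 2) for x in c])  # [0.75, 0.75]
```

The third position has the largest score, so it contributes half of the context vector.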
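Model implementation
Encoder
The Encoder has a very simple structure: it is a plain RNN unit (a GRU in the code). Because the input data in this project is already encoded as word vectors, there is no need to embed the input with nn.Embedding().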
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, num_layers=2, dropout=0.2):
        super().__init__()
        self.embedding_dim = embedding_dim  # word-vector dimension, 200 in this project
        self.hidden_dim = hidden_dim        # RNN hidden size
        self.num_layers = num_layers        # number of RNN layers
        self.rnn = nn.GRU(embedding_dim, hidden_dim,
                          num_layers=num_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)  # dropout layer applied to the inputs

    def forward(self, input_seqs, input_lengths, hidden=None):
        # input_seqs = [sent len, batch size, emb dim] (already word vectors)
        embedded = self.dropout(input_seqs)
        # Pack the batch so the RNN processes each sentence at its true length
        # rather than its padded length; lengths must be in decreasing order.
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        outputs, hidden = self.rnn(packed, hidden)
        outputs, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(outputs)
        # outputs = [sent len, batch size, hid dim], always from the last layer
        # hidden = [n layers, batch size, hid dim]
        return outputs, hidden
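The packing step is the least obvious part of the Encoder, so here is a minimal standalone sketch of what pack_padded_sequence and pad_packed_sequence do around a GRU. The sizes are arbitrary toy values chosen for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
rnn = nn.GRU(input_size=4, hidden_size=3, num_layers=1)

# Two sequences padded to length 5; their true lengths are 5 and 2.
x = torch.randn(5, 2, 4)   # [sent len, batch size, emb dim]
x[2:, 1] = 0.0             # padding of the shorter sequence
lengths = [5, 2]           # must be in decreasing order by default

packed = pack_padded_sequence(x, lengths)
packed_out, hidden = rnn(packed)
outputs, out_lengths = pad_packed_sequence(packed_out)

print(outputs.shape)         # torch.Size([5, 2, 3])
print(hidden.shape)          # torch.Size([1, 2, 3])
print(out_lengths.tolist())  # [5, 2]
# Steps past the true length come back as zeros, and the final hidden state
# of the short sequence is its output at its last real step (index 1).
print(bool((outputs[2:, 1] == 0).all()))            # True
print(torch.allclose(hidden[0, 1], outputs[1, 1]))  # True
```

Without packing, the GRU would keep updating the short sequence's state on padding rows, and its final hidden state would no longer summarise only the real characters.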
Attention mechanism
There are three main ways to compute attention weights; this project uses the concatenate approach. The implementation is as follows:
class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super(Attention, self).__init__()
        self.hidden_dim = hidden_dim
        self.attn = nn.Linear(self.hidden_dim * 2, hidden_dim)
        self.v = nn.Parameter(torch.rand(hidden_dim))
        self.v.data.normal_(mean=0, std=1. / np.sqrt(self.v.size(0)))

    def forward(self, hidden, encoder_outputs):
        # encoder_outputs: (seq_len, batch_size, hidden_size)
        # hidden: (num_layers * num_directions, batch_size, hidden_size)
        max_len = encoder_outputs.size(0)
        h = hidden[-1].repeat(max_len, 1, 1)
        # h: (seq_len, batch_size, hidden_size)
        attn_energies = self.score(h, encoder_outputs)  # compute attention scores
        return F.softmax(attn_energies, dim=1)          # normalize with softmax

    def score(self, hidden, encoder_outputs):
        # (seq_len, batch_size, 2*hidden_size) -> (seq_len, batch_size, hidden_size)
        energy = torch.tanh(self.attn(torch.cat([hidden, encoder_outputs], 2)))
        energy = energy.permute(1, 2, 0)  # (batch_size, hidden_size, seq_len)
        v = self.v.repeat(encoder_outputs.size(1), 1).unsqueeze(1)  # (batch_size, 1, hidden_size)
        energy = torch.bmm(v, energy)     # (batch_size, 1, seq_len)
        return energy.squeeze(1)          # (batch_size, seq_len)
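The same concatenate-style scoring can be written without the class wrapper, which makes the tensor shapes easier to follow. This is a sketch with made-up sizes; `attn` and `v` stand in for the module's self.attn and self.v, and the matrix-vector product replaces the repeat-and-bmm step with an equivalent computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, batch_size, hidden_dim = 4, 2, 3

attn = nn.Linear(hidden_dim * 2, hidden_dim)  # stands in for self.attn
v = torch.randn(hidden_dim)                   # stands in for self.v

encoder_outputs = torch.randn(seq_len, batch_size, hidden_dim)
last_hidden = torch.randn(batch_size, hidden_dim)  # top-layer decoder state

# Pair the decoder state with every encoder position, then score each pair.
h = last_hidden.unsqueeze(0).expand(seq_len, -1, -1)               # [seq, batch, hid]
energy = torch.tanh(attn(torch.cat([h, encoder_outputs], dim=2)))  # [seq, batch, hid]
scores = (energy @ v).transpose(0, 1)                              # [batch, seq]
weights = F.softmax(scores, dim=1)

print(weights.shape)  # torch.Size([2, 4])
print(torch.allclose(weights.sum(dim=1), torch.ones(batch_size)))  # True
```

Each row of `weights` is a distribution over the source positions for one sentence in the batch.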
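Decoder
The Decoder is also an RNN. It takes three inputs: the initial token of the sentence, the hidden tensor, and the Encoder's output tensor. In this project the initial token is 1, the index that stands for '“'. Because this token is given as a vocabulary index, it also has to be mapped into the word-vector dimension so that it can be concatenated with the other tensors. The complete Decoder code is as follows: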
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, num_layers=2, dropout=0.2):
        super().__init__()
        self.embedding_dim = embedding_dim  # embedding dimension
        self.hid_dim = hidden_dim           # RNN hidden size
        self.output_dim = output_dim        # vocabulary size
        self.num_layers = num_layers        # number of RNN layers

        self.embedding = nn.Embedding(output_dim, embedding_dim)
        self.attention = Attention(hidden_dim)
        self.rnn = nn.GRU(embedding_dim + hidden_dim, hidden_dim,
                          num_layers=num_layers, dropout=dropout)
        self.out = nn.Linear(embedding_dim + hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, encoder_outputs):
        # input = [batch size]
        # hidden = [n layers, batch size, hid dim]
        # encoder_outputs = [sent len, batch size, hid dim]
        input = input.unsqueeze(0)                              # [1, batch size]
        embedded = self.dropout(self.embedding(input))          # [1, batch size, emb dim]
        attn_weight = self.attention(hidden, encoder_outputs)   # [batch size, sent len]
        # Weighted sum of encoder outputs: [batch size, 1, hid dim] -> [1, batch size, hid dim]
        context = attn_weight.unsqueeze(1).bmm(encoder_outputs.transpose(0, 1)).transpose(0, 1)
        emb_con = torch.cat((embedded, context), dim=2)         # [1, batch size, emb dim + hid dim]
        _, hidden = self.rnn(emb_con, hidden)                   # hidden = [n layers, batch size, hid dim]
        output = torch.cat((embedded.squeeze(0), hidden[-1], context.squeeze(0)), dim=1)
        output = F.log_softmax(self.out(output), 1)             # [batch size, vocab size]
        return output, hidden, attn_weight
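On top of these, we define a complete Seq2Seq class that combines the Encoder and the Decoder. The class has a parameter called teacher_forcing_ratio: during training, with this probability the Decoder's next input is the ground-truth token instead of the model's own prediction, which helps the model converge during backpropagation. The class has two methods, one used for training and one for prediction. The Seq2Seq class is named Net, and its code is as follows: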
import random

class Net(nn.Module):
    def __init__(self, encoder, decoder, device, teacher_forcing_ratio=0.5):
        super().__init__()
        self.encoder = encoder.to(device)
        self.decoder = decoder.to(device)
        self.device = device
        self.teacher_forcing_ratio = teacher_forcing_ratio

    def forward(self, src_seqs, src_lengths, trg_seqs):
        # src_seqs = [sent len, batch size, emb dim]
        # trg_seqs = [sent len, batch size]
        batch_size = src_seqs.shape[1]
        max_len = trg_seqs.shape[0]
        trg_vocab_size = self.decoder.output_dim
        # tensor to store decoder outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        # hidden is used as the initial hidden state of the decoder;
        # encoder_outputs is used to compute the context
        encoder_outputs, hidden = self.encoder(src_seqs, src_lengths)
        # the first input to the decoder is the start-symbol row
        output = trg_seqs[0, :]

        for t in range(1, max_len):  # skip the start symbol
            output, hidden, _ = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < self.teacher_forcing_ratio
            # next input: ground truth when teacher forcing, else the best prediction
            output = trg_seqs[t] if teacher_force else output.max(1)[1]
        return outputs

    def predict(self, src_seqs, src_lengths, max_trg_len=30, start_ix=1):
        batch_size = src_seqs.shape[1]
        trg_vocab_size = self.decoder.output_dim
        outputs = torch.zeros(max_trg_len, batch_size, trg_vocab_size).to(self.device)
        encoder_outputs, hidden = self.encoder(src_seqs, src_lengths)
        output = torch.LongTensor([start_ix] * batch_size).to(self.device)
        # encoder_outputs is trimmed to the longest sentence in the batch,
        # so size the attention buffer from it rather than from src_seqs
        attn_weights = torch.zeros(max_trg_len, batch_size, encoder_outputs.size(0)).to(self.device)
        for t in range(1, max_trg_len):
            output, hidden, attn_weight = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            attn_weights[t] = attn_weight
            output = output.max(1)[1]  # greedy choice becomes the next input
        return outputs, attn_weights
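The predict method returns per-step scores rather than text. A minimal sketch of turning such a tensor into strings is shown below; `greedy_decode` and the toy `ix2word` table are hypothetical, and the fake `outputs` tensor only imitates the shape that predict returns ([target len, batch size, vocab size], with row 0 left unused).

```python
import torch


def greedy_decode(outputs, ix2word, end_ix=2, skip=(0, 1)):
    """Pick the best index at every step and stop at the end symbol."""
    best = outputs[1:].argmax(dim=2).t().tolist()  # [batch size, trg len - 1]
    sentences = []
    for indices in best:
        chars = []
        for ix in indices:
            if ix == end_ix:
                break
            if ix in skip:  # padding or start symbol
                continue
            chars.append(ix2word.get(ix, ''))
        sentences.append(''.join(chars))
    return sentences


# Fake scores over a 6-entry vocabulary for one sentence and 4 decoding steps.
ix2word = {0: '~', 1: '“', 2: '”', 3: '林', 4: '间', 5: '竹'}
outputs = torch.full((5, 1, 6), -10.0)
for t, ix in enumerate([3, 4, 5, 2], start=1):
    outputs[t, 0, ix] = 0.0
print(greedy_decode(outputs, ix2word))  # ['林间竹']
```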
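Training the neural network
Training involves defining the loss function and the optimizer, preparing the data, and running gradient descent. Inside the network, tensors have the shape (sentence len, batch, embedding), whereas the loaded data has the shape (batch, sentence len, embedding), so some tensors have to be transposed.
The code that defines the network, the helper classes and so on is shown below: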
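The transposition mentioned above can be sketched with a small tensor; the sizes here are arbitrary.

```python
import torch

# (batch, sentence len, embedding) as loaded from the dataset
batch = torch.arange(2 * 3 * 4, dtype=torch.float32).reshape(2, 3, 4)
# (sentence len, batch, embedding) as expected by the network
time_major = batch.permute(1, 0, 2).contiguous()

print(time_major.shape)  # torch.Size([3, 2, 4])
# Step t of sentence b moves from batch[b, t] to time_major[t, b].
print(torch.equal(batch[1, 2], time_major[2, 1]))  # True
```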
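from torch.optim import Adam

# device is not defined in the original snippet; this is the usual choice
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Dataset and Model are helper classes provided by the project and are not shown here
data = Dataset()                 # data-loading helper
en = Encoder(200, 64)            # word-vector dimension 200, 64 RNN hidden units
de = Decoder(9133, 200, 64)      # vocabulary size 9133, word-vector dimension 200, 64 RNN hidden units
network = Net(en, de, device)    # the Seq2Seq instance
loss_fn = nn.CrossEntropyLoss()  # cross-entropy loss
optimizer = Adam(network.parameters())  # Adam optimizer
model = Model(data)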
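One detail is worth checking here: the Decoder already applies log_softmax to its output, and nn.CrossEntropyLoss applies log_softmax again internally. Since log-probabilities already exponentiate to a distribution that sums to 1, the second application should leave them unchanged. The sketch below, with random made-up logits, compares this pairing with the more common log_softmax plus nn.NLLLoss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 7)               # 5 positions, vocabulary of 7
targets = torch.tensor([1, 3, 0, 6, 2])

log_probs = F.log_softmax(logits, dim=1)  # what the Decoder returns

ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, targets)  # pairing used here
nll_on_log_probs = nn.NLLLoss()(log_probs, targets)          # the usual pairing
ce_on_logits = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(ce_on_log_probs, nll_on_log_probs))  # True
print(torch.allclose(ce_on_log_probs, ce_on_logits))      # True
```

So the loss value is the same either way; the extra log_softmax only costs a little computation.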
The training loop is as follows:
lowest_loss = 10
for epoch in range(args.EPOCHS):
    network.train()

    # Fetch one batch of training and test data
    x_train, y_train, x_test, y_test = data.next_batch(args.BATCH)
    # x_train shape: (batch, sen_len + 1, embed_dim)
    # y_train shape: (batch, sen_len + 1)
    # The extra last row / element stores the true sentence length
    input_lengths = [int(x) for x in x_train[:, -1, 0].tolist()]  # length of each upper line

    x_train = torch.from_numpy(x_train[:, :-1, :]).float().to(device)  # strip the length row
    y_train = torch.from_numpy(y_train[:, :-1]).long().to(device)      # strip the length element

    # Sort the batch by input length, longest first, as pack_padded_sequence expects
    seq_pairs = sorted(zip(x_train, y_train, input_lengths), key=lambda p: p[2], reverse=True)
    x_train, y_train, input_lengths = zip(*seq_pairs)
    x_train = torch.stack(x_train, dim=0).permute(1, 0, 2).contiguous()  # (sen_len, batch, embedding)
    y_train = torch.stack(y_train, dim=0).permute(1, 0).contiguous()     # (sen_len, batch)

    optimizer.zero_grad()
    outputs = network(x_train, list(input_lengths), y_train).float()
    # Compute the loss against the labels
    loss = loss_fn(outputs.view(-1, outputs.shape[2]), y_train.view(-1))
    loss.backward()   # backpropagate the loss
    optimizer.step()  # update the parameters with Adam
    print(loss.item())

    # Save the model whenever the training loss reaches a new low
    if loss.item() < lowest_loss:
        lowest_loss = loss.item()
        model.save_model(network, MODEL_PATH, overwrite=True)
        print("step %d, best lowest_loss %g" % (epoch, lowest_loss))
    print(str(epoch) + "/" + str(args.EPOCHS))
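The two bookkeeping steps in the loop, reading the lengths out of the last row and sorting the batch by length, can also be done with tensor operations. The following is a sketch on a made-up toy batch, not the project's code.

```python
import torch

# Toy batch shaped (batch, sen_len + 1, emb dim); the last row stores the length.
x = torch.zeros(3, 5, 2)
x[0, -1] = 2.0
x[1, -1] = 4.0
x[2, -1] = 3.0
y = torch.tensor([[1, 7, 2, 0], [1, 8, 2, 0], [1, 9, 2, 0]])  # (batch, sen_len)

lengths = x[:, -1, 0].long()  # tensor([2, 4, 3])
x = x[:, :-1, :]              # drop the length row
lengths, order = lengths.sort(descending=True)
x, y = x[order], y[order]     # reorder inputs and targets together

print(lengths.tolist())  # [4, 3, 2]
print(order.tolist())    # [1, 2, 0]
print(y[:, 1].tolist())  # [8, 9, 7]
print(x.shape)           # torch.Size([3, 4, 2])
```

Inputs and targets have to be reordered with the same permutation, otherwise upper and lower lines would no longer match. Another option is to pass enforce_sorted=False to pack_padded_sequence, which removes the need to sort by hand.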
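Summary
Using a Seq2Seq + Attention model, we have completed the task of writing couplets with a neural network. After a dozen or so epochs of training, the network produces lower lines with the same number of characters as the upper line. Producing well-matched couplets, however, takes more epochs of training, and readers can also try other methods to improve the parallelism between the two lines.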
