Tweet Text Classification with LSTM and Word Embeddings

Author | Emmanuella Anggi
Translated by | VK
Source | Towards Data Science

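In this article, I describe how to use fastText and GloVe word embeddings in an LSTM model for text classification.

I became interested in word embeddings while writing my paper on natural language generation, where word embeddings improved the performance of the model. In this article, I want to see how each approach (fastText, GloVe, and no pretrained embeddings) affects the predictions.

In my GitHub code, I also compare the results with a CNN. The dataset I use here comes from Kaggle. It consists of tweets, each labelled by whether or not it describes a real disaster. Honestly, when I first saw this dataset I immediately thought of BERT, which understands language better than the approaches I present in this article.

In any case, this article focuses on fastText and GloVe.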

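Data + Preprocessing

The data contains 7,613 tweets (the text column) and a label (the target column) that says whether or not each tweet is about a real disaster. 3,271 rows describe real disasters and 4,342 rows do not. If you want to learn more about the data, you can read about it here.

Link: https://www.kaggle.com/c/nlp-...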
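As a quick sanity check on the class counts quoted above, the following sketch computes the class shares and the accuracy that a trivial majority-class predictor would reach on the full dataset. The variable names are illustrative only; the counts are the ones stated in this article.

```python
# Class counts as quoted in the article.
n_disaster = 3271
n_not_disaster = 4342
n_total = n_disaster + n_not_disaster  # 7613 tweets

share_disaster = n_disaster / n_total
share_not_disaster = n_not_disaster / n_total

# Always predicting the majority class ("not a disaster") would be right
# this often on the full dataset, which gives a floor for accuracy figures.
majority_baseline = max(share_disaster, share_not_disaster)

print(f"disaster: {share_disaster:.1%}, not disaster: {share_not_disaster:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The classes are moderately imbalanced, so any accuracy figure is best read against this baseline rather than against 50%.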

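An example of a tweet that uses a disaster word to describe a real disaster:

“Forest fire near La Ronge Sask. Canada”

An example of a tweet that uses a disaster word but is not about a disaster:

“These boxes are ready to explode! Exploding Kittens finally arrived! gameofkittens #explodingkittens”

The data is split into a training set (6,090 rows) and a test set (1,523 rows) and then preprocessed. We will only use the text column and the target column.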

import pandas as pd
from sklearn.model_selection import train_test_split
# Load the Kaggle training file and hold out 20% of the rows for testing.
data = pd.read_csv('train.csv', sep=',', header=0)
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42, shuffle=True)
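The split sizes quoted above follow from `test_size=0.2`. Here is a minimal sketch of the arithmetic, assuming scikit-learn rounds the test share up and gives the remainder to the training set when only `test_size` is passed:

```python
import math

n_samples = 7613   # rows in train.csv, as stated in the article
test_size = 0.2    # same value passed to train_test_split

# Assumption: with a float test_size and no train_size, scikit-learn takes
# ceil(test_size * n_samples) rows for the test set and the rest for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```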

Preprocessing steps used here:

Lowercasing
Stop-word removal
Tokenization
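To make the three steps concrete, here is a small standard-library sketch applied to a tweet similar to the example from the dataset. The regular-expression tokenizer and the tiny `STOP_WORDS` set are stand-ins chosen for illustration. The article's own code uses `word_tokenize` and a `stop_words` collection instead, and leaves lowercasing to the Keras `Tokenizer`.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"a", "an", "the", "are", "to", "is"}

def preprocess(text):
    """Lowercase, tokenize, and drop stop words from one tweet."""
    lowered = text.lower()
    # Simplified tokenizer: keep runs of word characters, with an optional
    # '#' or '@' prefix so hashtags and mentions survive as single tokens.
    tokens = re.findall(r"[#@]?\w+", lowered)
    return [tok for tok in tokens if tok not in STOP_WORDS]

tweet = "These boxes are ready to explode! #explodingkittens"
print(preprocess(tweet))
```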

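from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from tqdm import tqdm
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
# MAX_NB_WORDS, max_seq_len and label_names are not defined in this excerpt;
# they come from the full notebook.
# Assumption: stop_words is NLTK's English stop-word list (it is not defined
# in this excerpt either).
stop_words = set(stopwords.words('english'))
raw_docs_train = train_df['text'].tolist()
raw_docs_test = test_df['text'].tolist()
num_classes = len(label_names)
# Tokenize each tweet and drop stop words.
processed_docs_train = []
for doc in tqdm(raw_docs_train):
    tokens = word_tokenize(doc)
    filtered = [word for word in tokens if word not in stop_words]
    processed_docs_train.append(" ".join(filtered))
processed_docs_test = []
for doc in tqdm(raw_docs_test):
    tokens = word_tokenize(doc)
    filtered = [word for word in tokens if word not in stop_words]
    processed_docs_test.append(" ".join(filtered))
# Fit the vocabulary; lower=True lowercases the text.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)
tokenizer.fit_on_texts(processed_docs_train + processed_docs_test)
# Convert tweets to integer sequences and pad them to a common length.
word_seq_train = tokenizer.texts_to_sequences(processed_docs_train)
word_seq_test = tokenizer.texts_to_sequences(processed_docs_test)
word_index = tokenizer.word_index
word_seq_train = sequence.pad_sequences(word_seq_train, maxlen=max_seq_len)
word_seq_test = sequence.pad_sequences(word_seq_test, maxlen=max_seq_len)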
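The last part of the code above turns each cleaned tweet into a fixed-length list of integers. The sketch below mimics that idea in plain Python so the result is easy to inspect. Words are numbered by descending frequency starting at 1 (0 is kept for padding), and sequences are left-padded or left-truncated to `max_len`, which to my understanding matches the default behaviour of the Keras `Tokenizer` and `pad_sequences`. The helper names `build_word_index` and `pad` are hypothetical and not part of Keras.

```python
from collections import Counter

def build_word_index(docs):
    """Map each word to an integer id; more frequent words get smaller ids."""
    counts = Counter(word for doc in docs for word in doc.split())
    # Sort by descending count; ties keep first-seen order because Counter
    # preserves insertion order and sorted() is stable.
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}

def pad(seq, max_len):
    """Left-pad with 0, or keep only the last max_len ids if too long."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

docs = ["forest fire near town", "fire fire everywhere"]
word_index = build_word_index(docs)
sequences = [[word_index[w] for w in doc.split()] for doc in docs]
padded = [pad(seq, max_len=5) for seq in sequences]

print(word_index)
print(padded)
```

Every row of `padded` has the same length, which is what the embedding layer later expects as input.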

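Word Embeddings

Step 1: Download the pretrained models

The first step in using fastText and GloVe is to download each pretrained model. I use Google Colab so that my laptop does not have to hold the models in memory, so I downloaded them with the requests library and unzipped them directly in the notebook.

I used the largest pretrained model available for each of the two embeddings. The fastText model provides 2 million word vectors, and the GloVe model provides 2.2 million word vectors.

Downloading the fastText pretrained model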

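import requests, zipfile, io
# crawl-300d-2M is the 2-million-vector model that is loaded in step 2.
zip_file_url = "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"
r = requests.get(zip_file_url)
# Read the archive from memory and extract it into the working directory.
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()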

Downloading the GloVe pretrained model

import requests, zipfile, io
zip_file_url = "http://nlp.stanford.edu/data/glove.840B.300d.zip"
r = requests.get(zip_file_url)
# Read the archive from memory and extract it into the working directory.
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
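Both download snippets rely on the same pattern: the response body is wrapped in `io.BytesIO` so that `zipfile` can read the archive straight from memory. The sketch below shows that pattern without touching the network by first building a small archive in memory. The file name `toy.vec` and its contents are made up for the demonstration.

```python
import io
import os
import tempfile
import zipfile

# Build a tiny zip archive in memory to stand in for r.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("toy.vec", "2 3\nfire 0.1 0.2 0.3\nflood 0.4 0.5 0.6\n")
zip_bytes = buffer.getvalue()

# Same pattern as in the article: wrap the raw bytes and extract everything.
target_dir = tempfile.mkdtemp()
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
    z.extractall(target_dir)
    extracted_names = z.namelist()

print(extracted_names)
print(os.listdir(target_dir))
```

Keep in mind that this approach holds the whole archive in memory, so for the multi-gigabyte embedding files the machine needs enough RAM.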

Step 2: Load the pretrained models

fastText documents a format for loading its word vectors, and I use the same approach to load both models.

import codecs
import numpy as np
from tqdm import tqdm
embeddings_index = {}
# fastText
f = codecs.open('crawl-300d-2M.vec', encoding='utf-8')
# GloVe
# f = codecs.open('glove.840B.300d.txt', encoding='utf-8')
for line in tqdm(f):
    # Each line is a word followed by its vector components.
    values = line.rstrip().rsplit(' ')
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
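For reference, these vector files are plain text: each line holds a word followed by its vector components, separated by spaces. fastText `.vec` files additionally start with a header line giving the vocabulary size and the dimension. The sketch below parses a made-up three-dimensional sample with the standard library and skips that header explicitly. The helper `load_vectors` and the sample data are hypothetical.

```python
import io

def load_vectors(handle):
    """Parse word vectors from a text handle into a dict of float lists."""
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # fastText .vec files begin with "<vocab_size> <dim>"; GloVe text
        # files have no such header, so only skip a two-field first line.
        if line_no == 0 and len(parts) == 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

sample = io.StringIO("2 3\nfire 0.1 0.2 0.3\nflood 0.4 0.5 0.6\n")
vectors = load_vectors(sample)

print(sorted(vectors))
print(len(vectors["fire"]))
```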

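Step 3: Embedding matrix

The embedding matrix supplies the weights for each word in the training data.

However, some words may not be present in the pretrained vectors, such as typos, abbreviations, or usernames. These words are stored in a list so that we can compare how well fastText and GloVe cover the vocabulary.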

# embed_dim is the vector dimension (not defined in this excerpt).
words_not_found = []
nb_words = min(MAX_NB_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((nb_words, embed_dim))
for word, i in word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if (embedding_vector is not None) and len(embedding_vector) > 0:
        # Words found in the pretrained vectors get their vector as row i.
        embedding_matrix[i] = embedding_vector
    else:
        # Words missing from the pretrained vectors keep an all-zero row.
        words_not_found.append(word)
print('number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
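A toy version of the same construction makes the bookkeeping easier to follow. The vocabulary, the 3-dimensional vectors, and the out-of-vocabulary token `@user123` below are invented for the example. Note that row 0 is never assigned because the tokenizer's word indices start at 1 (index 0 is reserved for padding), so the zero-row count includes it in addition to the words that were not found.

```python
import numpy as np

embed_dim = 3  # toy dimension; the pretrained files in the article are 300d

# Toy stand-ins for tokenizer.word_index and the loaded embeddings_index.
word_index = {"fire": 1, "flood": 2, "@user123": 3}
embeddings_index = {
    "fire": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "flood": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

nb_words = len(word_index) + 1  # +1 because index 0 is reserved for padding
embedding_matrix = np.zeros((nb_words, embed_dim))
words_not_found = []

for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
    else:
        words_not_found.append(word)

# Rows that stayed all-zero: the padding row plus every missing word.
n_null = int(np.sum(~embedding_matrix.any(axis=1)))

print(words_not_found)
print(n_null)
```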

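The number of null word embeddings is 9,175 with fastText and 9,186 with GloVe.

LSTM

You can fine-tune the hyperparameters or the architecture, but I will use a very simple architecture that consists of an embedding layer, an LSTM layer, a Dense layer, and a Dropout layer.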

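import tensorflow as tf
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
model = tf.keras.Sequential()
# Frozen embedding layer initialised with the pretrained vectors.
model.add(Embedding(nb_words, embed_dim, input_length=max_seq_len, weights=[embedding_matrix], trainable=False))
# Note: return_sequences=True makes the LSTM emit one output per time step,
# so the Dense layers below are applied to every position in the tweet.
model.add(Bidirectional(LSTM(32, return_sequences=True)))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.summary()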
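To get a feel for the size of this network, the trainable parameters can be counted by hand. The sketch below assumes 300-dimensional embeddings (both pretrained files are 300d) and the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The embedding layer is excluded because it is created with `trainable=False`, and Dropout has no parameters.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Parameters of a fully connected layer: weights plus one bias per unit."""
    return input_dim * units + units

embed_dim = 300   # assumption: dimension of the pretrained vectors
lstm_units = 32

# Bidirectional wraps two LSTMs and concatenates their outputs (2 * 32 = 64).
bilstm = 2 * lstm_params(embed_dim, lstm_units)
hidden = dense_params(2 * lstm_units, 32)
output = dense_params(32, 1)

print(bilstm, hidden, output, bilstm + hidden + output)
```

Almost all trainable parameters sit in the bidirectional LSTM, so the number of LSTM units is the first knob to turn if the model overfits or underfits.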

from tensorflow.keras.callbacks import EarlyStopping
# Assumption: y_train holds the labels from the target column (it is not
# defined in this excerpt).
y_train = train_df['target'].values
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Stop once val_loss has not improved for 3 consecutive epochs.
es_callback = EarlyStopping(monitor='val_loss', patience=3)
history = model.fit(word_seq_train, y_train, batch_size=256, epochs=30, validation_split=0.3, callbacks=[es_callback], shuffle=False)
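`EarlyStopping(monitor='val_loss', patience=3)` ends training once the validation loss has gone three epochs in a row without beating its best value so far. The function below is a simplified re-implementation of that rule for illustration, not Keras' own code, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value is reached at epoch 2,
# and the later improvement at epoch 6 is never seen.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.40]
print(stopping_epoch(losses))
```

With `epochs=30` as an upper bound, this rule usually decides how long training actually runs.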

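Results

fastText reaches an accuracy of 83%, while GloVe reaches 81%. Compared with the model without pretrained word embeddings (68%), word embeddings clearly have a significant effect on performance.

Accuracy with fastText embeddings

Accuracy with GloVe embeddings

Accuracy without word embeddings

If you want to apply the code to other datasets, you can find the complete code on GitHub.

Complete code on GitHub: https://github.com/emmanuella...

Original article: https://towardsdatascience.co...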
