About NLP: NLP from Scratch - Text Classification with Machine Learning

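Any classification problem requires mining useful features from the data, and text classification is no exception. This article introduces several ways of extracting features from text, which are also the most basic methods for processing text.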

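When training a machine learning algorithm, suppose we are given $N$ samples, each with $M$ features. Together they form an $N \times M$ sample matrix, on which the algorithm is trained and then makes predictions. Similarly, in computer vision the pixels of an image can be treated as features: each image is viewed as a height × width × 3 feature map, a three-dimensional array that the computer can operate on directly.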

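In natural language processing, however, this approach does not work directly, because text has variable length. Methods that represent text as numbers or vectors a computer can operate on are generally called word embedding methods. Word embedding maps variable-length text into a fixed-length space and is the first step of text classification.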

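One-hot encoding is commonly used to encode categories: each position in the code corresponds to one category, exactly one position is 1, and all the others are 0. Following the same idea, we can use one-hot encoding to represent each word. Take the following two sentences:

Sentence 1: 我爱北京天安门 (I love Beijing Tiananmen)
Sentence 2: 我喜欢上海 (I like Shanghai)

First, we collect all the distinct characters in the two sentences and assign each one an index:

{
'我': 1, '爱': 2, '北': 3, '京': 4, '天': 5,
'安': 6, '门': 7, '喜': 8, '欢': 9, '上': 10, '海': 11
}

There are 11 characters in total, so each character can be converted into an 11-dimensional sparse vector: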

我：[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
爱：[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
…
海：[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
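
The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the original article: the helper names `build_vocab` and `one_hot` are made up for this example, and indices start at 1 only to match the table above.

```python
sentences = ["我爱北京天安门", "我喜欢上海"]

def build_vocab(texts):
    """Assign each distinct character a 1-based index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 1
    return vocab

def one_hot(ch, vocab):
    """Return a list of len(vocab) zeros with a single 1 at the character's position."""
    vec = [0] * len(vocab)
    vec[vocab[ch] - 1] = 1
    return vec

vocab = build_vocab(sentences)
print(len(vocab))            # 11
print(one_hot("我", vocab))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("海", vocab))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

Note that the vector length equals the vocabulary size, so every new character in the corpus adds one more dimension.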

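This approach seems reasonable, but it has two obvious problems:

Even a moderately complex corpus already contains a huge number of words, and a single word can take several forms. If every word is represented by its own one-hot vector, the dimensionality explodes.
One-hot vectors cannot model the relationships between words (any two distinct one-hot vectors are orthogonal), yet this information is an important feature of text.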
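
The second problem can be checked numerically. The sketch below is an illustration under assumptions: the three-word toy vocabulary and the helper names are hypothetical. It shows that the dot product of any two distinct one-hot vectors is 0, so their cosine similarity is 0 no matter how closely related the words are.

```python
import math

# Hypothetical toy vocabulary; "爱" (love) and "喜欢" (like) are close in meaning.
vocab = {"爱": 0, "喜欢": 1, "上海": 2}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(one_hot("爱"), one_hot("喜欢")))  # 0.0, although the words are related
print(cosine(one_hot("爱"), one_hot("上海")))  # 0.0, same score for unrelated words
print(cosine(one_hot("爱"), one_hot("爱")))    # 1.0
```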

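Bag of words (BoW), also called the bag-of-words model, is a method for extracting features from text for use in modeling.

A bag-of-words model is a text representation that describes the occurrence of words within a document. It has two main parts:

A vocabulary of known words.
A measure of the presence of those known words.

It is called a "bag" because BoW only cares whether known words occur in the document, not about their order or the structure of the document. It uses the count of each word in the document as a feature.

Building a BoW model involves the following steps:

Collect data
For example: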

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
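Design the vocabulary
We can add the words from the corpus (the collected data) that we consider important to the vocabulary. The vocabulary looks like this: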

“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”
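Create document vectors
The goal of this step is to convert each document (think of it as a sentence containing a variable number of words) into a fixed-length vector whose length equals the number of words in the vocabulary.
How do we convert a document into a single vector? The simplest way is to use a Boolean value to indicate whether each vocabulary word occurs in the document: 1 if it occurs, 0 otherwise.
For example, the first document above ("It was the best of times,") yields: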

“it”= 1
“was”= 1
“the”= 1
“best”= 1
“of”= 1
“times”= 1
“worst”= 0
“age”= 0
“wisdom”= 0
“foolishness”= 0

The corresponding vector is:

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
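
The Boolean document vector above can be reproduced with a short sketch. It is only an illustration: the helper names `tokenize` and `binary_vector` are invented here, and the tokenizer simply lowercases the text and keeps runs of letters, which is an assumption that happens to be enough for this example.

```python
import re

# Vocabulary from the example above, in the same order.
vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def binary_vector(document, vocabulary):
    """1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]

print(binary_vector("It was the best of times,", vocabulary))
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(binary_vector("it was the age of wisdom,", vocabulary))
# [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
```

Every document maps to a vector of length 10, regardless of how many words it contains.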

In sklearn, we can use the built-in tools to implement BoW quickly:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Use the number of times each vocabulary word occurs in a document as its feature
counter = CountVectorizer()
vectors = counter.fit_transform(corpus)
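
To make it clearer what kind of matrix a count-based vectorizer produces, here is a library-free sketch of count features on the same corpus. It is an approximation written for illustration, not scikit-learn's implementation: it assumes lowercasing, tokens of two or more word characters, and an alphabetically sorted vocabulary, which is meant to mirror `CountVectorizer`'s default settings. The names `tokenize` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted list of every distinct token in the corpus.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

def count_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc) for doc in corpus]

print(vocabulary)
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
for row in vectors:
    print(row)
# [0, 1, 1, 1, 0, 0, 1, 0, 1]
# [0, 2, 0, 1, 0, 1, 1, 0, 1]
# [1, 0, 0, 1, 1, 0, 1, 1, 1]
# [0, 1, 1, 1, 0, 0, 1, 0, 1]
```

The second row contains a 2 because "document" occurs twice in that sentence. The first and last rows are identical even though the sentences differ, which shows that word order is discarded.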

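Simply put, an N-gram model predicts the probability of a word from the N - 1 words that precede it. When N = 2 (N - 1 = 1), the probability of a word depends only on the word immediately before it.
How do we predict a word from the N - 1 words that have already appeared?
First, we need a corpus that contains a large number of sentences. Suppose the corpus contains the following sentences: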

1.He said thank you.
2.He said bye as he walked through the door.
3.He went to San Diego.
4.San Diego has nice weather.
5.It is raining in San Francisco.

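Suppose we set N to 2, that is, we predict the probability of a word based only on the word before it. In general, the probability is computed as
$P(w_n \mid w_p) = \frac{count(w_p\ w_n)}{count(w_p)}$, where $w_p$ is the previous word, $w_n$ is the current word, and count is the counting function.

For example, the probability that you occurs after thank, P(you|thank), is

number of occurrences of "thank you" / number of occurrences of "thank"
= 1 / 1
= 1

In other words, in this corpus, whenever thank appears, you always follows it.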
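
The same estimate can be computed for any word pair in the corpus. The following is a minimal sketch under stated assumptions: words are lowercased, punctuation is dropped, and bigrams never cross sentence boundaries. The function name `bigram_probability` is made up for this example.

```python
import re
from collections import Counter

corpus = [
    "He said thank you.",
    "He said bye as he walked through the door.",
    "He went to San Diego.",
    "San Diego has nice weather.",
    "It is raining in San Francisco.",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    # Assumption: lowercase and drop punctuation; bigrams stay within a sentence.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_probability(prev_word, word):
    """Estimate P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_probability("thank", "you"))  # 1.0
print(bigram_probability("san", "diego"))  # 2 / 3
print(bigram_probability("he", "said"))    # 2 / 4 = 0.5
```

With such a small corpus the estimates are crude: any pair that never appears gets probability 0, which is why practical N-gram models are trained on much larger corpora.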

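TF-IDF (term frequency-inverse document frequency) is a statistical measure used to evaluate how relevant a word is to one document in a collection of documents. It is widely used in information retrieval and text mining.
For a word in a document, its TF-IDF is obtained by multiplying two different metrics:

term frequency (TF): $TF(t) = \frac{count(t)}{total\ terms} = \frac{\text{number of times term } t \text{ occurs in the current document}}{\text{total number of terms in the current document}}$
inverse document frequency (IDF): $IDF(t) = \log\left(\frac{count(documents)}{count(documents\ containing\ term\ t)}\right) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$

The base of the logarithm is a matter of convention; the example here uses base 10. Suppose a document contains 100 words and the word cat occurs 3 times; then TF(cat) = 3/100 = 0.03. Suppose we have 1e7 documents and cat appears in 1e3 of them; then IDF(cat) = log10(1e7/1e3) = 4. The TF-IDF weight is therefore 0.03 * 4 = 0.12.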
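
The arithmetic in this example can be verified with a few lines of Python. This is a sketch of the plain definitions given above with a base-10 logarithm; the function names `term_frequency` and `inverse_document_frequency` are chosen for illustration only.

```python
import math

def term_frequency(term_count, total_terms):
    """Fraction of the document's terms that are the given term."""
    return term_count / total_terms

def inverse_document_frequency(num_documents, num_documents_with_term):
    """Base-10 logarithm, matching the worked example."""
    return math.log10(num_documents / num_documents_with_term)

tf = term_frequency(3, 100)
idf = inverse_document_frequency(1e7, 1e3)
print(tf)                  # 0.03
print(idf)                 # 4.0
print(round(tf * idf, 2))  # 0.12
```

Library implementations often use a different variant. For instance, scikit-learn's `TfidfVectorizer` by default uses a smoothed natural-logarithm IDF and then L2-normalizes each document vector, so its values will not match this hand calculation.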

Now let us return to the competition data and try using TF-IDF to build features for classification:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier

root_dir = '/content/drive/My Drive/competitions/NLPNews'

# Memory is limited, so only read the first 10000 rows
train_df = pd.read_csv(root_dir + '/train.csv', sep='\t', nrows=10000)

# max_features is the vocabulary size: keep only the max_features most frequent terms
# ngram_range=(1, 3) uses unigrams, bigrams and trigrams as terms
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

# Build the classifier
clf = RidgeClassifier()

# Split the dataset: 90% for training, 10% held out for evaluation
x_train, x_test, y_train, y_test = train_test_split(train_test, train_df['label'], test_size=0.1, random_state=0)

# Train the model
clf.fit(x_train, y_train)

# Predict on the held-out split
y_pred = clf.predict(x_test)

# Print the macro-averaged F1 score
print(f1_score(y_test, y_pred, average='macro'))

0.8802400152512864
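
The score printed above is a macro-averaged F1: the F1 score is computed separately for each class, and the per-class scores are averaged with equal weight. The sketch below recomputes that by hand. The `y_true` and `y_pred` lists are invented for illustration and have nothing to do with the competition data, and `macro_f1` is a hypothetical helper, not a scikit-learn function.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); fall back to 0 if the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_f1(y_true, y_pred))  # mean of 0.5, 0.8 and 2/3
```

Because every class counts equally, a rare class that is predicted poorly lowers the macro average as much as a frequent one would.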

Through this exercise, I gained a basic understanding of text representation methods and of how to build features for a text dataset.

