
Artificial intelligence: word2vec + xgb sentence classification, a simple example

Data

30,000 texts, split into train / val / test at a 6:2:2 ratio.
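As a quick sanity check on the 6:2:2 ratio, the sketch below works out how many samples land in each part. The helper `split_sizes` is a hypothetical name introduced only for illustration; it uses plain integer arithmetic and does not claim to reproduce the exact rounding that sklearn's `train_test_split` applies.

```python
# Hypothetical helper (not part of the original post): sizes of a 6:2:2 split.
def split_sizes(n_total, ratios=(6, 2, 2)):
    """Return (n_train, n_val, n_test) for the given integer ratios."""
    total = sum(ratios)
    n_train = n_total * ratios[0] // total
    n_val = n_total * ratios[1] // total
    # Whatever is left over goes to the test set, so the parts always add up.
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test

sizes = split_sizes(30000)
print(sizes)
```

The code in this post gets the same proportions in two steps: it first keeps 60% for training, then halves the remaining 40% into validation and test.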

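Tools and method

xgboost, sklearn, and gensim's word2vec.
word2vec embeds each word; the word vectors of a sentence are then combined directly into a "sentence vector" (the description says sum, while the code below takes the mean), and xgb classifies the sentence vectors.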
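To make the "combine word vectors into a sentence vector" step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors, the tokens, and the function name `sentence_vector` are all made up for illustration; real word2vec vectors would come from the trained model and have far more dimensions.

```python
# Toy 3-dimensional "embeddings"; the values are invented for illustration.
toy_vectors = {
    "i": [1.0, 0.0, 2.0],
    "like": [0.0, 2.0, 2.0],
    "cats": [2.0, 4.0, 2.0],
}

def sentence_vector(tokens, vectors, mode="mean"):
    """Combine the word vectors of a sentence element-wise."""
    rows = [vectors[t] for t in tokens]
    # zip(*rows) walks the vectors dimension by dimension.
    summed = [sum(col) for col in zip(*rows)]
    if mode == "sum":
        return summed
    return [s / len(rows) for s in summed]

tokens = ["i", "like", "cats"]
sum_vec = sentence_vector(tokens, toy_vectors, mode="sum")
mean_vec = sentence_vector(tokens, toy_vectors)
print(sum_vec, mean_vec)
```

Sum and mean differ only by a factor of the sentence length, so the mean keeps long and short sentences on a comparable scale, while the sum lets longer sentences produce larger vectors.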

Code

import jieba
import xgboost as xgb
from sklearn.model_selection import train_test_split
import numpy as np
from gensim.models import Word2Vec


# reorganize data
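def get_split_sentences(file_path):
    # one sentence per line; each line is tokenized with jieba
    res_sen=[]
    # assumes the files are UTF-8 encoded
    with open(file_path,encoding='utf-8') as f:
        for line in f:
            split_query=jieba.lcut(line.strip())
            res_sen.append(split_query)
    return res_sen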

label2_sentences=get_split_sentences('label2.csv')
label0_sentences=get_split_sentences('label0.csv')
label1_sentences=get_split_sentences('label1.csv')

all_sentences=[]
all_sentences.extend(label0_sentences)
all_sentences.extend(label1_sentences)
all_sentences.extend(label2_sentences)

# set params
emb_size=128
win=3
model=Word2Vec(sentences=all_sentences,vector_size=emb_size,window=win,min_count=1)
# retrieve word embeddings
w2vec=model.wv

# assemble sentence embeddings
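def assemble_x(w2vec,sentences):
    # w2vec is gensim's KeyedVectors (model.wv), indexed by word
    sen_vs=[]
    for sen in sentences:
        if sen:
            # average the word vectors of the sentence
            sen_v=np.vstack([w2vec[w] for w in sen]).mean(axis=0)
        else:
            # a blank input line yields no tokens: fall back to a zero vector
            sen_v=np.zeros(w2vec.vector_size,dtype=np.float32)
        sen_vs.append(sen_v)
    return np.array(sen_vs)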

# ready the data for training
x=assemble_x(w2vec,all_sentences)
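# labels follow the order in which all_sentences was built (13000 / 13000 / 4000 sentences here)
y=np.array([0]*len(label0_sentences)+[1]*len(label1_sentences)+[2]*len(label2_sentences))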
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.6,shuffle=True)
x_val,x_test,y_val,y_test=train_test_split(x_test,y_test,train_size=0.5,shuffle=True)



dtrain=xgb.DMatrix(x_train,y_train)
dval=xgb.DMatrix(x_val,y_val)
dtest=xgb.DMatrix(x_test,y_test)

params={
    'booster': 'gbtree',
    'objective': 'multi:softmax',  
    'num_class': 3,               
    'max_depth': 20,              
}
evals=[(dtrain,'train'),(dval,'valid')]
# named bst so it does not shadow the Word2Vec model above
bst=xgb.train(params,dtrain=dtrain,evals=evals)
preds=bst.predict(dtest)

def get_scores(preds,gt):
    from sklearn import metrics
    # print ('AUC: %.4f' % metrics.roc_auc_score(gt,preds))
    print('ACC: %.4f' % metrics.accuracy_score(gt,preds))
    # macro: unweighted mean of the per-class scores
    print('macro')
    print('Recall: %.4f' % metrics.recall_score(gt,preds,average='macro'))
    print('F1-score: %.4f' % metrics.f1_score(gt,preds,average='macro'))
    print('Precision: %.4f' % metrics.precision_score(gt,preds,average='macro'))

    # micro: computed from counts pooled over all classes
    print('\nmicro:')
    print('Recall: %.4f' % metrics.recall_score(gt,preds,average='micro'))
    print('F1-score: %.4f' % metrics.f1_score(gt,preds,average='micro'))
    print('Precision: %.4f' % metrics.precision_score(gt,preds,average='micro'))

get_scores(preds,y_test)

Results

ACC: 0.9402

macro
Recall: 0.9330
F1-score: 0.9391
Precision: 0.9459

micro:
Recall: 0.9402
F1-score: 0.9402
Precision: 0.9402
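In the output above, the three micro scores all equal the accuracy, while the macro scores differ. The sketch below illustrates why on a tiny made-up example; the labels, predictions, and helper names are hypothetical and are not the data from this post. In single-label multi-class classification every wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class), so the pooled micro precision and recall both reduce to the fraction of correct predictions.

```python
def per_class_counts(gt, preds, labels):
    """Return {label: (tp, fp, fn)} for each class."""
    counts = {}
    for c in labels:
        tp = sum(1 for g, p in zip(gt, preds) if g == c and p == c)
        fp = sum(1 for g, p in zip(gt, preds) if g != c and p == c)
        fn = sum(1 for g, p in zip(gt, preds) if g == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts

def macro_recall(counts):
    # Average the per-class recalls, giving every class equal weight.
    # Assumes every class occurs at least once in gt (no zero division).
    recalls = [tp / (tp + fn) for tp, fp, fn in counts.values()]
    return sum(recalls) / len(recalls)

def micro_recall(counts):
    # Pool the counts over all classes first, then compute one recall.
    tp = sum(c[0] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return tp / (tp + fn)

def micro_precision(counts):
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    return tp / (tp + fp)

# Made-up, imbalanced toy data: class 2 is the smallest class.
gt = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
preds = [0, 0, 0, 0, 1, 1, 1, 0, 2, 0]

counts = per_class_counts(gt, preds, labels=[0, 1, 2])
accuracy = sum(1 for g, p in zip(gt, preds) if g == p) / len(gt)
print(accuracy, micro_recall(counts), micro_precision(counts), macro_recall(counts))
```

Macro averaging gives each class the same weight regardless of its size, so a weaker score on a small class pulls the macro value down more than it pulls the accuracy down. That is one plausible reading of the macro recall here being lower than the accuracy, though the per-class scores would be needed to confirm it.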

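Summary

Even with sentence vectors built this crudely, by simply combining the word embeddings, accuracy reaches about 94%. There are two reasons: first, the task itself is simple; second, xgb's boosting works well.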
