Data Download

Download the data

!wget https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/train_set.csv.zip
!wget https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/test_a.csv.zip
!wget https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/test_a_sample_submit.csv

Unzip the data. There are three files in total: the training data (train_set.csv), the test data (test_a.csv), and the sample submission file (test_a_sample_submit.csv).

!mkdir /content/drive/My\ Drive/competitions/NLPNews
!unzip /content/test_a.csv.zip -d /content/drive/My\ Drive/competitions/NLPNews/test
!unzip /content/train_set.csv.zip -d /content/drive/My\ Drive/competitions/NLPNews/train
!mv /content/test_a_sample_submit.csv /content/drive/My\ Drive/competitions/NLPNews/submit.csv
!mv /content/drive/My\ Drive/competitions/NLPNews/test/test_a.csv /content/drive/My\ Drive/competitions/NLPNews/test.csv
!mv /content/drive/My\ Drive/competitions/NLPNews/train/train_set.csv /content/drive/My\ Drive/competitions/NLPNews/train.csv
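The same extraction can also be done from Python with the standard-library zipfile module, which makes it easy to confirm that each CSV landed where the later cells expect it. This is a minimal sketch, not the notebook's own code; the helper name extract_and_check and its arguments are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_check(zip_path, dest_dir, expected_name):
    """Extract zip_path into dest_dir and return the path of expected_name.

    Hypothetical helper for illustration; raises FileNotFoundError if the
    expected file is not present after extraction.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    extracted = dest / expected_name
    if not extracted.is_file():
        raise FileNotFoundError(f"{expected_name} not found in {zip_path}")
    return extracted
```

For example, extract_and_check('/content/train_set.csv.zip', '/content/drive/My Drive/competitions/NLPNews/train', 'train_set.csv') would return the path of the extracted training file, or fail loudly if the archive did not contain it.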

Read the data

import pandas as pd
import os
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline

root_dir = '/content/drive/My Drive/competitions/NLPNews'

train_df = pd.read_csv(root_dir+'/train.csv', sep='\t')
train_df['word_cnt'] = train_df['text'].apply(lambda x: len(x.split(' ')))
train_df.head(10)
   label                                               text  word_cnt
0      2  2967 6758 339 2021 1854 3731 4109 3792 4149 15...      1057
1     11  4464 486 6352 5619 2465 4802 1452 3137 5778 54...       486
2      3  7346 4068 5074 3747 5681 6093 1777 2226 7354 6...       764
3      2  7159 948 4866 2109 5520 2490 211 3956 5520 549...      1570
4      3  3646 3055 3055 2490 4659 6065 3370 5814 2465 5...       307
5      9  3819 4525 1129 6725 6485 2109 3800 5264 1006 4...      1050
6      3  307 4780 6811 1580 7539 5886 5486 3433 6644 58...       267
7     10  26 4270 1866 5977 3523 3764 4464 3659 4853 517...       876
8     12  2708 2218 5915 4559 886 1241 4819 314 4261 166...       314
9      3  3654 531 1348 29 4553 6722 1474 5099 7541 307 ...      1086
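To make the word_cnt column concrete: each text is a string of space-separated anonymized token ids, so the count is simply the number of pieces after splitting. The sketch below reproduces that computation on a tiny hand-made frame (the two rows are invented for illustration, not taken from the dataset) and shows what is assumed to be an equivalent vectorized form using the pandas string accessor.

```python
import pandas as pd

# Toy rows in the same layout as train.csv (values invented for illustration).
toy_df = pd.DataFrame({
    "label": [2, 11],
    "text": ["2967 6758 339", "4464 486 6352 5619"],
})

# Same computation as in the notebook: split on single spaces and count the pieces.
toy_df["word_cnt"] = toy_df["text"].apply(lambda x: len(x.split(" ")))

# Vectorized alternative using the pandas string accessor.
vectorized_cnt = toy_df["text"].str.split(" ").str.len()

print(toy_df["word_cnt"].tolist())  # [3, 4]
print(vectorized_cnt.tolist())      # [3, 4]
```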

Inspect the data

train_df['word_cnt'] = train_df['word_cnt'].apply(int)
train_df['word_cnt'].describe()
count 200000.000000
mean 907.207110
std 996.029036
min 2.000000
25% 374.000000
50% 676.000000
75% 1131.000000
max 57921.000000
Name: word_cnt, dtype: float64
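The summary shows a long right tail: the median is 676 tokens while the maximum is 57921. One way to act on that, assuming the texts will later be truncated to a fixed length (an assumption, not something the notebook does here), is to check what fraction of documents fit within a candidate length. The function name coverage and the sample lengths are hypothetical.

```python
import pandas as pd


def coverage(word_counts, max_len):
    """Return the fraction of documents whose length is at most max_len tokens."""
    counts = pd.Series(word_counts)
    return float((counts <= max_len).mean())


# Invented lengths for illustration; in the notebook this would be train_df['word_cnt'].
sample_counts = [2, 374, 676, 1131, 57921]
print(coverage(sample_counts, 1131))  # 0.8
```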
plt.hist(train_df['word_cnt'], bins=255)
plt.title('word counts statistics')
plt.xlabel('word counts')
plt.show()

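# value_counts() sorts by frequency; sort_index() reorders by label id (0-13)
# so that each bar sits at the position of its own label.
label_counts = train_df['label'].value_counts().sort_index()
plt.bar(label_counts.index, label_counts.values)
plt.title('label counts statistics')
# plt.xticks(label_counts.index, labels=labels)
plt.xlabel('label')
plt.show()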

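# Label names listed in label-id order (0-13), translated from the Chinese originals.
labels = ['Technology', 'Stocks', 'Sports', 'Entertainment', 'Politics', 'Society', 'Education', 'Finance', 'Home', 'Games', 'Real Estate', 'Fashion', 'Lottery', 'Horoscope']
# sort_index() orders the counts by label id so they pair up with the names above.
for label, cnt in zip(labels, train_df['label'].value_counts().sort_index()):
  print(label, cnt)
Technology 38918
Stocks 36945
Sports 31425
Entertainment 22133
Politics 15016
Society 12232
Education 9985
Finance 8841
Home 7847
Games 5878
Real Estate 4920
Fashion 3131
Lottery 1821
Horoscope 908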
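The counts above are heavily skewed toward the first few classes. As a quick check, the sketch below takes the printed counts as input and computes each class's share and the ratio between the largest and smallest class. The numbers are copied from the output above; the English label names are translations of the original Chinese names.

```python
# Counts copied from the printed output, in the order shown above.
label_counts = {
    "Technology": 38918, "Stocks": 36945, "Sports": 31425,
    "Entertainment": 22133, "Politics": 15016, "Society": 12232,
    "Education": 9985, "Finance": 8841, "Home": 7847, "Games": 5878,
    "Real Estate": 4920, "Fashion": 3131, "Lottery": 1821, "Horoscope": 908,
}

total = sum(label_counts.values())
# Share of the training set taken by each class.
shares = {name: cnt / total for name, cnt in label_counts.items()}
# How many times larger the biggest class is than the smallest one.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

print(total)  # 200000
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
print(f"largest / smallest class: {imbalance_ratio:.1f}")  # 42.9
```

A ratio this large suggests that plain accuracy may hide poor performance on the rare classes, so a per-class metric is worth checking as well.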
s = ' '.join(list(train_df['text']))
counter = Counter(s.split(' '))
# Sort (token, count) pairs by count, most frequent first.
counter = sorted(counter.items(), key=lambda x: x[1], reverse=True)
print('the most frequent word: ', counter[0])
print('the least frequent word: ', counter[-1])
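Joining every text into one string works, but it holds a second full copy of the corpus in memory. A lighter alternative, sketched here on invented toy texts, is to update a Counter one document at a time and then use most_common to rank the tokens. The function name count_tokens is hypothetical.

```python
from collections import Counter


def count_tokens(texts):
    """Count space-separated tokens across an iterable of strings."""
    counter = Counter()
    for text in texts:
        counter.update(text.split(" "))  # add this document's tokens to the tally
    return counter


# Invented toy texts; in the notebook this would be train_df['text'].
toy_texts = ["3750 648 900", "3750 900", "3750"]
token_counts = count_tokens(toy_texts)

print(token_counts.most_common(1))     # [('3750', 3)]
print(token_counts.most_common()[-1])  # ('648', 1)
```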

Reference

[1] Datawhale, Zero-Basics Introduction to NLP Competition - Task 2: Data Reading and Data Analysis