Getting Started with NLP, Task 2: Data Exploration

– There really are no images –

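Exploring unstructured data is different from exploring structured data. With structured data, exploration can surface a lot of useful information; with unstructured data, what it can tell us is limited.

About all we can obtain is how often each character appears, how long each news item is, and so on.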

First, look at the length of the news items

# Number of space-separated tokens in each news item
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
train_df['text_len'].describe()

The results show that the length distribution is heavily skewed, but most news items are around 1,000 characters long.
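To make the skew easier to see without a plot, one option is to compare the mean and the median of the lengths. The sketch below uses only the standard library and a few made-up strings in place of `train_df['text']`; the helper name `length_summary` is hypothetical and not part of the original notebook.

```python
import statistics

def length_summary(texts):
    """Return basic statistics of token counts for space-separated texts."""
    lengths = [len(t.split(' ')) for t in texts]
    return {
        'count': len(lengths),
        'mean': statistics.mean(lengths),
        'median': statistics.median(lengths),
        'min': min(lengths),
        'max': max(lengths),
    }

# Toy stand-in for train_df['text'] (made-up token IDs, not real data)
sample_texts = ['57 44 66', '12 3750 9 648 31', '5 ' * 19 + '5']
summary = length_summary(sample_texts)
print(summary)
```

When a few very long items are present, the mean ends up well above the median, which is the same pattern that `describe()` reveals on the real data.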

Look at the number of news items per category

train_df['label'].value_counts().plot(kind='bar')

The number of news items per category is also imbalanced, and class imbalance will affect the results of model training.
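The imbalance can also be put into numbers rather than read off a bar chart. The following is a minimal sketch, assuming the labels are available as a plain list (for example `train_df['label'].tolist()`); the helper `class_balance` and the toy labels are illustrative only.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class proportions and the largest-to-smallest class ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() orders the classes from most to least frequent
    proportions = {label: n / total for label, n in counts.most_common()}
    ratio = max(counts.values()) / min(counts.values())
    return proportions, ratio

# Toy labels (made up); on the real data use train_df['label'].tolist()
sample_labels = [0, 0, 0, 0, 1, 1, 1, 2]
proportions, ratio = class_balance(sample_labels)
print(proportions, ratio)
```

A ratio far above 1 means the rarest category has very few examples compared with the most common one.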

The most frequent characters

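from collections import Counter

# Join with a space so the last token of one news item is not fused with the first token of the next
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(' '))
# Sort (token, count) pairs by count, most frequent first
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

# Vocabulary size, most frequent token, least frequent token
len(word_count), word_count[0], word_count[-1]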

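The most frequent character is '3750', which appears 7482207 times, followed by '648' with 4924852 occurrences and '900' with 3177505. Because these three characters also appear in a very large share of the news items, it is reasonable to assume that they are punctuation marks. If we treat them as sentence-ending punctuation, their combined count divided by the number of news items gives an average of about 78 sentences per news item.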
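Both claims in this paragraph, the high coverage and the average number of sentences, can be checked with a short loop. The sketch below assumes that '3750', '648' and '900' are the punctuation tokens and that each occurrence ends one sentence; the helper `punctuation_stats` and the toy texts are made up for illustration.

```python
from collections import Counter

# Assumption: these three tokens are punctuation marks
PUNCT = {'3750', '648', '900'}

def punctuation_stats(texts, punct=PUNCT):
    """Return (coverage per token, average punctuation marks per text)."""
    n_docs = len(texts)
    doc_freq = Counter()   # number of texts that contain each token at least once
    total_marks = 0        # total occurrences of any punctuation token
    for text in texts:
        tokens = text.split(' ')
        doc_freq.update(punct.intersection(tokens))
        total_marks += sum(1 for tok in tokens if tok in punct)
    coverage = {tok: doc_freq[tok] / n_docs for tok in sorted(punct)}
    return coverage, total_marks / n_docs

# Toy stand-in for train_df['text'] (made-up token IDs)
sample_texts = ['1 2 3750 4 900', '5 648 6 3750', '7 8 9']
coverage, avg_sentences = punctuation_stats(sample_texts)
print(coverage, avg_sentences)
```

On the real data, a coverage close to 1 for all three tokens would support the punctuation hypothesis, and `avg_sentences` corresponds to the sentences-per-item estimate.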

The most frequent characters in each news category

for n in train_df['label'].unique().tolist():
    data = train_df[train_df['label'] == n]
    # Join with a space so tokens from adjacent news items are not fused
    all_lines = ' '.join(list(data['text']))
    word_count = Counter(all_lines.split(' '))
    word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

    print('News category:', n, word_count[0:10])

News category: 2 [('7399', 351887), ('6122', 343758), ('4939', 337756)]
News category: 11 [('4939', 18591), ('6122', 18432), ('5560', 17933)]
News category: 3 [('6122', 187922), ('4939', 173606), ('4893', 148767)]
News category: 9 [('7328', 46426), ('6122', 43395), ('7399', 37560)]
News category: 10 [('3370', 67775), ('2465', 44969), ('5560', 42447)]
News category: 12 [('4464', 51393), ('3370', 45793), ('2465', 36589)]
News category: 0 [('3370', 503448), ('4464', 306148), ('2465', 294242)]
News category: 7 [('3370', 159142), ('5296', 132054), ('4464', 113117)]
News category: 4 [('4411', 120131), ('7399', 86180), ('4893', 77408)]
News category: 1 [('3370', 626663), ('900', 526300), ('4464', 445289)]
News category: 6 [('6248', 193728), ('2555', 174927), ('5620', 156911)]
News category: 5 [('6122', 159097), ('5598', 136710), ('4893', 130550)]
News category: 8 [('6122', 57267), ('4939', 56147), ('913', 55199)]
News category: 13 [('4939', 9651), ('669', 8923), ('6122', 8321)]
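If the three presumed punctuation tokens dominate every category, it may be more informative to skip them when ranking. The sketch below is one possible way to do that in a single pass over the data; the function `top_tokens_by_label`, its `exclude` parameter and the toy inputs are all hypothetical, and the exclusion set rests on the punctuation assumption made earlier.

```python
from collections import Counter, defaultdict

def top_tokens_by_label(texts, labels, k=3, exclude=frozenset()):
    """Return the k most frequent tokens per label, skipping tokens in `exclude`."""
    counters = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counters[label].update(
            tok for tok in text.split(' ') if tok not in exclude
        )
    return {label: c.most_common(k) for label, c in counters.items()}

# Toy stand-ins for train_df['text'] and train_df['label'] (made-up values)
sample_texts = ['3750 10 10 20', '10 3750 30', '40 40 3750 648']
sample_labels = [0, 0, 1]
result = top_tokens_by_label(
    sample_texts, sample_labels, k=2, exclude={'3750', '648', '900'}
)
print(result)
```

Keeping one `Counter` per label avoids filtering the DataFrame and re-joining the text once per category.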