我们需要开始思考如何将文本集合转化为可量化的东西。最简单的方法是考虑词频。
我将尽量尝试不使用NLTK和Scikits-Learn包。我们首先使用Python讲解一些基本概念。
基本词频
首先,我们回顾一下如何得到每篇文档中的词的个数:一个词频向量。
#examples taken from here: http://stackoverflow.com/a/1750187 mydoclist = ['Julie loves me more than Linda loves me','Jane likes me more than Julie loves me','He likes basketball more than baseball'] #mydoclist = ['sun sky bright', 'sun sun bright'] from collections import Counter for doc in mydoclist: tf = Counter() for word in doc.split(): tf[word] +=1 print tf.items()[('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)][('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)][('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]
这里我们引入了一个新的Python对象,被称作为Counter。该对象只在Python2.7及更高的版本中有效。Counters非常的灵活,利用它们你可以完成这样的功能:在一个循环中进行计数。
根据每篇文档中词的个数,我们进行了文档量化的第一个尝试。但对于那些已经学过向量空间模型中“向量”概念的人来说,第一次尝试量化的结果不能进行比较。这是因为它们不在同一词汇空间中。
我们真正想要的是,每一篇文件的量化结果都有相同的长度,而这里的长度是由我们语料库的词汇总量决定的。
import string #allows for format() def build_lexicon(corpus): lexicon = set() for doc in corpus: lexicon.update([word for word in doc.split()]) return lexicon def tf(term, document): return freq(term, document) def freq(term, document): return document.split().count(term) vocabulary = build_lexicon(mydoclist) doc_term_matrix = []print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'for doc in mydoclist: print 'The doc is "' + doc + '"' tf_vector = [tf(word, doc) for word in vocabulary] tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector) print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string) doc_term_matrix.append(tf_vector) # here's a test: why did I wrap mydoclist.index(doc)+1 in parens? it returns an int... # try it! type(mydoclist.index(doc) + 1) print 'All combined, here is our master document term matrix: 'print doc_term_matrix
我们的词向量为[me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
文档”Julie loves me more than Linda loves me”的词频向量为:[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
文档”Jane likes me more than Julie loves me”的词频向量为:[2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
新闻热点
疑难解答