【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

一、前述

Python上著名的⾃然语⾔处理库⾃带语料库，词性分类库⾃带分类，分词，等等功能强⼤的社区⽀持，还有N多的简单版wrapper。

二、文本预处理

1、安装nltk

pip install -U nltk

安装语料库 (一堆对话，一对模型)

import nltk nltk.download()

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

2、功能一览表：

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

3、文本处理流程

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

4、Tokenize 把长句⼦拆成有“意义”的⼩部件

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

import jieba seg_list = jieba.cut("我来到北北京清华⼤大学", cut_all=True) print "Full Mode:", "/ ".join(seg_list) # 全模式
seg_list = jieba.cut("我来到北北京清华⼤大学", cut_all=False) print "Default Mode:", "/ ".join(seg_list) # 精确模式
seg_list = jieba.cut("他来到了了⽹网易易杭研⼤大厦") # 默认是精确模式
print ", ".join(seg_list) seg_list = jieba.cut_for_search("⼩小明硕⼠士毕业于中国科学院计算所，后在⽇日本京都⼤大学深造") # 搜索引擎模式
print ", ".join(seg_list)

结果：

【全模式】: 我/ 来到/ 北北京/ 清华/ 清华⼤大学/ 华⼤大/ ⼤大学 【精确模式】: 我/ 来到/ 北北京/ 清华⼤大学 【新词识别】：他, 来到, 了了, ⽹网易易, 杭研, ⼤大厦 (此处，“杭研”并没有在词典中，但是也被Viterbi算法识别出来了了) 【搜索引擎模式】： ⼩小明, 硕⼠士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, ⽇日本, 京都, ⼤大学, ⽇日本京都⼤大学, 深造

社交⽹络语⾔的tokenize:

import re emoticons_str = r""" (?: [:=;] # 眼睛 [oO\-]? # ⿐鼻⼦子 [D\)\]\(\]/\\OpP] # 嘴 )""" regex_str = [ emoticons_str, r'<[^>]+>', # HTML tags
r'(?:@[\w_]+)', # @某⼈人
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # 话题标签
r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&amp;+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
r'(?:(?:\d+,?)+(?:\.?\d+)?)', # 数字
r"(?:[a-z][a-z'\-_]+[a-z])", # 含有 - 和 ‘ 的单词
r'(?:[\w_]+)', # 其他
r'(?:\S)' # 其他
]

正则表达式对照表
http://www.regexlab.com/zh/regref.htm

这样能处理社交语言中的表情等符号：

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE) emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE) def tokenize(s): return tokens_re.findall(s) def preprocess(s, lowercase=False): tokens = tokenize(s) if lowercase: tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens] return tokens tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
print(preprocess(tweet)) # ['RT', '@angelababy', ':', 'love', 'you', 'baby', # ’!', ':D', 'http://ah.love', '#168cm']

5、词形归⼀化

Stemming 词⼲提取：⼀般来说，就是把不影响词性的inflection的⼩尾巴砍掉
walking 砍ing = walk
walked 砍ed = walk
Lemmatization 词形归⼀：把各种类型的词的变形，都归为⼀个形式
went 归⼀ = go
are 归⼀ = be

>>> from nltk.stem.porter import PorterStemmer >>> porter_stemmer = PorterStemmer() >>> porter_stemmer.stem(‘maximum’) u’maximum’ >>> porter_stemmer.stem(‘presumably’) u’presum’ >>> porter_stemmer.stem(‘multiply’) u’multipli’ >>> porter_stemmer.stem(‘provision’) u’provis’ >>> from nltk.stem import SnowballStemmer >>> snowball_stemmer = SnowballStemmer(“english”) >>> snowball_stemmer.stem(‘maximum’) u’maximum’ >>> snowball_stemmer.stem(‘presumably’) u’presum’ >>> from nltk.stem.lancaster import LancasterStemmer >>> lancaster_stemmer = LancasterStemmer() >>> lancaster_stemmer.stem(‘maximum’) ‘maxim’ >>> lancaster_stemmer.stem(‘presumably’) ‘presum’ >>> lancaster_stemmer.stem(‘presumably’) ‘presum’ >>> from nltk.stem.porter import PorterStemmer >>> p = PorterStemmer() >>> p.stem('went') 'went'
>>> p.stem('wenting') 'went'

6、词性Part-Of-Speech

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

>>> import nltk >>> text = nltk.word_tokenize('what does the fox say') >>> text ['what', 'does', 'the', 'fox', 'say'] >>> nltk.pos_tag(text) [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

7、Stopwords

⾸先记得在console⾥⾯下载⼀下词库
或者 nltk.download(‘stopwords’)

from nltk.corpus import stopwords # 先token⼀一把，得到⼀一个word_list # ... # 然后filter⼀一把
filtered_words = [word for word in word_list if word not in stopwords.words('english')]

8、⼀条⽂本预处理流⽔线

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

三、自然语言处理应用。

实际上预处理就是将文本转换为Word_List，自然语言处理再转变成计算机能识别的语言。

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

自然语言处理有以下几个应用：情感分析，⽂本相似度，⽂本分类

1、情感分析

最简单的 sentiment dictionary,类似于关键词打分机制.

like 1
good 2
bad -2
terrible -3

sentiment_dictionary = {} for line in open('data/AFINN-111.txt') word, score = line.split('\t') sentiment_dictionary[word] = int(score) # 把这个打分表记录在⼀一个Dict上以后 # 跑⼀一遍整个句句⼦子，把对应的值相加
total_score = sum(sentiment_dictionary.get(word, 0) for word in words) # 有值就是Dict中的值，没有就是0 # 于是你就得到了了⼀一个 sentiment score

显然这个⽅法太Naive,新词怎么办？特殊词汇怎么办？更深层次的玩意⼉怎么办？

加上ML情感分析

from nltk.classify import NaiveBayesClassifier # 随⼿手造点训练集
s1 = 'this is a good book' s2 = 'this is a awesome book' s3 = 'this is a bad book' s4 = 'this is a terrible book'
def preprocess(s): # Func: 句句⼦子处理理 # 这⾥里里简单的⽤用了了split(), 把句句⼦子中每个单词分开 # 显然 还有更更多的processing method可以⽤用
return {word: True for word in s.lower().split()} # return⻓长这样: # {'this': True, 'is':True, 'a':True, 'good':True, 'book':True} # 其中, 前⼀一个叫fname, 对应每个出现的⽂文本单词; # 后⼀一个叫fval, 指的是每个⽂文本单词对应的值。 # 这⾥里里我们⽤用最简单的True,来表示,这个词『出现在当前的句句⼦子中』的意义。 # 当然啦, 我们以后可以升级这个⽅方程, 让它带有更更加⽜牛逼的fval, ⽐比如 word2vec

# 把训练集给做成标准形式
training_data = [[preprocess(s1), 'pos'], [preprocess(s2), 'pos'], [preprocess(s3), 'neg'], [preprocess(s4), 'neg']] # 喂给model吃
model = NaiveBayesClassifier.train(training_data) # 打出结果
print(model.classify(preprocess('this is a good book')))

2、文本相似度

⽤元素频率表⽰⽂本特征，常见的做法

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

然后用余弦定理来计算文本相似度：

Frequency 频率统计：

import nltk from nltk import FreqDist # 做个词库先
corpus = 'this is my sentence ' \ 'this is my life ' \ 'this is the day'
# 随便便tokenize⼀一下 # 显然, 正如上⽂文提到, # 这⾥里里可以根据需要做任何的preprocessing: # stopwords, lemma, stemming, etc.
tokens = nltk.word_tokenize(corpus) print(tokens) # 得到token好的word list # ['this', 'is', 'my', 'sentence', # 'this', 'is', 'my', 'life', 'this', # 'is', 'the', 'day'] # 借⽤用NLTK的FreqDist统计⼀一下⽂文字出现的频率
fdist = FreqDist(tokens) # 它就类似于⼀一个Dict # 带上某个单词, 可以看到它在整个⽂文章中出现的次数
print(fdist['is']) # 3

# 好, 此刻, 我们可以把最常⽤用的50个单词拿出来
standard_freq_vector = fdist.most_common(50) size = len(standard_freq_vector) print(standard_freq_vector) # [('is', 3), ('this', 3), ('my', 2), # ('the', 1), ('d

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

3、文本分类

TF: Term Frequency, 衡量⼀个term在⽂档中出现得有多频繁。
TF(t) = (t出现在⽂档中的次数) / (⽂档中的term总数).
IDF: Inverse Document Frequency, 衡量⼀个term有多重要。
有些词出现的很多，但是明显不是很有卵⽤。⽐如’is'，’the‘，’and‘之类
的。
为了平衡，我们把罕见的词的重要性（weight）搞⾼，
把常见词的重要性搞低。
IDF(t) = log_e(⽂档总数 / 含有t的⽂档总数).
TF-IDF = TF * IDF

举个栗⼦???? :
⼀个⽂档有100个单词，其中单词baby出现了3次。
那么，TF(baby) = (3/100) = 0.03.
好，现在我们如果有10M的⽂档， baby出现在其中的1000个⽂档中。
那么，IDF(baby) = log(10,000,000 / 1,000) = 4
所以， TF-IDF(baby) = TF(baby) * IDF(baby) = 0.03 * 4 = 0.12

from nltk.text import TextCollection # ⾸首先, 把所有的⽂文档放到TextCollection类中。 # 这个类会⾃自动帮你断句句, 做统计, 做计算
corpus = TextCollection(['this is sentence one', 'this is sentence two', 'this is sentence three']) # 直接就能算出tfidf # (term: ⼀一句句话中的某个term, text: 这句句话)
print(corpus.tf_idf('this', 'this is sentence four')) # 0.444342 # 同理理, 怎么得到⼀一个标准⼤大⼩小的vector来表示所有的句句⼦子? # 对于每个新句句⼦子
new_sentence = 'this is sentence five'
# 遍历⼀一遍所有的vocabulary中的词:
for word in standard_vocab: print(corpus.tf_idf(word, new_sentence)) # 我们会得到⼀一个巨⻓长(=所有vocab⻓长度)的向量量

目前几种表达句子的方式：词频，TF-IDF。

秒客网

【自然语言处理篇】--以NLTK为基础讲解自然语⾔处理的原理和基础知识

相关文章