Python NLTK: An Introduction to Natural Language Processing, with Examples

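I came across this English-language blog post and found it very good: the flow is clear, the explanations are simple, and the steps actually work. The article is here:

https://likegeeks.com/nlp-tutorial-using-python-nltk/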

Things to watch out for:

1. In the web scraping step, on my machine (Python 3.5) I had to change the parser passed to BeautifulSoup:

# The original tutorial used html5lib as the parser.
soup = BeautifulSoup(html, "lxml")

The complete code for fetching the text is as follows:

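from bs4 import BeautifulSoup
import urllib.request

# Fetch the page; read() returns the raw bytes of the response body
response = urllib.request.urlopen('http://php.net/')
print(type(response))
html = response.read()
# print(html)

# The original tutorial passes "html5lib" here; "lxml" is what worked for me
soup = BeautifulSoup(html, "lxml")
print(soup)

# Extract the text, stripping whitespace around each piece
text = soup.get_text(strip=True)
print(text)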
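If the parser is the only thing that differs between machines, one option is to try several parsers and keep the first one that is installed. The following is a minimal sketch, not something from the tutorial: the helper name make_soup and the order of PARSER_CANDIDATES are my own assumptions, and the sample HTML is made up so the snippet runs without network access. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parsers to try, in order of preference (an assumed order, adjust as needed).
# "html.parser" ships with Python, so the last attempt needs no extra package.
PARSER_CANDIDATES = ("lxml", "html5lib", "html.parser")


def make_soup(html, parsers=PARSER_CANDIDATES):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Made-up sample document, used instead of a live page.
sample = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>"
soup, used = make_soup(sample)
print("parser used:", used)

# With strip=True and no separator, adjacent pieces of text run together.
print(soup.get_text(strip=True))
# Passing a separator keeps words from neighbouring tags apart.
print(soup.get_text(separator=" ", strip=True))
```

Note the second get_text call: when the text is going to be tokenized afterwards, a space separator is usually the safer choice, since otherwise the last word of one tag is glued to the first word of the next.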

2. The article mentions the supported languages:

Stemming non-English Words

SnowballStemmer can stem 13 languages besides the English language.

The supported languages are:

Let's check which languages are actually supported:

from nltk.stem import SnowballStemmer
 
print(SnowballStemmer.languages)

('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

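(Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, "porter", Portuguese, Romanian, Russian, Spanish, Swedish. Note that "porter" is not a language: it selects the original Porter stemmer for English.)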
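To see the stemmer used with a language chosen by name, here is a small sketch. The wrapper stem_words is a hypothetical helper of mine, not part of NLTK or the tutorial; it only uses SnowballStemmer.languages and SnowballStemmer(...).stem(...), and it needs no NLTK data downloads.

```python
from nltk.stem import SnowballStemmer


def stem_words(words, language):
    """Stem each word with the Snowball stemmer for the given language."""
    if language not in SnowballStemmer.languages:
        # Fail early with a readable message for unsupported languages.
        raise ValueError("no Snowball stemmer for language: " + language)
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(word) for word in words]


print(stem_words(["running", "runs"], "english"))
print(stem_words(["laufen", "gelaufen"], "german"))

# Chinese is not in the list, so this raises ValueError.
try:
    stem_words(["你好"], "chinese")
except ValueError as err:
    print(err)
```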

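Chinese is not on that list, so processing Chinese text requires other tools. Here are a few I would recommend, for reference only:

Jieba: word segmentation, part-of-speech tagging, and TextRank keyword extraction

HanLP: word segmentation, named entity recognition, and dependency parsing. FudanNLP and NLPIR are other options.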

I also happened to find that another blogger has posted a Chinese translation of most of that tutorial.
