Handling stop words (stopwords) in Python


What are stop words

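The process of converting data into something a computer can understand is called preprocessing. One of the main forms of preprocessing is filtering out useless data. In natural language processing, such useless words (data) are called stop words.
Stop words are common words (for example "the", "a", "an", "in") that search engines are programmed to ignore.
We do not want these words to take up space in our database or consume valuable processing time. They are easy to remove by keeping a list of the words to be treated as stop words. NLTK (the Natural Language Toolkit) for Python ships stop-word lists for 16 different languages (the exact count depends on the version of the corpus you have installed). They are stored in the nltk_data directory, for example /home/pratima/nltk_data/corpora/stopwords (remember to replace the home directory name with your own).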
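
Each list in that directory is a plain-text file with one word per line, so it can also be read without going through the NLTK API. The following is a minimal sketch under that assumption; `load_stop_words` is a hypothetical helper name, and the path is only an example that will differ between machines.

```python
from pathlib import Path


def load_stop_words(path):
    """Read a stop-word file that has one word per line into a set."""
    text = Path(path).read_text(encoding="utf-8")
    # Skip blank lines and strip surrounding whitespace from each entry.
    return {line.strip() for line in text.splitlines() if line.strip()}


if __name__ == "__main__":
    # Assumed location; adjust to wherever nltk_data lives on your machine.
    english_file = Path.home() / "nltk_data" / "corpora" / "stopwords" / "english"
    if english_file.exists():
        words = load_stop_words(english_file)
        print(len(words), sorted(words)[:5])
    else:
        print("Stop-word file not found at", english_file)
```

Returning a set rather than a list keeps later membership checks fast, which matters when every token of a large text is tested against it.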

Removing stop words from a piece of text

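from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).

example_sent = "This is a sample sentence, showing off the stop words filtration."

# A set gives constant-time membership tests.
stop_words = set(stopwords.words('english'))

# Split the sentence into word and punctuation tokens.
word_tokens = word_tokenize(example_sent)

# The comparison is case-sensitive, so the capitalized "This" is kept.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)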

Output:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 
'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop',
'words', 'filtration', '.']
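
Note that "This" survives in the filtered list because the NLTK English list is lowercase and the membership test is case-sensitive; punctuation tokens are kept as well. A minimal sketch of a case-insensitive variant is shown here. `remove_stop_words` and its `drop_punctuation` flag are hypothetical names, and `demo_stop_words` is a small hand-picked set used only so the example runs without NLTK data.

```python
import string


def remove_stop_words(tokens, stop_words, drop_punctuation=False):
    """Return the tokens that are not stop words, comparing in lowercase."""
    stop_words = {w.lower() for w in stop_words}
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        # Optionally drop tokens made up only of punctuation characters.
        if drop_punctuation and token and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'filtration', '.']
# Illustration only; in practice pass set(stopwords.words('english')).
demo_stop_words = {'this', 'is', 'a', 'off', 'the'}

print(remove_stop_words(tokens, demo_stop_words))
# ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
print(remove_stop_words(tokens, demo_stop_words, drop_punctuation=True))
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```

Lowercasing only for the comparison keeps the original casing of the tokens that remain, which is usually what later processing steps expect.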
