Classifying with Probability Theory: Naive Bayes — Document Classification with Naive Bayes

Date: 2021-04-22 05:41:52

Preface

The k-nearest neighbors and decision tree algorithms discussed earlier both give definite classification results. The algorithm discussed today cannot fully determine which class a data instance belongs to; it can only give the probability that the instance belongs to a given class.

Yingying's note: the kind of problem Naive Bayes solves is like the probability of rain today, where you decide whether to carry an umbrella based on that probability.
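As a toy version of that umbrella decision (all numbers below are made up for illustration), Bayes' rule turns a prior P(rain) and the likelihood of what we observe into the posterior probability we actually act on:

# P(rain | cloudy) = P(cloudy | rain) * P(rain) / P(cloudy)
p_rain = 0.3                  # made-up prior probability of rain
p_cloudy_given_rain = 0.8     # made-up likelihood of clouds when it rains
p_cloudy_given_dry = 0.2      # made-up likelihood of clouds when it stays dry

p_cloudy = p_cloudy_given_rain * p_rain + p_cloudy_given_dry * (1 - p_rain)
print(p_cloudy_given_rain * p_rain / p_cloudy)    # 0.632 -> take the umbrella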

Note: starting with this chapter, complete code will no longer be provided, only the code blocks for each algorithm.

 

Requirements

Take the major social media platforms as an example: they routinely block certain key words. We want to build a fast filter that flags a message as inappropriate whenever it uses negative or abusive language.

 

Steps

1. Prepare the data

from numpy import *    # needed later for zeros(), ones(), log(), array()

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])    # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)    # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec
The function loadDataSet() creates some experimental samples. postingList is a collection of posts split into word lists, and classVec is the corresponding list of class labels.

The function createVocabList(dataSet) builds the vocabulary: a list of the unique words that appear across all the documents.
The function setOfWords2Vec(vocabList, inputSet) first creates a vector the same length as the vocabulary with every element set to 0, then walks through the words in the document and, for each word found in the vocabulary, sets the corresponding value in the output vector to 1.

Open the IDE and get more familiar with the three functions we just wrote:
>>> import bayes
>>> listOPosts,listClasses = bayes.loadDataSet()
>>> listOPosts
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
>>> listClasses
[0, 1, 0, 1, 0, 1]
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

Note that the vocabulary now contains no duplicate words.
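The deduplication is just Python's set union at work inside createVocabList(); a quick sanity check with a toy example (made up here):

a = set(['my', 'dog', 'has'])
b = set(['dog', 'park'])
print(sorted(a | b))    # ['dog', 'has', 'my', 'park'] -- 'dog' kept only once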

>>> bayes.setOfWords2Vec(myVocabList,listOPosts[0])
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]

Lining the vector up against the vocabulary shows a 1 at the position of every word of the first post:

>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park',
'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying',
'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog',
'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
>>> listOPosts[0]
['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']

2. Train the algorithm
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)                        # 6 documents
    numWords = len(trainMatrix[0])                         # 32 vocabulary words
    pAbusive = sum(trainCategory)/float(numTrainDocs)      # 3/6.0
    p0Num = zeros(numWords); p1Num = zeros(numWords)       # change to ones()
    p0Denom = 0.0; p1Denom = 0.0                           # change to 2.0
    for i in range(numTrainDocs):                          # i = 0, 1, 2, 3, 4, 5
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom                                 # change to log()
    p0Vect = p0Num/p0Denom                                 # change to log()
    return p0Vect, p1Vect, pAbusive

The inputs to trainNB0() are trainMat, the list of word vectors built from the posts, and trainCategory, the label list [0, 1, 0, 1, 0, 1]:

>>> trainMat = []
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))
...
>>> trainMat
[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
 [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]
>>> p0v,p1v,pab=bayes.trainNB0(trainMat,listClasses)

>>> p0v
array([ 0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.08333333,  0.        ,  0.        ,  0.04166667,  0.        ,
        0.04166667,  0.04166667,  0.        ,  0.04166667,  0.04166667,
        0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.125     ])
>>> p1v
array([ 0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.05263158,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.15789474,  0.        ,  0.05263158,  0.        ,
        0.        ,  0.        ])

pab = 0.5 means the probability that a document is abusive is 0.5: of the 6 posts we fed in, 3 are abusive, so the abusive-post probability is 3/6 = 0.5.

 

Yingying's note: the data processing above can be viewed as taking

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

and splitting it into two classes according to the pre-assigned labels [0, 1, 0, 1, 0, 1].

The first class, label 0:

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him']]

For each vocabulary word, count how many times it appears in these posts, then divide by the total number of words in this class, 24. (To picture the counting: every time you see a word, look it up in the vocabulary and add a tally mark next to it; the tally grows as more occurrences turn up.)

array([ 1.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
        0.,  0.,  2.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,
        0.,  1.,  0.,  1.,  1.,  3.])

 

Similarly, for the abusive posts labeled 1:

[['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

looking each word up in the vocabulary gives

array([ 0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        1.,  1.,  1.,  1.,  1.,  0.,  2.,  0.,  1.,  1.,  0.,  2.,  0.,
        3.,  0.,  1.,  0.,  0.,  0.])

and each count is again divided by the total number of words in this class, 19.

Viewed this way, the idea becomes much clearer.
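To check this reading of trainNB0(), the class-0 numbers can be redone by hand with numpy (a sketch assuming trainMat and listClasses from the interpreter session above are still in scope):

from numpy import array

trainArr = array(trainMat)
labels = array(listClasses)
p0Num = trainArr[labels == 0].sum(axis=0)    # per-word counts over the three label-0 posts
p0Denom = trainArr[labels == 0].sum()        # 7 + 8 + 9 = 24 words in total
print(p0Num / float(p0Denom))                # reproduces p0v from trainNB0()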

Two adjustments make this work better in practice. First, if any single conditional probability is 0, the whole product of probabilities becomes 0, so we initialize every word count to 1 and each denominator to 2. Second, multiplying many small probabilities underflows to 0 in floating point, so for convenience of computation we use log(p) instead:

p0Num = ones(numWords); p1Num = ones(numWords)    # change to ones()
p0Denom = 2.0; p1Denom = 2.0                      # change to 2.0
p1Vect = log(p1Num/p1Denom)                       # change to log()
p0Vect = log(p0Num/p0Denom)                       # change to log()
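To see the underflow concretely (my own sketch, not from the book): the raw product of a few hundred small probabilities collapses to 0.0 in floating point, while the equivalent sum of logs stays usable, and since log is monotonically increasing, comparing log-scores picks the same winner as comparing the raw products.

from math import log

probs = [0.05] * 300                  # 300 small conditional probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                        # 0.0 -- underflows to zero

print(sum(log(p) for p in probs))     # about -898.7 -- still perfectly comparable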

 

The Naive Bayes classification function

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
>>> reload(bayes)
<module 'bayes' from 'D:\Python27\bayes.pyc'>
>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1
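What classifyNB() compares is the log posterior of each class: because the document vector holds only 0s and 1s, the element-wise product with a log-probability vector followed by sum() picks out exactly sum(log P(wi|c)) over the words that are present, and adding log(pClass1) gives log P(c) + sum(log P(wi|c)). A hand-run of the 'stupid garbage' case (a sketch reusing myVocabList, trainMat and listClasses from the session above, with the modified trainNB0()):

from numpy import array, log

p0V, p1V, pAb = bayes.trainNB0(array(trainMat), array(listClasses))
thisDoc = array(bayes.setOfWords2Vec(myVocabList, ['stupid', 'garbage']))

p1 = sum(thisDoc * p1V) + log(pAb)          # log P(abusive) + sum of log P(wi|abusive)
p0 = sum(thisDoc * p0V) + log(1.0 - pAb)    # same for the non-abusive class
print("p1 = %.3f, p0 = %.3f" % (p1, p0))    # p1 > p0, so the post is classified as 1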

 

The bag-of-words document model

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

This is almost identical to setOfWords2Vec(); the only difference is that every time a word is encountered, the corresponding value in the vector is incremented, rather than just being set to 1.
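A quick check of the difference, using a made-up post in which 'stupid' appears twice:

vocab = ['dog', 'stupid', 'garbage']      # a tiny vocabulary just for illustration
post = ['stupid', 'dog', 'stupid']

print(setOfWords2Vec(vocab, post))        # [1, 1, 0] -- presence/absence only
print(bagOfWords2VecMN(vocab, post))      # [1, 2, 0] -- 'stupid' counted twice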