Machine Learning Notes: Decision Trees

Date: 2023-02-13 07:31:47

Example: predicting whether a customer will buy a computer.
The data is stored in the file data.csv, with the following contents:

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

Decision tree implementation
1. Import the required modules:

## scikit-learn estimators require numeric input and cannot consume categorical strings directly, so the input data has to be converted first
from sklearn.feature_extraction import DictVectorizer
## The csv module is used to read the CSV file
import csv
from sklearn import preprocessing
from sklearn import tree

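2. Read the data from the CSV file

## Open data.csv; allElectronicsData is the resulting file object
allElectronicsData = open(r'data.csv')
## csv.reader iterates over the contents of allElectronicsData row by row
reader = csv.reader(allElectronicsData)
## Read the first row, i.e. the column titles
headers = next(reader)
print(headers)

Output: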

['RID', 'age', 'income', 'student', 'credit_rating', 'Class: buys_computer']

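3. Data preprocessing:
scikit-learn requires both the input feature values (attributes) and the output classes to be numeric. They cannot be categorical values such as the high, medium and low of the income attribute.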

featureList = []
labelList = []

for row in reader:
    ## The last column is the class label
    labelList.append(row[-1])
    ## Columns 1 to n-2 are the features; column 0 (RID) is skipped
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)

## Each dictionary in the list corresponds to one row of the original data (featureList[0] is row 1)
print(featureList)
print(type(featureList[0]))

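Output:
Each dictionary in the generated list corresponds to one row of the original data, e.g. {'credit_rating': 'fair', 'age': 'youth', 'student': 'no', 'income': 'high'}. The output below was produced under Python 2; Python 3 prints <class 'dict'> for the last line.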

[{'credit_rating': 'fair', 'age': 'youth', 'student': 'no', 'income': 'high'}, 
{'credit_rating': 'excellent', 'age': 'youth', 'student': 'no', 'income': 'high'},
{'credit_rating': 'fair', 'age': 'middle_aged', 'student': 'no', 'income': 'high'},
{'credit_rating': 'fair', 'age': 'senior', 'student': 'no', 'income': 'medium'},
{'credit_rating': 'fair', 'age': 'senior', 'student': 'yes', 'income': 'low'},
{'credit_rating': 'excellent', 'age': 'senior', 'student': 'yes', 'income': 'low'},
{'credit_rating': 'excellent', 'age': 'middle_aged', 'student': 'yes', 'income': 'low'},
{'credit_rating': 'fair', 'age': 'youth', 'student': 'no', 'income': 'medium'},
{'credit_rating': 'fair', 'age': 'youth', 'student': 'yes', 'income': 'low'},
{'credit_rating': 'fair', 'age': 'senior', 'student': 'yes', 'income': 'medium'},
{'credit_rating': 'excellent', 'age': 'youth', 'student': 'yes', 'income': 'medium'},
{'credit_rating': 'excellent', 'age': 'middle_aged', 'student': 'no', 'income': 'medium'},
{'credit_rating': 'fair', 'age': 'middle_aged', 'student': 'yes', 'income': 'high'},
{'credit_rating': 'excellent', 'age': 'senior', 'student': 'no', 'income': 'medium'}]
<type 'dict'>
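
The same row dictionaries can also be built with csv.DictReader, which keys every row by the header line. The following is a minimal sketch rather than the code used in these notes: it reads a two-row in-memory sample (an assumption, so that the snippet runs without data.csv), and the helper name load_rows is hypothetical.

```python
import csv
import io

# Two sample rows in comma-separated form; the real data.csv is assumed
# to start with the same header line.
SAMPLE = """RID,age,income,student,credit_rating,Class: buys_computer
1,youth,high,no,fair,no
3,middle_aged,high,no,fair,yes
"""

def load_rows(fileobj, id_column="RID", label_column="Class: buys_computer"):
    """Return (list of feature dicts, list of labels) from a CSV file object."""
    features, labels = [], []
    for row in csv.DictReader(fileobj):
        row = dict(row)                       # work on a plain dict
        row.pop(id_column)                    # the record ID is not a feature
        labels.append(row.pop(label_column))  # separate the class label
        features.append(row)
    return features, labels

featureList, labelList = load_rows(io.StringIO(SAMPLE))
print(featureList[0])
print(labelList)
```

Passing an open file instead of the io.StringIO object reads the real data in the same way.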

Encode the data, converting the string values into 0/1 indicators:
for example, within age, youth is encoded as 001, middle_aged as 100 and senior as 010.

vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
print("dummyX:\n" + str(dummyX))
## get_feature_names() was removed in scikit-learn 1.2; on newer versions use vec.get_feature_names_out()
print(vec.get_feature_names())
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY:" + str(dummyY))

Output:

dummyX:
[[ 0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
[ 0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
[ 1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
[ 0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
[ 0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
[ 0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
[ 1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
[ 0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
[ 0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
[ 0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
[ 0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]
[ 1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
[ 1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
[ 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]

['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
dummyY:
[[0]
[0]
[1]
[1]
[1]
[0]
[1]
[0]
[1]
[1]
[1]
[1]
[1]
[0]]
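
To make the encoding concrete, the sketch below reproduces by hand the pattern visible in the output above: one 0/1 column per feature=value pair, with columns in sorted order, and class labels mapped to 0/1 in sorted order. It is a plain-Python illustration on three of the rows, not how DictVectorizer or LabelBinarizer are implemented, and the function names are hypothetical.

```python
rows = [
    {'age': 'youth', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'},
    {'age': 'middle_aged', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'},
    {'age': 'senior', 'income': 'medium', 'student': 'no', 'credit_rating': 'fair'},
]
labels = ['no', 'yes', 'yes']

def one_hot_columns(rows):
    """Collect every 'feature=value' pair seen in the data, sorted by name."""
    return sorted({"%s=%s" % (k, v) for row in rows for k, v in row.items()})

def one_hot_encode(row, columns):
    """Return a 0/1 vector with a 1 wherever the row has that feature value."""
    present = {"%s=%s" % (k, v) for k, v in row.items()}
    return [1.0 if col in present else 0.0 for col in columns]

def binarize(labels):
    """Map class labels to their index in the sorted list of classes."""
    classes = sorted(set(labels))
    return classes, [classes.index(label) for label in labels]

columns = one_hot_columns(rows)
X = [one_hot_encode(row, columns) for row in rows]
classes, y = binarize(labels)
print(columns)
print(X)
print(classes, y)
```

Because only three rows are used here, the sketch yields seven columns instead of the ten obtained from the full data set.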

Use a decision tree as the classifier

clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf:" + str(clf))

Output:

clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
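
Setting criterion='entropy' means that splits are scored by information gain, i.e. by how much they reduce the entropy of the class labels. As a rough illustration, the sketch below computes the entropy of the 14 labels and the gain of a multiway split on each original attribute, in the style of ID3. This is a simplification and an assumption on my part: scikit-learn builds binary splits on the one-hot columns, so the values it evaluates internally differ. The helper names are hypothetical.

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer), copied from the table
DATA = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(data, index):
    """Entropy reduction from splitting on every value of column `index`."""
    groups = {}
    for row in data:
        groups.setdefault(row[index], []).append(row[-1])
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in data]) - remainder

base = entropy([row[-1] for row in DATA])
gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
print("entropy of labels:", round(base, 3))
for name, gain in gains.items():
    print(name, round(gain, 3))
```

On this data, age has the largest gain of the four attributes, which is why an ID3-style tree would split on age first.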

Write the resulting decision tree to a dot file:

with open("allElectronicsData.dot", "w") as f:
    tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)

After installing Graphviz, open allElectronicsData.dot with GVEdit to view the generated decision tree.

Predicting on new data:
Take the first row of X

oneRowX = dummyX[0, :]
print("oneRowX:" + str(oneRowX))

The first row of X is:

oneRowX:[ 0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]

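Set the first element of this row to 1 and the third to 0, which changes age from youth to middle_aged:

## Work on a copy; plain assignment would make newRowX another name for oneRowX and modify it too
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print("newRowX:" + str(newRowX))

The newly constructed row is:

newRowX:[ 1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]

Predict the class of the new row:

predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY:" + str(predictedY))

Output: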

predictedY:[1]

That is, the model predicts that this customer buys a computer.
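
Editing individual positions of a one-hot vector by hand is easy to get wrong. As a sketch, the snippet below builds the same vector from a readable dictionary, using the column names printed earlier, and maps a predicted 0/1 back to its label. It is plain Python for illustration; with scikit-learn, the fitted vec and lb objects provide transform and inverse_transform for this purpose. The helper names encode and decode are hypothetical.

```python
# Column order as printed by vec.get_feature_names() earlier in these notes
FEATURE_NAMES = ['age=middle_aged', 'age=senior', 'age=youth',
                 'credit_rating=excellent', 'credit_rating=fair',
                 'income=high', 'income=low', 'income=medium',
                 'student=no', 'student=yes']
CLASSES = ['no', 'yes']  # sorted class labels: 0 -> 'no', 1 -> 'yes'

def encode(sample):
    """Turn a {feature: value} dict into a 0/1 vector in FEATURE_NAMES order."""
    present = {"%s=%s" % (k, v) for k, v in sample.items()}
    unknown = present - set(FEATURE_NAMES)
    if unknown:
        # Raising here is a choice made for this sketch
        raise ValueError("unknown feature values: %s" % sorted(unknown))
    return [1.0 if name in present else 0.0 for name in FEATURE_NAMES]

def decode(predicted):
    """Map predicted 0/1 values back to the class labels."""
    return [CLASSES[int(p)] for p in predicted]

new_customer = {'age': 'middle_aged', 'income': 'high',
                'student': 'no', 'credit_rating': 'fair'}
newRowX = encode(new_customer)
print(newRowX)
print(decode([1]))
```

The encoded vector matches the newRowX constructed by hand above, and a prediction of 1 decodes to 'yes'.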

