Loading test data with sklearn.datasets

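Dataset 1: Boston house prices (suited to regression), referred to as boston from here on
A single line of code loads it. Note that load_boston was removed in scikit-learn 1.2, so this call needs an older release.
import sklearn.datasets
boston = sklearn.datasets.load_boston()
To see the full description of the data, run:
print(boston.DESCR)
The output is: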
Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
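There are 506 samples with 13 features each.
For example, the first feature is the crime rate, the sixth is the average number of rooms per dwelling, and so on.
boston.data holds the 506 x 13 feature matrix
boston.target holds the corresponding 506 prices (a one-dimensional array of length 506)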

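Dataset 2: Iris flowers (suited to simple classification), referred to as iris
import sklearn.datasets
iris = sklearn.datasets.load_iris()
iris.data holds the features
iris.target holds the corresponding class labels
The description in iris.DESCR reads, in part: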
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
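Practically every ML beginner knows this dataset. It covers three species of iris, and the features and class labels are accessed the same way as above.
There are 150 samples in 3 classes, with 4 features each.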
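To check the figures above (150 samples, 4 features, 3 classes of 50), here is a minimal sketch. The variable names X, y and class_counts are my own choices for illustration, not part of the original post.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # feature matrix and integer class labels

print(X.shape)             # number of samples and number of features
print(y.shape)             # one label per sample
print(iris.feature_names)  # names of the four measurements
print(iris.target_names)   # names of the three classes

# Count how many samples fall into each class (labels are 0, 1, 2).
class_counts = np.bincount(y)
print(class_counts)
```

The labels in iris.target are integers; iris.target_names maps them back to the species names.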

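Dataset 3: diabetes (a regression problem), referred to as diabetes
This dataset is odd in that it ships without a description. I also looked on the original UCI website and found no good description there either.
import sklearn.datasets
diabetes = sklearn.datasets.load_diabetes()
print(diabetes.keys())
The output here lists only data and target.
Looking at the values, the data appears to have gone through extra normalization, so the original form of the measurements is no longer visible.
The table below is the limited description I copied from the website: 442 samples, 10 features, and every feature value is a continuous real number between -0.2 and 0.2. I guessed the integer target might be blood sugar; the scikit-learn documentation describes it as a quantitative measure of disease progression one year after baseline.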
Samples total: 442
Dimensionality: 10
Features: real, -0.2 < x < 0.2
Targets: integer, 25 - 346
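The summary above can be checked directly against the loaded arrays. The sketch below also probes the normalization: as far as I know, the scikit-learn documentation states that each feature column is mean-centered and scaled so its sum of squares is 1, which would explain why the raw measurements are unrecognizable. It assumes the default behaviour of load_diabetes; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

print(X.shape, y.shape)  # samples x features, and one target per sample

# Range of the feature values and of the target.
feature_min, feature_max = X.min(), X.max()
target_min, target_max = y.min(), y.max()
print(feature_min, feature_max)
print(target_min, target_max)

# Probe the normalization: per-column mean and sum of squares.
column_means = X.mean(axis=0)
column_sq_sums = (X ** 2).sum(axis=0)
print(np.round(column_means, 6))
print(np.round(column_sq_sums, 6))
```

If the column means print as roughly 0 and the sums of squares as roughly 1, the features have been standardized in the way described.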

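Dataset 4: handwritten digit recognition (multi-class classification with 10 classes, 0 through 9), referred to as digits
import sklearn.datasets
digits = sklearn.datasets.load_digits()
There are 1797 samples in total, roughly 180 per class. Each handwritten digit is an 8 x 8 image, and each pixel is an integer from 0 to 16.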
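A short sketch for inspecting the digits data follows. It relies on digits.data holding each image flattened to 64 values and digits.images holding the same pixels as 8 x 8 arrays. The variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target
images = digits.images  # same pixels as X, kept as 8 x 8 arrays

print(X.shape)       # one flattened 64-value row per image
print(images.shape)  # one 8 x 8 array per image

# Pixel intensity range across the whole dataset.
pixel_min, pixel_max = X.min(), X.max()
print(pixel_min, pixel_max)

# Number of samples for each digit 0-9.
class_counts = np.bincount(y)
print(class_counts)

# Each row of X is the corresponding image flattened row by row.
print(np.array_equal(X[0], images[0].ravel()))
```

Most scikit-learn estimators expect the flattened digits.data form, while digits.images is convenient for plotting.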

sklearn library usage:

https://blog.csdn.net/qq_30141957/article/details/80760474
