Introduction to Machine Learning and Datasets, Dataset Splitting, Feature Extraction, Normalization

Date: 2024-03-04 07:12:51

Introduction to machine learning and datasets

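Machine learning:
  Machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation
theory, and the theory of complex algorithms. It uses the computer as a tool to simulate, as faithfully and in
real time as possible, the way humans learn, and it organizes existing content into knowledge structures to
make learning more efficient.
  It is hard to define precisely. Put simply, machine learning uses mathematical methods and computing to
analyze historical data, derive patterns (a model) from it, and then use those patterns to make predictions
on unseen data.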
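
To make "derive a pattern from historical data, then predict unseen data" concrete, here is a minimal sketch in plain Python. The numbers are invented for illustration, and the helper names `fit_line` and `predict` are hypothetical; a straight-line fit is only one of many possible kinds of model.

```python
# Invented historical data: hours studied (feature) and exam score (target).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (the 'pattern')."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned pattern to a value that was not in the data."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(hours, scores)   # learn from historical data
print(predict(model, 6.0))        # predict for an unseen input
```

The two steps mirror the definition: `fit_line` is the "analyze historical data" part, and `predict` is the "apply the pattern to unknown data" part.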
 
Datasets:

  Machine learning derives patterns from historical data, so what does that historical data look like?

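Where to get datasets:
  1. scikit-learn: small datasets, convenient for learning
  2. Kaggle: a big-data competition platform with 800,000 scientists; real data, very large volumes
  3. UCI: hosts 360 datasets covering science, daily life, economics, and other fields; data volume in the hundreds of thousands
Common dataset structure: feature values + target values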

 

 # Note: some datasets have no target values. Each row is one sample, each column is one feature, and the value to be predicted is the target.
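
The "rows are samples, columns are features" layout can be sketched with a small NumPy array. The housing numbers below are invented for illustration only, and the variable names are assumptions rather than anything scikit-learn requires.

```python
import numpy as np

# Hypothetical toy dataset: each row is one sample (a house),
# each column is one feature (area in square metres, number of rooms).
X = np.array([
    [50.0, 2.0],
    [80.0, 3.0],
    [120.0, 4.0],
])

# Target values: one per sample (price, arbitrary units).
# A dataset without targets would consist of X alone.
y = np.array([100.0, 160.0, 240.0])

n_samples, n_features = X.shape
first_sample = X[0]      # one row: all features of the first house
area_column = X[:, 0]    # one column: the "area" feature of every house

print(n_samples, n_features)
print(first_sample)
print(area_column)
```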


scikit-learn 

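  scikit-learn is a machine learning toolkit for Python

  1. Simple and efficient tools for data mining and data analysis
  2. Accessible to everyone and reusable in various contexts
  3. Built on NumPy, SciPy, and matplotlib
  4. Open source and commercially usable (BSD license)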

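scikit-learn dataset API

1. sklearn.datasets
    1.1 Loads and fetches popular datasets
    1.2 datasets.load_*(): loads small datasets that ship with scikit-learn
    1.3 datasets.fetch_*(data_home=None)
    Fetches larger datasets, which must be downloaded from the network. The data_home parameter sets the download directory for the dataset; the default is ~/scikit_learn_data/
2. load_* and fetch_* return a Bunch object (sklearn.utils.Bunch in current versions, datasets.base.Bunch in older ones), which behaves like a dictionary
    data: feature array, a 2-D numpy.ndarray of shape [n_samples, n_features]
    target: label array, a 1-D numpy.ndarray of length n_samples
    DESCR: description of the dataset
    feature_names: feature names; may be absent for some datasets, such as news data, handwritten digits, and regression datasets, depending on the scikit-learn version
    target_names: label names
# Regarding point 2: load_* is for small datasets, fetch_* is for large datasets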
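
As a rough illustration of the Bunch fields listed above, the sketch below loads the small wine dataset bundled with scikit-learn. `load_wine` and `fetch_20newsgroups` are real scikit-learn functions; the helper `download_news` and its `cache_dir` argument are hypothetical names, and the helper is deliberately never called here because it needs network access.

```python
from sklearn.datasets import fetch_20newsgroups, load_wine

# load_*: the data ships with scikit-learn, no download needed.
wine = load_wine()

print(type(wine).__name__)       # the container class
print(wine.data.shape)           # (n_samples, n_features)
print(wine.target.shape)         # (n_samples,)
print(wine.feature_names[:3])    # first few feature names
print(wine.target_names)         # label names

# A Bunch is dictionary-like: keys and attributes refer to the same objects.
same_object = wine["data"] is wine.data
print(same_object)


def download_news(cache_dir=None):
    """Hypothetical helper: fetch_* downloads the data on first use.

    With cache_dir=None, scikit-learn falls back to its default data
    directory; pass a path to store the download somewhere else.
    """
    return fetch_20newsgroups(data_home=cache_dir, subset="train")
```

Calling `download_news()` once caches the files, so later calls read from disk instead of downloading again.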

 

Using scikit-learn:

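# Import
from sklearn.datasets import load_iris    # load_iris loads the iris dataset
 
# Load the iris data
li = load_iris()

print('Feature values', li.data)  # iris features, already labelled and ready to use
print('Target values', li.target)    # 3 classes

# In a notebook or REPL, evaluating each expression below displays its value
li.DESCR  # description of the iris dataset
li.feature_names  # feature names: sepal and petal length and width
li.target_names  # label names
#1 Feature values  # too many values, so only part of the output is shown
Feature values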
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 
#2 Target values
Target values
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

#3 Description
'.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n                \n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n.. topic:: References\n\n   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...'

#4 Feature names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

#5 Label names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
The output above was copied from running the code.
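
The integer targets and the label names in the output fit together by position: target 0 is the first name, 1 the second, and so on. The sketch below shows one possible way to check that and to count the samples per class. The variable names are assumptions; `return_X_y=True` is an existing `load_iris` option that returns the two arrays directly instead of a Bunch.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Count how many samples belong to each class (targets are 0, 1, 2).
counts = np.bincount(y)
for name, count in zip(iris.target_names, counts):
    print(name, count)

# Map each integer target to its label name by indexing target_names.
species = iris.target_names[y]
print(species[0], species[-1])

# Alternative: get the feature matrix and targets as a plain tuple.
X2, y2 = load_iris(return_X_y=True)
print(X2.shape, y2.shape)
```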