Predict using classification methods in R



In this analysis I’ll build a model that predicts whether a tumor is malignant or benign, based on data from a study on breast cancer. Classification algorithms will be used in the modelling process.


The dataset


The data for this analysis refer to 569 patients from a study on breast cancer. The actual data can be found at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The variables were computed from a digitized image of a breast mass and describe characteristics of the cell nuclei present in the image. In particular, the variables are the following:


  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter² / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)
  11. type (tumor can be either malignant (M) or benign (B))
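The post itself does not show a data-loading step, so below is only a minimal sketch of how the raw UCI file could be read into R. The file URL, the column names and the choice to keep only the mean measurement of each feature are assumptions made for illustration, not the author’s original code.

```r
# Sketch: load the Wisconsin Diagnostic Breast Cancer data (assumed layout:
# id, diagnosis, then 30 numeric features -- mean, SE and "worst" of 10 measures)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"

features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave_points", "symmetry",
              "fractal_dimension")
col_names <- c("id", "type",
               paste0(features, "_mean"),
               paste0(features, "_se"),
               paste0(features, "_worst"))

wdbc <- read.csv(url, header = FALSE, col.names = col_names)

# Keep the target plus the mean measurements, to match the variable list above
tumor <- wdbc[, c("type", paste0(features, "_mean"))]
names(tumor) <- c("type", features)
tumor$type <- factor(tumor$type, levels = c("B", "M"))
str(tumor)
```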

Exploratory Analysis

It is essential to have an overview of the dataset. Below is a box-plot of each predictor against the target variable (tumor type). The log values of the predictors are used instead of the actual values, for a better view of the plot.


[Figure: box-plots of the log-transformed predictors by tumor type. Image by Author]
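The plot above could be produced along the following lines (a sketch using ggplot2 and tidyr, assuming the `tumor` data frame built earlier; this is not the author’s original code):

```r
library(ggplot2)
library(tidyr)

# Reshape to long format so every predictor gets its own facet,
# then box-plot the log of each predictor by tumor type.
# Predictors equal to zero give -Inf after the log and are dropped from the plot.
tumor_long <- pivot_longer(tumor, cols = -type,
                           names_to = "predictor", values_to = "value")

ggplot(tumor_long, aes(x = type, y = log(value), fill = type)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free_y") +
  labs(x = "Tumor type", y = "log(value)")
```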

It seems that for most predictors the malignant tumors have higher values than the benign ones.


Now let’s see if the predictors are correlated. Below is a scatter-plot matrix of all predictors.


[Figure: scatter-plot matrix of all predictors. Image by Author]
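A scatter-plot matrix like the one above can be drawn with base R’s `pairs()` (a sketch, again assuming the `tumor` data frame; the colour coding is an illustrative choice):

```r
# Scatter-plot matrix of all predictors, points coloured by tumor type
pairs(tumor[, -1],
      col = ifelse(tumor$type == "M", "red", "blue"),
      pch = 20, cex = 0.5)
```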

We can see that there are some predictors that are strongly related, as expected, such as radius, perimeter & area.


A correlogram will serve us better and quantify all correlations.


[Figure: correlogram of the predictors. Image by Author]
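One way to build such a correlogram is with the corrplot package (a sketch; the author’s exact plotting code is not shown in the post):

```r
library(corrplot)

# Correlation matrix of the numeric predictors, visualised as a correlogram
corr_matrix <- cor(tumor[, -1])
corrplot(corr_matrix, method = "number", type = "upper", tl.cex = 0.8)
```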

We can also spot some further correlations, such as concave points with concavity and compactness, as well as concave points against radius, perimeter and area.


Predicting using classification methods

In the first part of this analysis, the goal is to predict, using classification methods, whether a tumor is malignant or benign based on the variables computed from the digitized image. Classification is the problem of identifying which of a set of categories (sub-populations) a new observation belongs to, on the basis of a training set of data containing observations (or instances) whose category membership is known.


So we must develop a model that classifies (categorizes) each tumor (case) as either malignant or benign.


Classification is performed with two different methods: Logistic Regression and Decision Trees.


Feature selection

It is important to use only significant predictors while building the prediction model. You don’t need to use every feature at your disposal for creating an algorithm. You can assist the algorithm by feeding in only those features that are really important. Some reasons for this are listed below:


  • It enables the machine learning algorithm to train faster.
  • It reduces the complexity of a model and makes it easier to interpret.
  • It improves the accuracy of a model if the right subset is chosen.
  • It reduces over-fitting.

In particular, I used stepwise (forward & backward) logistic regression on the data, since the dataset is small. This method is computationally very expensive, so it is not recommended for very large datasets.

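A sketch of this selection step is shown below, using `stepAIC()` from the MASS package on a full logistic regression model; whether the author used this exact function is an assumption.

```r
library(MASS)

# Full logistic regression model with all predictors,
# then stepwise selection in both directions, guided by AIC
full_model <- glm(type ~ ., data = tumor, family = binomial)
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(step_model)
```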

After reviewing the stepwise selection, the following predictors were chosen for all model building:


  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. area
  4. smoothness (local variation in radius lengths)
  5. concave points (number of concave portions of the contour)
  6. symmetry

Logistic Regression

Logistic regression is a parametric statistical learning method, used for classification especially when the outcome is binary. Logistic regression models the probability that a new observation belongs to a particular category. To fit the model, a method called maximum likelihood is used. Below is an implementation of logistic regression.

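The post shows this implementation as an image; a minimal sketch of such a fit is given below, using the predictors kept by the stepwise selection. The 70/30 train/test split is an assumption for illustration, since the post does not show how the data were partitioned.

```r
set.seed(1)

# Hold out a test set so the confusion matrices below reflect out-of-sample behaviour
train_idx <- sample(nrow(tumor), size = round(0.7 * nrow(tumor)))
train <- tumor[train_idx, ]
test  <- tumor[-train_idx, ]

# Logistic regression with the selected predictors;
# with type coded as B/M, the model estimates P(malignant)
log_model <- glm(type ~ radius + texture + area + smoothness + concave_points + symmetry,
                 data = train, family = binomial)
summary(log_model)
```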

[Figure: summary output of the logistic regression model. Image by Author]

By looking at the summary output of the logistic regression model, we can see that almost all coefficients are positive, indicating that higher measurements mean a higher probability of a malignant tumor.


An important step here is to evaluate the predictive ability of the model. Because the model’s predictions are probabilities, we must decide on the threshold that will split the two possible outcomes. At first I’ll try the default threshold of 0.5. Below is a confusion matrix of the predictions made with this threshold.

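A sketch of how the 0.5 threshold, the confusion matrix and the two error rates could be computed (continuing the hypothetical split above):

```r
# Predicted probabilities of a malignant tumor on the test set
probs <- predict(log_model, newdata = test, type = "response")

# Classify with the default 0.5 threshold and build a confusion matrix
pred_05 <- factor(ifelse(probs > 0.5, "M", "B"), levels = levels(test$type))
conf_mat <- table(Predicted = pred_05, Actual = test$type)
conf_mat

# Overall accuracy, type I error (benign predicted malignant)
# and type II error (malignant predicted benign)
accuracy    <- mean(pred_05 == test$type)
type_I_err  <- conf_mat["M", "B"] / sum(conf_mat[, "B"])
type_II_err <- conf_mat["B", "M"] / sum(conf_mat[, "M"])
c(accuracy = accuracy, type_I = type_I_err, type_II = type_II_err)
```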

[Figure: confusion matrix of the predictions with the 0.5 threshold. Image by Author]

The overall accuracy of the model is 96.47% (a 3.53% error rate). But in this specific case we must distinguish between the different types of error: type I and type II errors. In our case these are similar (type II error = 3.74%, type I error = 3.17%). A type I error means that a benign tumor is predicted to be malignant, while a type II error means that a malignant tumor is predicted to be benign. A type II error is more costly, and we must find ways to eliminate it (even if doing so increases the type I error).


Below, I increased the threshold to 0.8, which changes the resulting predictions.

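The post does not show in which direction the 0.8 cut-off is applied; the sketch below assumes it is applied to the probability of a benign tumor (a case is called benign only when the model is at least 80% sure), which matches the described effect of removing type II errors at the cost of more type I errors.

```r
# Call a tumor benign only when P(benign) >= 0.8,
# i.e. predict malignant whenever P(malignant) > 0.2
pred_08 <- factor(ifelse(1 - probs >= 0.8, "B", "M"), levels = levels(test$type))
table(Predicted = pred_08, Actual = test$type)
```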

[Figure: confusion matrix of the predictions with the 0.8 threshold. Image by Author]

Although the overall accuracy of the model remains the same, the type II error is now eliminated while the type I error has increased. In other words, we now have a model that correctly identifies every malignant tumor, but also wrongly predicts some benign tumors as malignant (9.5%).


Decision Trees

Decision trees consist of a series of split points, often referred to as nodes. In order to make a prediction using a decision tree, we start at the top of the tree at a single node known as the root node. The root node is a decision or split point, because it places a condition on the value of one of the input features, and based on this decision we know whether to continue with the left or the right branch of the tree. We repeat this process of choosing to go left or right at each inner node that we encounter, until we reach one of the leaf nodes. These are the nodes at the base of the tree, which give us a specific value of the output to use as our prediction.

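A classification tree for the same task could be fit as follows (a sketch with the rpart package and the same hypothetical train/test split; the author’s tree implementation is not shown in the post):

```r
library(rpart)

# Classification tree on the training data, using the selected predictors
tree_model <- rpart(type ~ radius + texture + area + smoothness + concave_points + symmetry,
                    data = train, method = "class")

# Predicted classes on the test set and the resulting confusion matrix
tree_pred <- predict(tree_model, newdata = test, type = "class")
table(Predicted = tree_pred, Actual = test$type)
```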

[Figure: prediction output of the decision tree model. Image by Author]

When fitting the decision tree, as seen from the output, the overall accuracy is 94.1% (a 5.9% error rate), which for this specific domain is relatively low. In particular, the type II error is 5.61% and the type I error is 6.35%. The model’s predictive performance is poorer than that of the previous one (logistic regression).


Now let’s create a classification tree plot of the model.

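One way to draw the tree is with the rpart.plot package (an assumption; any tree-plotting function would do):

```r
library(rpart.plot)

# Plot the fitted classification tree; each node shows the predicted class,
# the class probabilities and the share of observations reaching it
rpart.plot(tree_model, extra = 104)
```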

[Figure: classification tree plot. Image by Author]

From the plot above, we can assume that concave points and texture are the most important predictors of tumor type (they form the splits of the classification tree).


Results

Finally, after building various models using different algorithms, the logistic regression model is chosen based on its performance (details in the table below).


[Figure: performance comparison of the models. Image by Author]

In particular, after adjusting the threshold, it eliminates the type II error (wrongly predicting malignant tumors as benign), which is really important in this specific problem.


As expected, parametric methods such as logistic regression perform better in this case, where we have a small dataset (569 observations).


While our analysis is an interesting step, it is based on a limited sample of cases. A larger sample of cases would probably lead us to a better classification model.


For the full code click here


Originally published at https://www.manosantoniou.com.


Translated from: https://towardsdatascience.com/predict-using-classification-methods-in-r-173477062576
