RDataMining Series: Chapter 1 Introduction

Date: 2021-07-12 19:17:54

What exactly is data mining?

Once you have a working knowledge of the following subfields, you can call yourself a data miner (DMer):

• Clustering
• Classification
• Association Rules
• Sequential Patterns
• Time Series Analysis
• Text Mining

 

***************

A roundup of R packages and functions for each area:

***************

1. Clustering

• Packages:
– fpc
– cluster
– pvclust
– mclust
• Partitioning-based clustering: kmeans, pam, pamk, clara
• Hierarchical clustering: hclust, pvclust, agnes, diana
• Model-based clustering: mclust
• Density-based clustering: dbscan
• Plotting cluster solutions: plotcluster, plot.hclust
• Validating cluster solutions: cluster.stats
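As a minimal sketch of the partitioning and hierarchical approaches listed above, the following uses only `kmeans` and `hclust` from base R on the built-in `iris` data. The choice of three clusters, the seed, and average linkage are assumptions made for illustration, not recommendations from the book.

```r
# Partitioning and hierarchical clustering on the numeric iris columns.
x <- scale(iris[, 1:4])          # standardize so no variable dominates

set.seed(42)                     # kmeans starts from random centers
km <- kmeans(x, centers = 3, nstart = 20)
table(km$cluster, iris$Species)  # compare clusters with the known species

hc <- hclust(dist(x), method = "average")
groups <- cutree(hc, k = 3)      # cut the dendrogram into 3 groups
plot(hc, labels = FALSE, hang = -1)
```

The same scaled matrix `x` could be passed to `pam` or `clara` from the cluster package if a medoid-based solution is preferred.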

2. Classification
• Packages:
– rpart
– party
– randomForest
– rpartOrdinal
– tree
– marginTree
– maptree
– survival
• Decision trees: rpart, ctree
• Random forest: cforest, randomForest
• Regression, Logistic regression, Poisson regression: glm, predict, residuals
• Survival analysis: survfit, survdiff, coxph
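A decision tree from the list above can be sketched as follows. This assumes the rpart package is installed (it ships as a recommended package with standard R distributions); the 100/50 train/test split and the seed are arbitrary choices for illustration.

```r
library(rpart)  # recommended package, shipped with standard R installs

set.seed(1)
train_idx <- sample(nrow(iris), 100)       # 100 rows for training
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Classification tree predicting Species from the four measurements.
fit  <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

table(predicted = pred, actual = test$Species)  # confusion matrix
mean(pred == test$Species)                      # test-set accuracy
```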

3. Association Rules and Frequent Itemsets
• Packages:
– arules: mines frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules
– drm: regression and association models for repeated categorical data
• APRIORI algorithm, a level-wise, breadth-first algorithm that counts transactions: apriori, drm
• ECLAT algorithm, which uses equivalence classes, depth-first search and set intersection instead of counting: eclat
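The set-intersection idea behind ECLAT can be illustrated in base R without any package. This is only a sketch of how support is computed from transaction-ID lists, not the arules implementation; the toy transactions and the helper `support` are hypothetical.

```r
# Toy transactions (hypothetical data for illustration only).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter")
)

# Vertical layout: for each item, the IDs of the transactions containing it.
items <- sort(unique(unlist(transactions)))
tidlists <- lapply(items, function(item) {
  which(vapply(transactions, function(tr) item %in% tr, logical(1)))
})
names(tidlists) <- items

# Support of an itemset = size of the intersection of its items' tid-lists,
# divided by the number of transactions.
support <- function(itemset) {
  tids <- Reduce(intersect, tidlists[itemset])
  length(tids) / length(transactions)
}

support("milk")              # milk appears in 3 of the 4 transactions
support(c("bread", "milk"))  # bread and milk co-occur in 2 of the 4
```

In practice the `apriori` and `eclat` functions in arules do this work on sparse transaction objects and handle candidate generation and pruning as well.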
4. Sequential Patterns
• Package: arulesSequences
• SPADE algorithm (cSPADE variant): cspade
5. Time Series
• Package: timsac
• Time series construction: ts
• Decomposition: decomp, decompose, stl, tsr
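Construction and decomposition can be sketched with base R alone, using the built-in `AirPassengers` series. The multiplicative model and the log transform before `stl` are assumptions chosen because this particular series has seasonality that grows with its level.

```r
# Rebuild the monthly series explicitly to show how ts() is called.
ap <- ts(as.numeric(AirPassengers), start = c(1949, 1), frequency = 12)

dec <- decompose(ap, type = "multiplicative")  # classical decomposition
fit <- stl(log(ap), s.window = "periodic")     # loess-based decomposition

plot(dec)
head(fit$time.series)  # columns: seasonal, trend, remainder
```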

6. Statistics
• Packages: base R, nlme
• Analysis of Variance: aov, anova
• Density analysis: density
• Statistical tests: t.test, prop.test, anova, aov
• Linear mixed-effects model fit: lme
• Principal components and factor analysis: princomp
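A few of the functions above can be tried on `iris` as follows. This is a sketch using base R only; the particular comparisons (two species for the t-test, all three for ANOVA) are chosen purely for illustration.

```r
# Two-sample t-test: do setosa and versicolor differ in sepal length?
two <- droplevels(subset(iris, Species != "virginica"))
t.test(Sepal.Length ~ Species, data = two)

# One-way ANOVA across all three species.
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)

# Principal components of the four measurements, on the correlation scale.
pc <- princomp(iris[, 1:4], cor = TRUE)
summary(pc)   # proportion of variance explained per component
```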
7. Graphics
• Bar chart: barplot
• Pie chart: pie
• Scatter plot: plot; Cleveland dot plot: dotchart
• Histogram: hist
• Density: densityplot
• Candlestick chart, box plot: boxplot
• QQ (quantile-quantile) plot: qqnorm, qqplot, qqline
• Bi-variate plot: coplot
• Tree: rpart
• Parallel coordinates: parallel, paracoor, parcoord
• Heatmap, contour: contour, filled.contour
• Other plots: stripplot, sunflowerplot, interaction.plot, matplot, fourfoldplot, assocplot, mosaicplot
• Saving graphs: pdf, postscript, win.metafile, jpeg, bmp, png
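Saving graphs follows an open-device, draw, close-device pattern. The sketch below writes a histogram and a box plot to a PDF; the file name and the temporary directory are hypothetical choices for illustration.

```r
out <- file.path(tempdir(), "iris-plots.pdf")  # hypothetical output path

pdf(out, width = 8, height = 4)   # open the PDF device
par(mfrow = c(1, 2))              # two panels side by side
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
boxplot(Sepal.Length ~ Species, data = iris, ylab = "Sepal length (cm)")
dev.off()                         # close the device so the file is written
```

Swapping `pdf` for `png` or `jpeg` works the same way, except that those devices take width and height in pixels by default.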
8. Data Manipulation
• Missing values: na.omit
• Standardize variables: scale
• Transpose: t
• Sampling: sample
• Stack: stack, unstack
• Others: aggregate, merge, reshape
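The data manipulation functions above can be combined as in this sketch. The data frames `df` and `labels` are made-up examples, and the seed is arbitrary.

```r
df <- data.frame(
  id    = 1:6,
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(10, NA, 30, 40, 50, 60)
)

clean <- na.omit(df)                        # drop rows with any NA
clean$z <- as.numeric(scale(clean$value))   # center and scale to unit sd
means <- aggregate(value ~ group, data = clean, FUN = mean)

labels <- data.frame(group = c("a", "b", "c"),
                     label = c("alpha", "beta", "gamma"))
merged <- merge(clean, labels, by = "group")  # join on the shared key

set.seed(7)
picked <- clean[sample(nrow(clean), 3), ]     # random sample of 3 rows
```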

 

For more detail, see:

R Reference Card for Data Mining

 

9. Interface to Weka
• RWeka: an R/Weka interface that makes Weka functions available from R.

 

Datasets used in this book:

 Iris Dataset

Description: five attributes

• sepal length in cm,
• sepal width in cm,
• petal length in cm,
• petal width in cm, and
• class: Iris Setosa, Iris Versicolour, and Iris Virginica.

To inspect the structure of a dataset in R:

> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
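Beyond `str`, a few other base R calls give a quick overview of the same data. This is a small sketch of commonly used inspection functions rather than anything specific to the book.

```r
dim(iris)             # number of rows and columns
names(iris)           # column names
head(iris, 3)         # first three rows
summary(iris)         # per-column summary statistics
table(iris$Species)   # number of observations per class
```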

 

 Bodyfat Dataset

• age: age in years.
• DEXfat: body fat measured by DXA, response variable.
• waistcirc: waist circumference.
• hipcirc: hip circumference.
• elbowbreadth: breadth of the elbow.
• kneebreadth: breadth of the knee.
• anthro3a: sum of logarithm of three anthropometric measurements.
• anthro3b: sum of logarithm of three anthropometric measurements.
• anthro3c: sum of logarithm of three anthropometric measurements.
• anthro4: sum of logarithm of three anthropometric measurements.
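The post does not say where the bodyfat data comes from. The sketch below assumes it is loaded from the TH.data package, which is not part of a default R installation, so the code checks for the package first and leaves `bf` as `NULL` when it is missing.

```r
bf <- NULL
if (requireNamespace("TH.data", quietly = TRUE)) {
  # Assumption: the dataset is the "bodyfat" data shipped with TH.data.
  data("bodyfat", package = "TH.data", envir = environment())
  bf <- bodyfat
  str(bf)   # variable names and types, as for iris above
} else {
  message("TH.data is not installed; run install.packages('TH.data') first")
}
```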