I have a data set with NAs sprinkled generously throughout.
In addition, it has columns that need to be factors.
I am using the rfe() function from the caret package to select variables.
It seems that, with the functions= argument in rfe(), lmFuncs works for data with NAs but NOT for factor variables, while rfFuncs works for factor variables but NOT for NAs.
Any suggestions for dealing with this?
I tried model.matrix(), but it seems to just cause more problems.
1 solution
#1
3
Because packages behave inconsistently on these points, not to mention the extra trickiness of more "meta" packages like caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.
- For NAs, either omit the affected rows or impute the missing values (median, kNN, etc.).
- For factor features, you were on the right track with model.matrix(). It will let you generate a series of "dummy" features for the different levels of the factor. The typical usage is something like this:
> dat = data.frame(x=factor(rep(1:3, each=5)))
> dat$x
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> model.matrix(~ x - 1, data=dat)
   x1 x2 x3
1   1  0  0
2   1  0  0
3   1  0  0
4   1  0  0
5   1  0  0
6   0  1  0
7   0  1  0
8   0  1  0
9   0  1  0
10  0  1  0
11  0  0  1
12  0  0  1
13  0  0  1
14  0  0  1
15  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
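To tie the two points together, here is a minimal sketch in base R of handling NAs first and then building the dummy columns. The toy data frame, its column names (num, grp), and the helper impute_median() are made up for illustration and are not from the original question. One thing worth knowing: with R's default na.action setting (na.omit), model.matrix() silently drops any row that contains an NA, which may be one source of the "more problems" mentioned in the question.

```r
# Toy data (hypothetical): one numeric column with NAs and one factor column.
dat <- data.frame(
  num = c(1.5, NA, 3.0, 4.5, NA, 6.0),
  grp = factor(c("a", "b", "a", "c", "b", "c"))
)

# With the default na.action (na.omit), rows containing NA are dropped
# when the model matrix is built, so the row count shrinks.
x_dropped <- model.matrix(~ . - 1, data = dat)
nrow(x_dropped)

# Hypothetical helper: replace NAs in numeric columns with the column median.
# Non-numeric columns (such as factors) are returned unchanged.
impute_median <- function(v) {
  if (is.numeric(v)) {
    v[is.na(v)] <- median(v, na.rm = TRUE)
  }
  v
}

# Impute first, keeping the data frame structure and the factor column intact.
dat_imputed <- dat
dat_imputed[] <- lapply(dat, impute_median)

# Now expand the factor into dummy columns; all rows are kept.
x <- model.matrix(~ . - 1, data = dat_imputed)
dim(x)
```

The resulting numeric matrix x has no NAs and no factor columns, so it can be handed to a selection routine that cannot cope with either. Median imputation is only one of the options listed above; omitting rows or using a kNN-based imputer would slot into the same place.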
Also, in case you haven't read them yet (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points: http://cran.r-project.org/web/packages/caret/index.html