R caret / rfe variable selection with factor() and NAs

Time: 2021-08-06 20:37:45

I have a data set with NAs sprinkled generously throughout.

In addition, it has columns that need to be factors.

I am using the rfe() function from the caret package to select variables.

It seems that passing lmFuncs as the functions= argument to rfe() works for data with NAs but not for factor variables, while rfFuncs works for factor variables but not for NAs.

Any suggestions for dealing with this?

I tried model.matrix() but it seems to just cause more problems.

1 solution

#1


3  

Because of inconsistent behavior on these points between packages, not to mention the extra trickiness when going to more "meta" packages like caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.

  • For NAs, either omit or impute (median, knn, etc.).
  • For factor features, you were on the right track with model.matrix(). It will let you generate a series of "dummy" features for the different levels of the factor. The typical usage is something like this:
> dat = data.frame(x=factor(rep(1:3, each=5)))
> dat$x
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> model.matrix(~ x - 1, data=dat)
   x1 x2 x3
1   1  0  0
2   1  0  0
3   1  0  0
4   1  0  0
5   1  0  0
6   0  1  0
7   0  1  0
8   0  1  0
9   0  1  0
10  0  1  0
11  0  0  1
12  0  0  1
13  0  0  1
14  0  0  1
15  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
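
Putting the two bullets together, here is a minimal base-R sketch of cleaning the data up front. The toy data frame, the helper names impute_median() and fill_missing_level(), and the choice to recode missing factor values as an explicit "missing" level are all assumptions made for illustration; omitting rows or using a different imputation would be equally valid. Imputing first matters because model.matrix() drops rows containing NAs under the default na.action.

```r
# Toy data (hypothetical): a numeric column and a factor column, both with NAs.
dat <- data.frame(
  num = c(1.5, NA, 3.0, 4.5, NA, 6.0),
  grp = factor(c("a", "b", NA, "a", "c", "b"))
)

# Hypothetical helper: replace NAs in a numeric vector with its median.
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Hypothetical helper: turn NAs in a factor into an explicit "missing" level.
# This is an assumption for the sketch, not the only reasonable choice.
fill_missing_level <- function(f) {
  f <- as.character(f)
  f[is.na(f)] <- "missing"
  factor(f)
}

# Apply each helper to the columns of the matching type.
is_num <- vapply(dat, is.numeric, logical(1))
is_fac <- vapply(dat, is.factor, logical(1))
dat[is_num] <- lapply(dat[is_num], impute_median)
dat[is_fac] <- lapply(dat[is_fac], fill_missing_level)

# Expand factors into dummy columns; "- 1" drops the intercept so the
# factor gets one indicator column per level.
x <- model.matrix(~ . - 1, data = dat)
print(colnames(x))
```

The resulting x is an all-numeric matrix with no NAs, so it can be supplied as the predictor argument to rfe() with either lmFuncs or rfFuncs.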

Also, in case you haven't read them (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points: http://cran.r-project.org/web/packages/caret/index.html
