How do I fix the "invalid type (NULL) for variable" error in the ctree() function from R's party package?

Time: 2022-06-01 19:38:06

I am using ctree() from the party package in R. I want to be able to use columns from more than one data frame, referring to each column separately with $, as I have done in the past with this function. This time it is not working.

For the purposes of illustrating the error, I've put together a sample data set as a single dataframe. When I run:

>ctree(data$adult_age~data$child_age+data$freq)

I get the following error:

>Error in model.frame.default(formula = ~data$adult_age, data = list(),  : 
  invalid type (NULL) for variable 'data$adult_age'

If I run it like this, it works:

>ctree(adult_age~child_age+freq, data)

Usually those two ways of writing it out are interchangeable (e.g. with lm() I get the same results with both), but with ctree() I am running into an error. Why? How can I fix this so that I can pull from different dataframes at once without combining them?

My data structure looks like this:

> dput(data)

>structure(list(adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38), child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13), freq = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)), .Names = c("adult_age", "child_age", "freq"), class = "data.frame", row.names = c(12L, 13L, 14L, 15L, 18L, 20L, 22L, 23L, 24L, 25L))

If you want to run the sample data:

>adult_age <- c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38)
>child_age <- c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13)
>freq <- c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)
>data <- data.frame(adult_age, child_age, freq)

1 Answer

#1


Why this approach should not be applied

Never use data$ inside model formulas (as already pointed out by @Roland). Apart from unnecessarily repeating the data name and having to type more, it is a source of confusion and errors. If you haven't run into this problem with lm() yet, then you probably haven't used predict(). Consider a simple linear regression for your data:

m1 <- lm(adult_age ~ child_age, data = data)
m2 <- lm(data$adult_age ~ data$child_age)
coef(m1) - coef(m2)
## (Intercept)   child_age 
##           0           0 

Thus, both approaches lead to the same coefficient estimates etc. But whenever you want to use the same formula with different, updated, or subsetted data, you run into trouble. A prominent case is predict(), e.g., when making a prediction at child_age = 0. The intended usage, with formula and data kept separate, correctly recovers the intercept:

predict(m1, newdata = data.frame(child_age = 0))
##        1 
## 36.38919 
coef(m1)[1]
## (Intercept) 
##    36.38919 

But for the data$ version the newdata is not used at all in the actual prediction:

predict(m2, newdata = data.frame(child_age = 0))
##        1        2        3        4        5        6        7        8 
## 41.14343 41.14343 44.11483 38.46917 38.46917 44.11483 41.14343 41.14343 
##        9       10 
## 41.14343 44.11483 
## Warning message:
## 'newdata' had 1 row but variables found have 10 rows 

There are more examples like this, but this one should be serious enough to make you refrain from using data$ in formulas.
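One further case of the same kind is refitting on a subset. The sketch below is an illustration, not part of the original answer: the names `sub`, `m_sub`, and `m_all` and the choice of rows 6 to 10 are assumptions made for the example. It passes a five-row subset through the data argument and compares how many observations each formula style actually uses.

```r
# Sample data from the question.
data <- data.frame(
  adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38),
  child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13),
  freq = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)
)

# Hypothetical subset for illustration: rows 6 to 10 only.
sub <- data[6:10, ]

# Plain column names are looked up in `sub`, so the fit uses the subset.
m_sub <- lm(adult_age ~ child_age, data = sub)

# data$... refers to the full `data` object directly, so the data argument
# has no effect on which values enter the fit.
m_all <- lm(data$adult_age ~ data$child_age, data = sub)

# Number of observations used by each fit.
nobs(m_sub)
nobs(m_all)
```

Comparing the two nobs() values shows whether the subset was respected; with the data$ formula, lm() gives no warning that the data argument was bypassed.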

How it can be applied to ctree()

If you are determined to shoot yourself in the foot with the data$ approach, you can do so with the new (and recommended) implementation of ctree() in the partykit package. The whole formula/data handling was rewritten there and uses R's standard non-standard evaluation, so both ways of writing the call give the same tree:

library("partykit")
ctree(adult_age ~ child_age + freq, data = data)
## Model formula:
## adult_age ~ child_age + freq
## 
## Fitted party:
## [1] root: 41.500 (n = 10, err = 490.0) 
## 
## Number of inner nodes:    0
## Number of terminal nodes: 1
ctree(data$adult_age ~ data$child_age + data$freq)
## Model formula:
## data$adult_age ~ data$child_age + data$freq
## 
## Fitted party:
## [1] root: 41.500 (n = 10, err = 490.0) 
## 
## Number of inner nodes:    0
## Number of terminal nodes: 1
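For the original goal of using columns that live in separate data frames, one option that keeps formula and data separate is to assemble only the needed columns into a temporary data frame at call time. The following is a minimal sketch under stated assumptions: the data frames `adults` and `children` and the name `model_data` are hypothetical, their rows are assumed to be aligned (same length, same order), and lm() stands in for ctree() so that the example runs with base R alone.

```r
# Hypothetical source data frames; rows are assumed to be aligned.
adults <- data.frame(
  adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38)
)
children <- data.frame(
  child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13),
  freq = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)
)

# Build the modelling data on the fly instead of writing adults$... or
# children$... inside the formula.
model_data <- data.frame(
  adult_age = adults$adult_age,
  child_age = children$child_age,
  freq = children$freq
)

# Formula with plain column names, data passed separately.
fit <- lm(adult_age ~ child_age + freq, data = model_data)

# Because the formula uses plain names, predict() honours newdata.
pred <- predict(fit, newdata = data.frame(child_age = 0, freq = 0))
pred
```

The same call pattern should carry over to ctree(), i.e. ctree(adult_age ~ child_age + freq, data = model_data), and it keeps predict() with newdata working as intended.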
