How can I use tidyr to fill in completed rows within each value of a grouping variable?

Time: 2023-01-30 00:38:07

Say I have data on people who choose between several options. I have one row per person, and I want to have one row per person and choice option. So, if I have 10 people who have 3 choices, right now I have 10 rows, and I want to have 30.

All of the other variables should be copied to each of the new rows. So, for example, if I have a variable for gender, that should be constant within ID. (I am setting my data up this way to analyze with mnlogit.)

This seems like the situation that two tidyr functions, complete and fill, were designed for. To use a simple example:

library(lubridate)
library(tidyr)
dat <- data.frame(
    id = 1:3,
    choice = 5:7,
    c = c(9, NA, 11),
    d = ymd(NA, "2015-09-30", "2015-09-29")
    )

dat %>% 
  complete(id, choice) %>%
  fill(everything())

# Source: local data frame [9 x 4]
# 
#      id choice     c          d
#   (int)  (int) (dbl)     (time)
# 1     1      5     9       <NA>
# 2     1      6     9       <NA>
# 3     1      7     9       <NA>
# 4     2      5     9       <NA>
# 5     2      6     9 2015-09-30
# 6     2      7     9 2015-09-30
# 7     3      5     9 2015-09-30
# 8     3      6     9 2015-09-30
# 9     3      7    11 2015-09-29

But this has some problems -- the values of d were carried forward correctly, but the values of c from ID 1 replaced the (correct) NA values for ID 2.
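
To see where the carried-over values come from, it can help to look at the result of complete() before fill() runs. The sketch below is only an illustration: it rebuilds the example data with base as.Date() instead of lubridate (an assumption made to keep it self-contained), and the names completed and filled are not from the original post.

```r
library(tidyr)

# Same example data as above, with base R dates instead of lubridate::ymd()
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

# Step 1: complete() adds the six missing id/choice combinations.
# Every added row has NA in c and d, just like the genuine NA for id 2.
completed <- complete(dat, id, choice)
print(completed)

# Step 2: fill() walks down each column and cannot tell an added NA
# from an original one, so values cross from one id into the next.
filled <- fill(completed, c, d)
print(filled)
```

Because fill() replaces each NA with the last non-missing value above it, the result depends on row order rather than on id. In the table above this also affects d: the first two rows for id 3 pick up the date that belongs to id 2.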

I could try a workaround, like replacing all of the missing values with 999, running complete and fill, and then replacing 999 with NA. (I think I would have to convert the date variables to character variables and then convert them back again if I go this route.) But maybe someone on here knows of a tidy way to do this with tidyr?

Edit: the desired output here is:

# Source: local data frame [9 x 4]
# 
#     id     c          d choice
#  (int) (dbl)     (time)  (int)
# 1     1     9       <NA>      5
# 2     1     9       <NA>      6
# 3     1     9       <NA>      7
# 4     2    NA 2015-09-30      5
# 5     2    NA 2015-09-30      6
# 6     2    NA 2015-09-30      7
# 7     3    11 2015-09-29      5
# 8     3    11 2015-09-29      6
# 9     3    11 2015-09-29      7

4 Answers

#1


9  

You can "group" variables inside complete by wrapping them in c(). complete then expands only the combinations of the grouped variables that already exist in the data.

library(tidyr)
dat %>% complete(c(id, c, d), choice) 
#      id     c          d choice
#   (int) (dbl)     (time)  (int)
# 1     1     9       <NA>      5
# 2     1     9       <NA>      6
# 3     1     9       <NA>      7
# 4     2    NA 2015-09-30      5
# 5     2    NA 2015-09-30      6
# 6     2    NA 2015-09-30      7
# 7     3    11 2015-09-29      5
# 8     3    11 2015-09-29      6
# 9     3    11 2015-09-29      7

#2


11  

An update to @jeremycg's answer: from tidyr 0.5.1 (or maybe even version 0.4.0) onwards, c() no longer works inside complete. Use nesting() instead:

dat %>% 
 complete(nesting(id, c, d), choice) 

Note: I tried to edit @jeremycg's answer, since it was correct at the time it was written (so a new answer is not really necessary), but unfortunately the edit was rejected.
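
For anyone who wants to run the nesting() version from a clean session, here is a minimal sketch. It assumes a tidyr version that provides nesting(), rebuilds the question's data with base as.Date() rather than lubridate, and stores the result in out, a name chosen here for illustration.

```r
library(tidyr)

# Example data from the question (dates built with base R)
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

# nesting(id, c, d) keeps only the (id, c, d) combinations that occur
# in the data; each of them is then crossed with every value of choice.
out <- complete(dat, nesting(id, c, d), choice)
print(out)

# The person-level values stay attached to their own id,
# including the genuine NA in c for id 2.
stopifnot(nrow(out) == 9, all(is.na(out$c[out$id == 2])))
```

No fill() step is needed with this approach, because c and d are part of the expanded combinations and never become NA in the added rows.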

#3


2  

I think you're better off keeping the data separate while you prepare it, and then merging before you need to do the regression.

subjectdata <- dat[,c("id", "c", "d")]
questiondata <- dat[,c("id", "choice")] %>% complete(id, choice)

And then

merge(questiondata, subjectdata)
#   id choice  c          d
# 1  1      5  9       <NA>
# 2  1      6  9       <NA>
# 3  1      7  9       <NA>
# 4  2      5 NA 2015-09-30
# 5  2      6 NA 2015-09-30
# 6  2      7 NA 2015-09-30
# 7  3      5 11 2015-09-29
# 8  3      6 11 2015-09-29
# 9  3      7 11 2015-09-29

as necessary. That way you also get a valid d column for user 2, without relying on the order of questions in the data frame.
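
As a rough check of that claim, the sketch below repeats the split-and-merge steps in a self-contained form and then confirms that c and d are constant within each id. The explicit by = "id", the as.data.frame() call, and the names merged and per_id are additions made here for illustration, not part of the answer above.

```r
library(tidyr)

# Example data from the question (dates built with base R)
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

# Person-level variables: one row per id
subjectdata <- dat[, c("id", "c", "d")]

# Choice-level rows: every id crossed with every choice
questiondata <- as.data.frame(complete(dat[, c("id", "choice")], id, choice))

# Attach the person-level variables by id; row order plays no role here
merged <- merge(questiondata, subjectdata, by = "id")

# Each id should map to exactly one (c, d) pair
per_id <- unique(merged[, c("id", "c", "d")])
stopifnot(nrow(merged) == 9, nrow(per_id) == nrow(subjectdata))
```

Since the person-level values are joined on id rather than carried down the rows, shuffling the rows of dat beforehand should not change which values end up with which id.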

#4


0  

It looks like another approach is to use spread and gather. spread creates one column per possible answer, and gather takes the separate columns and reshapes them into rows. With these data:

library(dplyr)  # select() and arrange() come from dplyr
dat %>%
  spread(choice, choice) %>%
  gather(choice, drop_me, `5`:`7`) %>%  # Drop me is a redundant column
  select(-drop_me) %>%
  arrange(id, choice)  # reorders so that the answer matches

#   id  c          d choice
# 1  1  9       <NA>      5
# 2  1  9       <NA>      6
# 3  1  9       <NA>      7
# 4  2 NA 2015-09-30      5
# 5  2 NA 2015-09-30      6
# 6  2 NA 2015-09-30      7
# 7  3 11 2015-09-29      5
# 8  3 11 2015-09-29      6
# 9  3 11 2015-09-29      7

I haven't done any testing to see how these compare in efficiency.
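
In tidyr 1.0.0 and later, spread and gather are superseded by pivot_wider and pivot_longer, so the same idea can be sketched with the newer functions. This is an adaptation rather than the code above: the helper column present and the names wide and long are hypothetical, and names_transform assumes tidyr 1.1.0 or newer.

```r
library(tidyr)

# Example data from the question (dates built with base R)
dat <- data.frame(
  id = 1:3,
  choice = 5:7,
  c = c(9, NA, 11),
  d = as.Date(c(NA, "2015-09-30", "2015-09-29"))
)

# Helper column marking the choice each person actually made
dat$present <- TRUE

# One column per possible choice; id, c and d identify each row
wide <- pivot_wider(dat, names_from = choice, values_from = present)

# Back to long form: one row per id and choice column
long <- pivot_longer(
  wide,
  cols = -c(id, c, d),
  names_to = "choice",
  values_to = "present",
  names_transform = list(choice = as.integer)  # column names back to integers
)

# Drop the helper column, as drop_me is dropped above
long$present <- NULL
print(long)
```

Converting the column names back with as.integer keeps choice numeric, whereas gather without convert = TRUE returns it as character.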
