I am trying to fill in missing cases with information from a related individual (the other member of a couple).
My data looks like this
hserial sex age children
1 1001041 Male 30 Yes
2 1001041 Female 32 Yes
3 1001061 Male 22 No
4 1001061 Female 21 No
5 1001091 Male 38 Yes
6 1001091 Female 37 Yes
7 1001151 Male 31 No
8 1001151 Female 27 Not eligible
9 1001161 Male 33 Yes
10 1001161 Female 35 Yes
So hserial is the couple identifier. Row 8 has a missing value, "Not eligible", but the information is available from the partner (row 7).
I am trying to find a neat way to fill these missing values with the partner's info.
I was thinking of doing something like
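library(dplyr)
library(tidyr)  # spread() comes from tidyr
# one row per couple, with each partner's answer in its own column
childsum = dta %>% group_by(hserial, sex, children) %>%
summarise(n = n()) %>% spread(sex, children)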
I will get
hserial n Male Female
1 1001041 1 Yes Yes
2 1001061 1 No No
3 1001091 1 Yes Yes
4 1001151 1 No Not eligible
5 1001161 1 Yes Yes
Then I could do something like
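# as.character() guards against ifelse() returning integer codes if these columns are factors
childsum$Male = ifelse(childsum$Male == 'Not eligible', as.character(childsum$Female), as.character(childsum$Male))
childsum$Female = ifelse(childsum$Female == 'Not eligible', as.character(childsum$Male), as.character(childsum$Female))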
So every missing Male value is filled with the Female info and vice versa. Then I would merge the results back in order to get
hserial sex age children
1 1001041 Male 30 Yes
2 1001041 Female 32 Yes
3 1001061 Male 22 No
4 1001061 Female 21 No
5 1001091 Male 38 Yes
6 1001091 Female 37 Yes
7 1001151 Male 31 No
8 1001151 Female 27 No
9 1001161 Male 33 Yes
10 1001161 Female 35 Yes
Any idea how to do this in a neat way?
dta = structure(list(hserial = c(1001041, 1001041, 1001061, 1001061,
1001091, 1001091, 1001151, 1001151, 1001161, 1001161), sex = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Male", "Female"
), class = "factor"), age = c(30, 32, 22, 21, 38, 37, 31, 27,
33, 35), children = structure(c(5L, 5L, 6L, 6L, 5L, 5L, 6L, 4L,
5L, 5L), .Label = c("DNA Does not apply", "NA No answer", "NA No answer",
"Not eligible", "Yes", "No"), class = "factor")), class = "data.frame", .Names = c("hserial",
"sex", "age", "children"), row.names = c(NA, -10L))
1 Answer
Here's an approach which assumes that both members of a couple (the two rows sharing an hserial) should always have the same yes/no entry in children unless both persons have "Not eligible" entries. Therefore, it computes per couple the setdiff of the available children info and "Not eligible". In cases where all (both) entries are "Not eligible", it returns NA, since I think that's a better way to handle missing values (as you know, there are many specialized functions you can use with NAs that you cannot use the same way for "Not eligible" entries).
dta %>%
group_by(hserial) %>%
mutate(children = if(all(children == "Not eligible")) NA_character_ else
setdiff(children, "Not eligible"))
#Source: local data frame [10 x 4]
#Groups: hserial [5]
#
# hserial sex age children
# (dbl) (fctr) (dbl) (chr)
#1 1001041 Male 30 Yes
#2 1001041 Female 32 Yes
#3 1001061 Male 22 No
#4 1001061 Female 21 No
#5 1001091 Male 38 Yes
#6 1001091 Female 37 Yes
#7 1001151 Male 31 No
#8 1001151 Female 27 No
#9 1001161 Male 33 Yes
#10 1001161 Female 35 Yes
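The sample data has no couple in which both entries are "Not eligible", so the NA branch is never exercised above. The following is a minimal sketch of the same per-couple setdiff idea on a small made-up data frame that does include such a couple. The helper name fill_from_partner, its missing_code argument and the toy data frame are assumptions introduced for illustration and are not part of the original question or answer. Unlike the code above, the helper also leaves a couple untouched if the two partners report different known values.

```r
library(dplyr)

# Hypothetical helper (illustration only): decide the children value for one couple.
fill_from_partner <- function(x, missing_code = "Not eligible") {
  x <- as.character(x)                  # work on labels, not factor codes
  known <- setdiff(x, missing_code)     # distinct values that are not missing
  if (length(known) == 0) {
    return(rep(NA_character_, length(x)))  # nobody in the couple has info
  }
  if (length(known) == 1) {
    return(rep(known, length(x)))          # copy the single known value to both
  }
  x                                        # partners disagree: leave as is
}

# Made-up toy data: couple 2 has no usable information at all.
toy <- data.frame(
  hserial  = c(1, 1, 2, 2, 3, 3),
  children = c("No", "Not eligible", "Not eligible", "Not eligible", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

filled <- toy %>%
  group_by(hserial) %>%
  mutate(children = fill_from_partner(children)) %>%
  ungroup()

filled$children
# "No" "No" NA NA "Yes" "Yes"
```

Couple 1 picks up the partner's "No", couple 2 becomes NA for both rows, and couple 3 is unchanged.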