R - Fill missing information based on paired data

Time: 2021-12-09 23:40:59

I am trying to fill missing cases with information from a related individual (the other member of a couple).

My data looks like this

   hserial    sex age     children
1  1001041   Male  30          Yes
2  1001041 Female  32          Yes
3  1001061   Male  22           No
4  1001061 Female  21           No
5  1001091   Male  38          Yes
6  1001091 Female  37          Yes
7  1001151   Male  31           No
8  1001151 Female  27 Not eligible
9  1001161   Male  33          Yes
10 1001161 Female  35          Yes

Here hserial is the couple identifier. Row 8 has a missing value for children (coded as Not eligible), but the information is available from the partner (row 7).

I am trying to find a neat way to fill these missing values with the partner's information.

I was thinking of doing something like

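library(dplyr)
library(tidyr)  # spread() comes from tidyr, not dplyr

# one row per couple, with the Male and Female answers side by side
childsum = dta %>% group_by(hserial, sex, children) %>% 
summarise(n = n()) %>% spread(sex, children) 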

I will get

  hserial n Male       Female
1 1001041 1  Yes          Yes
2 1001061 1   No           No
3 1001091 1  Yes          Yes
4 1001151 1   No Not eligible
5 1001161 1  Yes          Yes

Then I could do something like

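# as.character() avoids ifelse() returning integer codes if these columns are factors
childsum$Male = ifelse(childsum$Male == 'Not eligible', as.character(childsum$Female), as.character(childsum$Male))
childsum$Female = ifelse(childsum$Female == 'Not eligible', as.character(childsum$Male), as.character(childsum$Female))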

So every missing Male value is filled with the Female value, and vice versa. Then I would merge the results back in order to get

   hserial    sex age     children
1  1001041   Male  30          Yes
2  1001041 Female  32          Yes
3  1001061   Male  22           No
4  1001061 Female  21           No
5  1001091   Male  38          Yes
6  1001091 Female  37          Yes
7  1001151   Male  31           No
8  1001151 Female  27           No
9  1001161   Male  33          Yes
10 1001161 Female  35          Yes
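
For reference, the "fill from the partner" idea above can also be sketched without reshaping to wide format and merging back. This is a minimal base R sketch, not a definitive solution: it uses a small stand-in data frame with children stored as character (the real dta stores it as a factor), and the helper names partner_sex, partner_row and missing are made up for illustration. It assumes each hserial contains exactly one Male and one Female row.

```r
# Small stand-in for the data above (assumption: children is character, not factor)
dta <- data.frame(
  hserial  = c(1001041, 1001041, 1001151, 1001151),
  sex      = c("Male", "Female", "Male", "Female"),
  age      = c(30, 32, 31, 27),
  children = c("Yes", "Yes", "No", "Not eligible"),
  stringsAsFactors = FALSE
)

# Find each person's partner: same hserial, opposite sex
partner_sex <- ifelse(dta$sex == "Male", "Female", "Male")
partner_row <- match(paste(dta$hserial, partner_sex),
                     paste(dta$hserial, dta$sex))

# Replace "Not eligible" with whatever the partner reported
missing <- dta$children == "Not eligible"
dta$children[missing] <- dta$children[partner_row[missing]]
```

If both partners are Not eligible the value stays Not eligible, and if no partner row exists the value becomes NA, so either case remains visible afterwards.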

Any idea how to do this in a neat way?

dta = structure(list(hserial = c(1001041, 1001041, 1001061, 1001061, 
1001091, 1001091, 1001151, 1001151, 1001161, 1001161), sex = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Male", "Female"
), class = "factor"), age = c(30, 32, 22, 21, 38, 37, 31, 27, 
33, 35), children = structure(c(5L, 5L, 6L, 6L, 5L, 5L, 6L, 4L, 
5L, 5L), .Label = c("DNA Does not apply", "NA No answer", "NA No answer", 
"Not eligible", "Yes", "No"), class = "factor")), class = "data.frame", .Names = c("hserial", 
"sex", "age", "children"), row.names = c(NA, -10L))

1 solution

#1


Here's an approach that assumes both members of a couple (the two rows sharing an hserial) should always have the same Yes/No entry in children, unless both have Not eligible entries. For each couple it therefore takes the setdiff of the available children values and "Not eligible". When all (both) entries are "Not eligible", it returns NA, since I think that's a better way to handle missing values (many specialized functions work with NA and cannot treat Not eligible entries the same way).

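library(dplyr)

dta %>% 
  group_by(hserial) %>%   # one group per couple
  # NA if the whole couple is "Not eligible", otherwise the remaining value
  mutate(children = if(all(children == "Not eligible")) NA_character_ else 
                       setdiff(children, "Not eligible"))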
#Source: local data frame [10 x 4]
#Groups: hserial [5]
#
#   hserial    sex   age children
#     (dbl) (fctr) (dbl)    (chr)
#1  1001041   Male    30      Yes
#2  1001041 Female    32      Yes
#3  1001061   Male    22       No
#4  1001061 Female    21       No
#5  1001091   Male    38      Yes
#6  1001091 Female    37      Yes
#7  1001151   Male    31       No
#8  1001151 Female    27       No
#9  1001161   Male    33      Yes
#10 1001161 Female    35      Yes
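
The same NA-based idea can be sketched in base R with ave(). This is only an illustrative variant under stated assumptions: the data frame is a small made-up stand-in with children stored as character, and fill_from_partner is a hypothetical helper, not part of any package. Unlike the setdiff() version, it fills a missing entry only when the couple's known answers agree.

```r
# Made-up stand-in data: couple 2 has one missing entry, couple 3 has two
dta <- data.frame(
  hserial  = c(1, 1, 2, 2, 3, 3),
  children = c("Yes", "Yes", "No", "Not eligible",
               "Not eligible", "Not eligible"),
  stringsAsFactors = FALSE
)

# Treat "Not eligible" as a proper missing value
dta$children[dta$children == "Not eligible"] <- NA

# Hypothetical helper: fill NA within one couple from the known answer
fill_from_partner <- function(x) {
  known <- unique(x[!is.na(x)])
  if (length(known) == 1) x[is.na(x)] <- known  # fill only when unambiguous
  x
}

# Apply the helper separately to each couple
dta$children <- ave(dta$children, dta$hserial, FUN = fill_from_partner)
```

Couples where both entries are missing keep NA, which matches the behaviour described above.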
