How to fill in a data frame in R based on values in other rows and columns

Time: 2022-09-11 13:46:19

Suppose I have a data frame that looks like this:

ID   T  X  Y  Z
1    1  A  A  NA
1    2  B  A  NA
1    3  B  B  NA
1    4  B  A  NA
2    1  A  B  NA
2    2  A  A  NA
2    3  B  A  NA
2    4  A  B  NA
3    1  B  B  NA
3    2  B  B  NA
3    3  B  B  NA
3    4  B  A  NA
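For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: the name df matches the code further down, and storing X and Y as character rather than factor is an assumption.

```r
# Rebuild the example table by hand (12 rows: 3 IDs x 4 time points).
# Assumption: X and Y are plain character columns, Z starts out all NA.
df <- data.frame(
  ID = rep(1:3, each = 4),
  T  = rep(1:4, times = 3),
  X  = c("A", "B", "B", "B",  "A", "A", "B", "A",  "B", "B", "B", "B"),
  Y  = c("A", "A", "B", "A",  "B", "A", "A", "B",  "B", "B", "B", "A"),
  Z  = NA,
  stringsAsFactors = FALSE
)

print(df)
```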

And I would like to replace the value of Z based on some conditionals that depend on both row and (previous) column values so that the above ends up looking like this:

ID   T  X  Y  Z
1    1  A  A  0
1    2  B  A  0
1    3  B  B  1
1    4  B  A  NA
2    1  A  B  0
2    2  A  A  0
2    3  B  A  0
2    4  A  B  0
3    1  B  B  1
3    2  B  B  NA
3    3  B  B  NA
3    4  B  A  NA

The rules:

  1. Z takes the value 1 the first time (in order of T, within an ID) that both X and Y on that row have the value B.
  2. Z takes (or retains) the value NA if and only if it has already taken the value 1 at some smaller value of T.
  3. When T = 1, Z takes the value 0 if X and Y on that row do not both equal B.
  4. When T > 1, Z takes the value 0 if X and Y on that row do not both equal B and the value of Z on the previous row is 0.

I want the following to work, and it gets me kinda close but no dice:

df$Z <- NA
for (t in 1:4) {
  df$Z[ (df$X=="B" & df$Y=="B") & df$T==1] <- 1
  df$Z[!(df$X=="B" & df$Y=="B") & df$T==1] <- 0
  if (t>1) {
    df$Z[ (df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 0
    df$Z[!(df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 1
  }
}

On the other hand, I can write a series of nested if ... then statements looping across all observations, but that is excruciatingly slow (at least compared to the Stata program I am translating from).

I am sure I am committing twelve kinds of gaffes in my attempt above, but a few hours of banging my head on this has not resolved it.

So I come to you begging, hat in hand. :)

Edit: it occurs to me that sharing the Stata code (which resolves this so much faster than what I have come up with in R, which is ironic, given my preference for R over Stata's language :) might help with suggestions. This does what I want, and does it fast (even with, say, N=1600, T=11):

replace Z = .
forvalues t = 1(1)4 {
  replace Z = 1 if X == "B" & Y == "B" & T == 1
  replace Z = 0 if !(X == "B" & Y == "B") & T == 1
  replace Z = 1 if X == "B" & Y == "B" & T == `t' & Z[_n-1] == 0 & `t' > 1
  replace Z = 0 if !(X == "B" & Y == "B") & T == `t' & Z[_n-1] == 0 & `t' > 1
}

2 Solutions

#1


2  

Here's one approach using ave and transform:

transform(dat[order(dat$ID, dat$T), ],
          Z = ave(X == "B" & Y == "B", ID, FUN = function(x) {
            as.integer("is.na<-"(x, (duplicated(x) & cumsum(x)) | 
                                    c(0, diff(x)) < 0)) }))

#    ID T X Y  Z
# 1   1 1 A A  0
# 2   1 2 B A  0
# 3   1 3 B B  1
# 4   1 4 B A NA
# 5   2 1 A B  0
# 6   2 2 A A  0
# 7   2 3 B A  0
# 8   2 4 A B  0
# 9   3 1 B B  1
# 10  3 2 B B NA
# 11  3 3 B B NA
# 12  3 4 B A NA

where dat is the name of your data frame. The reordering (dat[order(dat$ID, dat$T), ]) is not necessary if the rows are already ordered along ID and T.
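To see what the anonymous function passed to ave is doing, it may help to run its pieces on a single ID. The sketch below uses the both-B indicator for ID 1 from the example. The names x, repeated, fell_back and mask are made up for illustration and are not part of the code above.

```r
# Both-B indicator (X == "B" & Y == "B") for ID 1, rows T = 1..4
x <- c(FALSE, FALSE, TRUE, FALSE)

# A value that has occurred before, at a position where at least one TRUE
# has been seen so far (cumsum(x) is nonzero and is coerced to TRUE by &)
repeated <- duplicated(x) & cumsum(x)

# A TRUE -> FALSE step; this catches a first FALSE that directly follows
# the first TRUE, which duplicated() alone would miss
fell_back <- c(0, diff(x)) < 0

mask <- repeated | fell_back

z <- x
is.na(z) <- mask     # same effect as "is.na<-"(x, mask)
z <- as.integer(z)
print(z)             # 0 0 1 NA
```

Everything after the first both-B row ends up in mask and is set to NA, while the first both-B row itself stays TRUE and becomes 1.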

#2


1  

Another possibility using by:

ll <- by(df, df$ID, function(x){        
  x$Z <- cumsum(cumsum(x$X == "B" & x$Y == "B"))
  x$Z[x$Z > 1] <- NA
  x
})

df2 <- do.call(rbind, ll)
df2
#      ID T X Y  Z
# 1.1   1 1 A A  0
# 1.2   1 2 B A  0
# 1.3   1 3 B B  1
# 1.4   1 4 B A NA
# 2.5   2 1 A B  0
# 2.6   2 2 A A  0
# 2.7   2 3 B A  0
# 2.8   2 4 A B  0
# 3.9   3 1 B B  1
# 3.10  3 2 B B NA
# 3.11  3 3 B B NA
# 3.12  3 4 B A NA

Same function but using ddply instead:

library(plyr)
df2 <- ddply(.data = df, .variables = .(ID), function(x){        
  x$Z <- cumsum(cumsum(x$X == "B" & x$Y == "B"))
  x$Z[x$Z > 1] <- NA
  x
})

df2
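One way to read the double cumsum used above: the inner cumsum counts how many both-B rows have been seen so far within the ID, so the outer cumsum is 0 before the first such row, exactly 1 on it, and at least 2 on every later row. Here is a minimal base-R check of that reading. It assumes the rows are already sorted by T within ID, rebuilds the example data by hand, and uses ave in place of by/ddply, which is my substitution and not the code above.

```r
# Example data from the question, rebuilt by hand
# (assumption: character columns, rows sorted by ID and T)
df <- data.frame(
  ID = rep(1:3, each = 4),
  T  = rep(1:4, times = 3),
  X  = c("A", "B", "B", "B",  "A", "A", "B", "A",  "B", "B", "B", "B"),
  Y  = c("A", "A", "B", "A",  "B", "A", "A", "B",  "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

both_b <- df$X == "B" & df$Y == "B"

# Per ID: 0 before the first both-B row, 1 on it, >= 2 afterwards
z <- ave(as.integer(both_b), df$ID, FUN = function(x) cumsum(cumsum(x)))
z[z > 1] <- NA
df$Z <- z

# Z column the question asks for
expected <- c(0L, 0L, 1L, NA,  0L, 0L, 0L, 0L,  1L, NA, NA, NA)
print(identical(df$Z, expected))
```

If the data are not sorted, ordering them first with df[order(df$ID, df$T), ] should be enough.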
