R: Replacing NA values between data frames based on an index

I have this data df.1:

   month  a           b           c
       1  0 0.000000000 0.000000000
       2  0 0.000000000 0.001503194
       3  0 0.000000000 0.000000000
       4  0 0.000000000 0.000000000
       5  0 0.000000000 0.000000000
       6  0 0.000000000 0.000000000
       7  0 0.000000000 0.000000000
       8  0 0.000000000 0.000000000
       9  0 0.000000000 0.000000000
      10  0 0.000000000 0.000000000
      11 NA          NA          NA
      12 NA          NA          NA
       1  0 0.000000000 0.000000000
       2  0 0.001537279 0.006917756
       3  0 0.000000000 0.003669725
       4  0 0.000000000 0.000000000
       5  0 0.000000000 0.000000000
       6  0 0.000000000 0.000000000
       7  0 0.000000000 0.000000000
       8  0 0.000000000 0.000000000
       9  0 0.000000000 0.000000000
      10  0 0.000000000 0.000000000
      11  0 0.000000000 0.013513514
      12 NA          NA          NA

and this data df.2:

month     a         b         c
    1  0.03842077 0.002266291 0.000000000 
    2  0.01359501 0.001027937 0.000000000 
    3  0.08631519 0.008732519 0.001376147 
    4  0.26564710 0.083635347 0.019053692 
    5  0.34839088 0.152203121 0.021010075 
    6  0.31767367 0.152029019 0.029397773 
    7  0.31507761 0.110973916 0.023445471 
    8  0.29773872 0.096458381 0.026745770 
    9  0.31226976 0.109342562 0.023996392 
    10 0.23841220 0.081582743 0.021674228 
    11 0.04379016 0.003519300 0.000000000 
    12 0.02244389 0.002493766 0.000000000

I would like to substitute the NA values (and only the NA values) in df.1[,2:4] with the values in df.2[,2:4] where the index in column 1 (month) is the same. I tried this code:

res_new <- data.frame(matrix(nrow=nrow(df.1),ncol=3))
for (n in 1:12){
res_new <- data.frame(ifelse(is.na(df.1[which(df.1[,1] == n),2:4])==TRUE,df.2[which(df.2[,1] == n),2:4],df.1[,n]))

  }

but the result is a big new matrix in which each NA value in df.1 is substituted with all the values in df.2.

How can I do this? (My actual data frames are much bigger.)

4 solutions

#1

Assuming that the missing values occur as whole rows that you want to fill in, you can do this in two steps using which and match.

# find the rows of df.1 that contain missing values
missRows <- which(!complete.cases(df.1))
# fill in missing rows with rows in df.2 with matching months
df.1[missRows, ] <- df.2[match(df.1$month[missRows], df.2$month, nomatch=0),]

Note that rows with missing values are identified with !complete.cases, so such a row is replaced as a whole (month and all value columns), even if only some of its values are NA. The nomatch=0 argument drops months for which no match is found in df.2.
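
If a month with missing values has no counterpart in df.2, nomatch=0 removes it from the right-hand side, which then has fewer rows than the left-hand side of the assignment. A minimal sketch of one way to guard against that, filling only the rows for which a match exists; the toy data and the names rows_with_na, idx and found are made up for illustration:

```r
# Toy data (assumed for illustration): month 13 has no counterpart in df.2
df.1 <- data.frame(month = c(1, 2, 13), a = c(0, NA, NA), b = c(0, NA, NA))
df.2 <- data.frame(month = 1:12, a = 100 + 1:12, b = 200 + 1:12)

# rows of df.1 that contain at least one NA
rows_with_na <- which(!complete.cases(df.1))

# position of each of those months in df.2 (NA where df.2 has no such month)
idx <- match(df.1$month[rows_with_na], df.2$month)
found <- !is.na(idx)

# overwrite only the rows that have a matching month in df.2
df.1[rows_with_na[found], ] <- df.2[idx[found], ]
df.1
```

The row for month 13 keeps its NA values because nothing in df.2 matches it.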

#2

The first 12 rows of the data:

df.1 <- data.frame(
  month = 1:12, 
  a = c(rep(0, 10), NA, NA), 
  b = c(rep(0, 10), NA, NA), 
  c = c(0, 0.001503194, rep(0, 8), NA, NA)
)

df.2 <- data.frame(
  month = 1:12,
  a = c(0.03842077, 0.01359501, 0.08631519, 0.2656471, 0.34839088, 0.31767367, 
        0.31507761, 0.29773872, 0.31226976, 0.2384122, 0.04379016, 0.02244389), 
  b = c(0.002266291, 0.001027937, 0.008732519, 0.083635347, 0.152203121, 
        0.152029019, 0.110973916, 0.096458381, 0.109342562, 0.081582743, 
        0.0035193, 0.002493766 ), 
  c = c(0, 0, 0.001376147, 0.019053692, 0.021010075, 0.029397773, 0.023445471,
        0.02674577, 0.023996392, 0.021674228, 0, 0)
)

Solution

This solution also handles rows in which only some of the columns are NA. It may be slow on large data, but it gets the job done.

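for (row in seq_len(nrow(df.1))) {
  # loop over the value columns (everything except month)
  for (col in names(df.1)[-1]) {
    # replace only NA cells, and only if the same row of df.2 has the same month
    if (is.na(df.1[row, col]) && df.1[row, "month"] == df.2[row, "month"]) {
      df.1[row, col] <- df.2[row, col]
    }
  }
}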
df.1

   month          a           b           c
1      1 0.00000000 0.000000000 0.000000000
2      2 0.00000000 0.000000000 0.001503194
3      3 0.00000000 0.000000000 0.000000000
4      4 0.00000000 0.000000000 0.000000000
5      5 0.00000000 0.000000000 0.000000000
6      6 0.00000000 0.000000000 0.000000000
7      7 0.00000000 0.000000000 0.000000000
8      8 0.00000000 0.000000000 0.000000000
9      9 0.00000000 0.000000000 0.000000000
10    10 0.00000000 0.000000000 0.000000000
11    11 0.04379016 0.003519300 0.000000000
12    12 0.02244389 0.002493766 0.000000000

Explanation

Using a double loop, we check every element in columns a to c. If the element is not NA, we move on to the next one. Otherwise, we check whether the month in the same row of df.2 is the same, and if so, we replace the element with the corresponding one from df.2. Because rows are compared by position, this relies on df.1 and df.2 having their months in the same rows, as in the 12-row sample above.
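
For larger data, the same cell-wise replacement can be done without an explicit double loop. The sketch below is one possible vectorized variant, not the answerer's code: it uses match to align the months, so repeated months in df.1 are handled as well. The toy values in df.2 and the names value_cols, idx, na_mask, replacement and filled are illustrative assumptions.

```r
# Toy data (assumed for illustration): df.1 repeats the months 1..12 twice
df.1 <- data.frame(
  month = c(1:12, 1:12),
  a = c(rep(0, 10), NA, NA, rep(0, 11), NA),
  b = c(rep(0, 10), NA, NA, rep(0, 11), NA)
)
df.2 <- data.frame(month = 1:12, a = 100 + 1:12, b = 200 + 1:12)

value_cols <- setdiff(names(df.1), "month")

# for every row of df.1, the row of df.2 that has the same month
idx <- match(df.1$month, df.2$month)

# logical matrix marking the NA cells, and the candidate replacement values
filled <- as.matrix(df.1[value_cols])
na_mask <- is.na(filled)
replacement <- as.matrix(df.2[idx, value_cols])

# overwrite only the NA cells, then write the columns back
filled[na_mask] <- replacement[na_mask]
df.1[value_cols] <- filled
df.1
```

Non-NA cells are left untouched, and a month that is absent from df.2 would simply stay NA.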

#3

Considering that you have a larger dataframe, I would try to avoid merging the tables. You can use ifelse to get the job done.

month <- c(1:12, 1:12)
a <- c(rep(0,10), NA, NA, rep(0,11), NA)
b <- c(rep(0,10), NA, NA, 0,.0015,rep(0,9), NA)
c <- c(0,.0015,rep(0,8), NA, NA, 0,.0069, .0036,rep(0,7), .0135, NA)
df.1 <- data.frame(month,a,b,c)

df.2 <- data.frame(month=c(1:12), a=rep(1,12), b=rep(2,12), c=rep(3,12))

df.1$a <- ifelse(is.na(df.1$a), df.2$a[match(df.1$month, df.2$month)], df.1$a)
df.1$b <- ifelse(is.na(df.1$b), df.2$b[match(df.1$month, df.2$month)], df.1$b)
df.1$c <- ifelse(is.na(df.1$c), df.2$c[match(df.1$month, df.2$month)], df.1$c)

> df.1
   month a      b      c
1      1 0 0.0000 0.0000
2      2 0 0.0000 0.0015
3      3 0 0.0000 0.0000
4      4 0 0.0000 0.0000
5      5 0 0.0000 0.0000
6      6 0 0.0000 0.0000
7      7 0 0.0000 0.0000
8      8 0 0.0000 0.0000
9      9 0 0.0000 0.0000
10    10 0 0.0000 0.0000
11    11 1 2.0000 3.0000
12    12 1 2.0000 3.0000
13     1 0 0.0000 0.0000
14     2 0 0.0015 0.0069
15     3 0 0.0000 0.0036
16     4 0 0.0000 0.0000
17     5 0 0.0000 0.0000
18     6 0 0.0000 0.0000
19     7 0 0.0000 0.0000
20     8 0 0.0000 0.0000
21     9 0 0.0000 0.0000
22    10 0 0.0000 0.0000
23    11 0 0.0000 0.0135
24    12 1 2.0000 3.0000
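
With many columns, repeating the ifelse line for each one becomes tedious. As a rough sketch, the same match-based lookup could be wrapped in a small helper that loops over the columns; fill_na_by_key and its arguments are hypothetical names, not part of the answer above or of any package.

```r
# Hypothetical helper: fill the NAs in `target` from `lookup`, aligning rows by `key`
fill_na_by_key <- function(target, lookup, key = "month") {
  # for every row of target, the row of lookup with the same key (NA if none)
  idx <- match(target[[key]], lookup[[key]])
  # only touch value columns that exist in both data frames
  for (col in intersect(setdiff(names(target), key), names(lookup))) {
    miss <- is.na(target[[col]])
    target[[col]][miss] <- lookup[[col]][idx[miss]]
  }
  target
}

df.1 <- data.frame(month = c(1:12, 1:12),
                   a = c(rep(0, 10), NA, NA, rep(0, 11), NA),
                   b = c(rep(0, 10), NA, NA, 0, 0.0015, rep(0, 9), NA))
df.2 <- data.frame(month = 1:12, a = rep(1, 12), b = rep(2, 12))

res <- fill_na_by_key(df.1, df.2)
res[c(11, 12, 24), ]
```

Indexing with idx[miss] looks up values only for the NA cells, so existing values are never overwritten.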

#4

This may not be the best way, but an approach like this could work:

df1 <- data.frame(month = 1:12,
                  a = c(rep(1, 10), NA, NA),
                  b = c(rep(2, 11), NA))

df2 <- data.frame(month = 1:12,
                  a = rnorm(12),
                  b = rnorm(12))

# first, merge both data frames by the key, in this case the month
new_df <- merge(df1, df2, by = "month")

# then use a vectorized operation with the ifelse function
new_df$imp_a <- ifelse(!is.na(new_df$a.x), new_df$a.x, new_df$a.y)

# then drop the temporary columns or keep only the
# newly imputed columns
new_df

If you need to impute many columns, you could wrap the ifelse step in a function, like this:

impute <- function(df, col1, col2) {
  # impute the NA values of col1 with the values of col2, creating a new column
  new_name <- paste("new", col1, sep = "_")
  df[[new_name]] <- ifelse(!is.na(df[[col1]]), df[[col1]], df[[col2]])
  df
}

impute(new_df, "a.x", "a.y")
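
To cover the cleanup step as well, the merge-based idea could be applied to every value column in one loop that also drops the temporary .x/.y columns. This is only a sketch under assumed data: df2 uses fixed values instead of rnorm so the result is predictable, and merged is an illustrative name.

```r
df1 <- data.frame(month = 1:12,
                  a = c(rep(1, 10), NA, NA),
                  b = c(rep(2, 11), NA))
# fixed values (assumed) instead of rnorm, so the output is reproducible
df2 <- data.frame(month = 1:12, a = rep(10, 12), b = rep(20, 12))

# all.x = TRUE keeps rows of df1 even if their month is missing from df2
merged <- merge(df1, df2, by = "month", all.x = TRUE)

for (col in c("a", "b")) {
  x <- paste0(col, ".x")  # column coming from df1
  y <- paste0(col, ".y")  # column coming from df2
  merged[[col]] <- ifelse(is.na(merged[[x]]), merged[[y]], merged[[x]])
  # drop the temporary columns
  merged[[x]] <- NULL
  merged[[y]] <- NULL
}
merged
```

Keep in mind that merge sorts the result by the key by default, so the original row order of df1 is not necessarily preserved.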
