Summarize a data frame to return non-NA values along subsets

Time: 2022-11-01 01:38:08

Hoping that someone can help me with a trick. I've found similar questions online, but none of the examples I've seen do exactly what I'm looking for or work on my data structure.

I need to remove NAs from a data frame along data subsets and compress the remaining non-NA values into one row for each data subset.

Example:

#create example data
a <- c(1, 1, 1, 2, 2, 2) #this is the subsetting variable in the example
b <- c(NA, NA, "B", NA, NA, "C") #max 1 non-NA value for each subset
c <- c("A", NA, NA, "A", NA, NA)
d <- c(NA, NA, 1, NA, NA, NA) #some subsets for some columns have all NA values

dat <- as.data.frame(cbind(a, b, c, d)) 

> desired output
  a b c    d
  1 B A    1
  2 C A <NA>

Rules of thumb:

  1) NA values need to be removed from each column
  2) Loop along data subsets (column "a" in the example above)
  3) Every column has at most 1 non-NA value per subset, but some columns may be all NA within a subset

Ideas:

  • lapply or dplyr is probably helpful for looping over all columns
  • na.omit is likely helpful if the subsetting column, which has entries for all rows, can be ignored (something like as.data.frame(lapply(dat.admin, na.omit))). The issue is converting the lapply output back to a data frame when some subsets don't return any non-NA values
  • x[which.min(is.na(x))] effectively accomplishes this, if laboriously applied to each individual column
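
To illustrate the last idea on a single vector, here is a minimal base R sketch. The helper name first_non_na and the two test vectors are made up for illustration. It relies on FALSE sorting before TRUE, so which.min(is.na(x)) points at the first non-NA entry, and falls back to position 1 (an NA) when everything is missing.

```r
# Hypothetical helper: return the first non-NA element of x, or NA if all are NA
first_non_na <- function(x) x[which.min(is.na(x))]

one_value   <- c(NA, NA, "B")   # one non-NA value, as in a typical subset
all_missing <- c(NA, NA, NA)    # a subset where the column is entirely NA

first_non_na(one_value)    # "B"
first_non_na(all_missing)  # NA
```

Because the all-NA case still returns a length-1 result, every column yields exactly one value per subset, which avoids the ragged lapply output mentioned in the second bullet.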

Any help is appreciated to put the final pieces together! Thank you!

3 solutions

#1

One solution uses dplyr::summarise_all. The data first needs to be grouped on a with group_by.

library(dplyr)

dat %>%
  group_by(a) %>%
  summarise_all(funs(.[which.min(is.na(.))]))
# # A tibble: 2 x 4
#    a      b      c      d     
#   <fctr> <fctr> <fctr> <fctr>
# 1   1      B      A      1     
# 2   2      C      A      <NA>  
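
In current dplyr releases, funs() is deprecated (since 0.8.0) and summarise_all() is superseded by across() (since 1.0.0). The following is a sketch of the same idea in the newer style, assuming dplyr 1.0.0 or later. The example data is rebuilt with data.frame() instead of cbind(), which is a change from the question: it keeps a and d numeric rather than coercing every column to one type.

```r
library(dplyr)

# Example data from the question, built with data.frame() (an assumption;
# the original used as.data.frame(cbind(...)))
dat <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(NA, NA, "B", NA, NA, "C"),
  c = c("A", NA, NA, "A", NA, NA),
  d = c(NA, NA, 1, NA, NA, NA),
  stringsAsFactors = FALSE
)

# For each group of a, keep the first non-NA value of every other column;
# a column that is all NA within a group yields NA
res <- dat %>%
  group_by(a) %>%
  summarise(across(everything(), ~ .x[which.min(is.na(.x))]))

res
```

Inside summarise(), across(everything(), ...) skips the grouping column, so a does not need to be excluded by hand.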

#2

Solution with data.table and na.omit

library(data.table)
merge(setDT(dat)[,a[1],keyby=a], setDT(dat)[,na.omit(.SD),keyby=a],all.x=TRUE)

I think the merge statement can be improved
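
One way the merge might be avoided is to apply the x[which.min(is.na(x))] idiom from the question to every column of .SD within each group. This is only a sketch, not a drop-in replacement for the code above: it builds the table with data.table() directly (an assumption; the question used cbind()), and the helper name first_non_na is hypothetical.

```r
library(data.table)

# Example data from the question, built directly as a data.table (assumption)
dat <- data.table(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(NA, NA, "B", NA, NA, "C"),
  c = c("A", NA, NA, "A", NA, NA),
  d = c(NA, NA, 1, NA, NA, NA)
)

# Hypothetical helper: first non-NA element, or NA when the column is all NA
first_non_na <- function(x) x[which.min(is.na(x))]

# One row per value of a; .SD holds every column except the grouping column
res <- dat[, lapply(.SD, first_non_na), keyby = a]

res
```

Since the helper always returns exactly one element per column, each group contributes a single row and no second pass is needed to restore groups that are entirely NA in some column.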

#3

Not really sure if this is what you're looking for, but this might work for you. It at least replicates the small sample output you're looking for:

library(dplyr)
library(tidyr)

dat %>% 
  filter_at(vars(b:c), any_vars(!is.na(.))) %>% 
  group_by(a) %>% 
  fill(b) %>% 
  fill(c) %>% 
  filter_at(vars(b:c), all_vars(!is.na(.)))

# A tibble: 2 x 4
# Groups:   a [2]
       a      b      c      d
  <fctr> <fctr> <fctr> <fctr>
1      1      B      A      1
2      2      C      A     NA

You could also use just dplyr:

dat %>%
  group_by(a) %>%
  summarise_each(funs(first(.[!is.na(.)])))  
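
If no extra packages are available, the same per-group logic can be sketched in base R with split() and lapply(). This is an illustration under assumptions rather than part of the original answer: the data is rebuilt with data.frame(), and first_non_na is a hypothetical helper name.

```r
# Example data from the question, built with data.frame() (assumption)
dat <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(NA, NA, "B", NA, NA, "C"),
  c = c("A", NA, NA, "A", NA, NA),
  d = c(NA, NA, 1, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-NA element, or NA when the column is all NA
first_non_na <- function(x) x[which.min(is.na(x))]

# Split into one data frame per value of a, collapse each to a single row,
# then stack the rows back together
pieces <- lapply(split(dat, dat$a), function(g) {
  as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
})
res <- do.call(rbind, pieces)
rownames(res) <- NULL

res
```

The grouping column a is passed through the same helper, which is harmless here because a is constant and non-NA within each piece.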
