How do I sum the number of non-empty elements in several factor-type columns?

Date: 2023-01-22 22:56:31

Given a data frame like this one:

df <- data.frame(list(Group = c("Group1", "Group1", "Group2", "Group2"),
                      A=c("Some text", "Text here too", "Some other text", NA), 
                      B=c(NA, "Some random text", NA, "Random here too")))
> df
   Group               A                B
1 Group1       Some text             <NA>
2 Group1   Text here too Some random text
3 Group2 Some other text             <NA>
4 Group2            <NA>  Random here too

I would like to count the non-missing values in columns A and B separately for each group, resulting in the following data frame:

> df.expected
   Group A_n B_n
1 Group1   2   1
2 Group2   1   1

Although this is a toy example (the original data frame has far more columns and groups, so the result is not easy to produce by hand), I am not succeeding because I can't do arithmetic on factors. I'm also afraid my approach (see below) is too verbose and maybe overkill, which makes it unsuitable for my real data frame with its many columns.

This is what I've done so far:

# Manually create a new numeric column with numbers.
df$A_n = as.character(df$A)
df$A_n[!is.na(df$A_n)] <- 1
df$A_n = as.numeric(df$A_n)

df$B_n = as.character(df$B)
df$B_n[!is.na(df$B_n)] <- 1
df$B_n = as.numeric(df$B_n)

This part works, although I suspect there is a better, shorter, or semi-automated way to create the new columns and assign their values. It may even be unnecessary.

The second part of my code groups the observations by the grouping variable and sums the values in each column using dplyr:

library(dplyr)  

df2 = df %>% 
      select(Group, A_n, B_n) %>% 
      group_by(Group) %>% 
      summarise_all(sum)

However, I am getting an unexpected data frame:

> df2
# A tibble: 2 x 3
   Group   A_n   B_n
  <fctr> <dbl> <dbl>
1 Group1     2    NA
2 Group2    NA    NA

Can anyone help me tackle this problem in a better way and/or tell me what I am doing wrong in the dplyr code block?

2 Answers

#1


What am I doing wrong with dplyr's code block?

It's because there are NAs: sum() returns NA as soon as a group contains a missing value. Try

library(dplyr)  

df2 = df %>% 
      select(Group, A_n, B_n) %>% 
      group_by(Group) %>% 
      summarise_all(sum, na.rm=TRUE)

instead.

Output on my machine:

# A tibble: 2 x 3
   Group   A_n   B_n
  <fctr> <dbl> <dbl>
1 Group1     2     1
2 Group2     1     1
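
To see why the original call produced NA, here is a minimal base R sketch of how sum() treats missing values. The vector x is made up for illustration and mimics one group's A_n column after the manual recoding:

```r
# One group's recoded column: 1 where text was present, NA where it was missing.
x <- c(1, NA, 1)

# By default, any NA makes the whole sum NA.
default_sum <- sum(x)

# With na.rm = TRUE, the NAs are dropped before summing.
clean_sum <- sum(x, na.rm = TRUE)

print(default_sum)  # NA
print(clean_sum)    # 2
```

summarise_all() forwards extra arguments such as na.rm = TRUE to the summary function, which is why adding it fixes the result.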

I'm afraid my approach ... is too verbose and maybe overkill

You can just do this:

df <- data.frame(list(Group = c("Group1", "Group1", "Group2", "Group2"),
                      A=c("Some text", "Text here too", "Some other text", NA), 
                      B=c(NA, "Some random text", NA, "Random here too")))

library(dplyr)

df2 = df %>% 
    group_by(Group) %>% 
    summarise_all(.funs=function(x) length(na.omit(x)))

Output on my machine:

# A tibble: 2 x 3
   Group     A     B
  <fctr> <int> <int>
1 Group1     2     1
2 Group2     1     1
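
In current dplyr releases, summarise_all() is superseded by across(). The following is a sketch of the same count written that way, assuming dplyr 1.0.0 or later is installed; the "_n" suffix is chosen only to match the column names in the question:

```r
library(dplyr)

df <- data.frame(Group = c("Group1", "Group1", "Group2", "Group2"),
                 A = c("Some text", "Text here too", "Some other text", NA),
                 B = c(NA, "Some random text", NA, "Random here too"))

# Count non-missing values in every non-grouping column, per group.
# .names appends "_n" to each original column name.
df2 <- df %>%
  group_by(Group) %>%
  summarise(across(everything(), ~ sum(!is.na(.x)), .names = "{.col}_n"))

print(df2)
```

Inside a grouped summarise(), everything() skips the grouping column, so this scales to any number of value columns without listing them.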

A little explanation

If you look at help(summarise_all), you'll see its arguments are .tbl, .funs, and ... (we won't worry about the ellipsis for now). We feed df into group_by() using the pipe %>%, then feed the result into summarise_all(), again using the pipe. That takes care of the .tbl argument. The .funs argument specifies which function(s) should be used to summarise every non-grouping column in .tbl. Here we want to know how many elements of each column are not NA, which we can do (as one approach) by applying length(na.omit(x)) to each non-grouping column x in .tbl.
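
As a small illustration of what that summary function does to a single column, here is a base R sketch. The helper name count_non_na is hypothetical and used only for this example:

```r
# Hypothetical helper wrapping the function passed to .funs above.
count_non_na <- function(x) length(na.omit(x))

# Column A from the example data frame.
a <- c("Some text", "Text here too", "Some other text", NA)

n_omit <- count_non_na(a)   # drop the NAs, then count what is left
n_flag <- sum(!is.na(a))    # equivalent: sum the TRUE/FALSE "not missing" flags

print(n_omit)  # 3
print(n_flag)  # 3
```

Both forms give the same count; sum(!is.na(x)) avoids building the intermediate vector that na.omit() creates.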

My best suggestion for a resource to learn about dplyr is Chapter 5 of R for Data Science, a book by Hadley Wickham, who wrote the dplyr package (among many others).

#2


In base R, you can use aggregate with the standard interface (as opposed to the formula interface).

aggregate(cbind(A_n=df$A, B_n=df$B),  df["Group"], function(x) sum(!is.na(x)))
   Group A_n B_n
1 Group1   2   1
2 Group2   1   1

Use cbind to combine the variables to be summarised and give them their names. In the second argument, supply the grouping variables. Then, as your function, sum the logical indicator of non-missing elements, !is.na(x).
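
For a data frame with many columns, listing each one inside cbind() gets tedious. One possible variation, sketched here under the assumption that every column other than Group should be counted, passes the value columns as a data frame instead:

```r
df <- data.frame(Group = c("Group1", "Group1", "Group2", "Group2"),
                 A = c("Some text", "Text here too", "Some other text", NA),
                 B = c(NA, "Some random text", NA, "Random here too"))

# Every column except the grouping column (assumption: all of them should be counted).
value_cols <- setdiff(names(df), "Group")

# Standard (non-formula) interface: value columns, grouping list, summary function.
res <- aggregate(df[value_cols], df["Group"], function(x) sum(!is.na(x)))

print(res)
#    Group A B
# 1 Group1 2 1
# 2 Group2 1 1
```

The result keeps the original column names; rename them afterwards if the A_n and B_n names are needed.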
