如何使用Dplyr的Summarize和which()查找最小/最大值

时间:2021-11-04 20:16:11

I have the following data:

我有以下数据:

Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,12,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D") 
data <- data.frame(Name, Age, Group)

And I'd like to use dplyr to

我想用dplyr来

(1) group the data by "Group" (2) show the min and max Age within each Group (3) show the Name of the person with the min and max ages

(1)按“组”分组数据(2)显示每组内的最小和最大年龄(3)显示最小和最大年龄的人的姓名

The following code does this:

以下代码执行此操作:

data %>% group_by(Group) %>%
     summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))], 
               maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])

Which works well:

哪个效果很好:

  Group minAge minAgeName maxAge maxAgeName
1     A     22        Sam     22        Sam
2     B     12      Sarah     58      James
3     C     17     Andrew     82      Sally
4     D     12     Mairin     67        Ray

However, I have a problem if there are multiple min or max values:

但是,如果有多个最小值或最大值,我会遇到问题:

Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,31,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D") 
data <- data.frame(Name, Age, Group)

> data %>% group_by(Group) %>%
+   summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))], 
+             maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Error: expecting a single value

I'm looking for two solutions:

我正在寻找两种解决方案:

(1) where it doesn't matter which min or max name is shown, just that one is shown (i.e., the first value found) (2) where if there are "ties" all minimum values and maximum values are shown

(1)如果显示哪个最小或最大名称无关紧要,只显示一个(即找到的第一个值)(2)如果存在“关系”,则显示所有最小值和最大值

Please let me know if this isn't clear and thanks in advance!

如果不清楚请提前告知我们并提前致谢!

3 个解决方案

#1


8  

I would actually recommend keeping your data in a "long" format. Here's how I would approach this:

我实际上建议您将数据保持为“长”格式。这是我如何处理这个问题:

library(dplyr)

Keeping all values when there are ties:

有关系时保留所有价值观:

data %>%
  group_by(Group) %>%
  arrange(Age) %>%  ## optional
  filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
# 
#     Name Age Group
# 1    Sam  22     A
# 2  Sarah  31     B
# 3    Jim  31     B
# 4  James  58     B
# 5 Andrew  17     C
# 6  Sally  82     C
# 7 Mairin  12     D
# 8    Ray  67     D

Keeping only one value when there are ties:

有关系时只保留一个值:

data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
# 
#     Name Age Group
# 1    Sam  22     A
# 2  Sarah  31     B
# 3  James  58     B
# 4 Andrew  17     C
# 5  Sally  82     C
# 6 Mairin  12     D
# 7    Ray  67     D

If you really want a "wide" dataset, the basic concept would be to gather and spread the data, using "tidyr":

如果你真的想要一个“广泛”的数据集,那么基本的概念就是使用“tidyr”来收集和传播数据:

library(dplyr)
library(tidyr)

data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(c(1, n())) %>%
  mutate(minmax = c("min", "max")) %>%
  gather(var, val, Name:Age) %>%
  unite(key, minmax, var) %>%
  spread(key, val)
# Source: local data frame [4 x 5]
# 
#   Group max_Age max_Name min_Age min_Name
# 1     A      22      Sam      22      Sam
# 2     B      58    James      31    Sarah
# 3     C      82    Sally      17   Andrew
# 4     D      67      Ray      12   Mairin

Though what wide form you would want with ties is unclear.

虽然你想要的关系是什么样的广泛形式尚不清楚。

#2


8  

You can use which.min and which.max to get the first value.

您可以使用which.min和which.max来获取第一个值。

data %>% group_by(Group) %>%
  summarize(minAge = min(Age), minAgeName = Name[which.min(Age)], 
            maxAge = max(Age), maxAgeName = Name[which.max(Age)])

To get all values, use e.g. paste with an appropriate collapse argument.

要获取所有值,请使用例如使用适当的折叠参数粘贴。

data %>% group_by(Group) %>%
  summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "), 
            maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))

#3


3  

Here are some data.table approaches, the first one borrowed from @akrun:

以下是一些data.table方法,第一个从@akrun借来的方法:

setDT(data)

# show one, wide format
data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
   # Group min.Name min.Age max.Name max.Age
# 1:     A      Sam      22      Sam      22
# 2:     B    Sarah      31    James      58
# 3:     C   Andrew      17    Sally      82
# 4:     D   Mairin      12      Ray      67

# show all, long format
data[,{
  mina=min(Age)
  maxa=max(Age)
  rbind(
    data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
    data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
)},by=Group]
   # Group minmax Age   Name
# 1:     A    min  22    Sam
# 2:     A    max  22    Sam
# 3:     B    min  31  Sarah
# 4:     B    min  31    Jim
# 5:     B    max  58  James
# 6:     C    min  17 Andrew
# 7:     C    max  82  Sally
# 8:     D    min  12 Mairin
# 9:     D    max  67    Ray    

I think the long format is the best, since it allows you to filter with minmax, but the code is tortured and inefficient.

我认为长格式是最好的,因为它允许你用minmax过滤,但代码是折磨和低效。

Here are some arguably less good ways:

以下是一些不太好的方法:

# show all, wide format (with a list column)
data[,{
  mina=min(Age)
  maxa=max(Age)
  list(
    minAge=mina,
    maxAge=maxa,
    minNames=list(Name[Age==mina]),
    maxNames=list(Name[Age==maxa]))
},by=Group]
   # Group minAge maxAge  minNames maxNames
# 1:     A     22     22       Sam      Sam
# 2:     B     31     58 Sarah,Jim    James
# 3:     C     17     82    Andrew    Sally
# 4:     D     12     67    Mairin      Ray


# show all, wide format (with a string column)
# (just look at @shadow's answer)

#1


8  

I would actually recommend keeping your data in a "long" format. Here's how I would approach this:

我实际上建议您将数据保持为“长”格式。这是我如何处理这个问题:

library(dplyr)

Keeping all values when there are ties:

有关系时保留所有价值观:

data %>%
  group_by(Group) %>%
  arrange(Age) %>%  ## optional
  filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
# 
#     Name Age Group
# 1    Sam  22     A
# 2  Sarah  31     B
# 3    Jim  31     B
# 4  James  58     B
# 5 Andrew  17     C
# 6  Sally  82     C
# 7 Mairin  12     D
# 8    Ray  67     D

Keeping only one value when there are ties:

有关系时只保留一个值:

data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
# 
#     Name Age Group
# 1    Sam  22     A
# 2  Sarah  31     B
# 3  James  58     B
# 4 Andrew  17     C
# 5  Sally  82     C
# 6 Mairin  12     D
# 7    Ray  67     D

If you really want a "wide" dataset, the basic concept would be to gather and spread the data, using "tidyr":

如果你真的想要一个“广泛”的数据集,那么基本的概念就是使用“tidyr”来收集和传播数据:

library(dplyr)
library(tidyr)

data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(c(1, n())) %>%
  mutate(minmax = c("min", "max")) %>%
  gather(var, val, Name:Age) %>%
  unite(key, minmax, var) %>%
  spread(key, val)
# Source: local data frame [4 x 5]
# 
#   Group max_Age max_Name min_Age min_Name
# 1     A      22      Sam      22      Sam
# 2     B      58    James      31    Sarah
# 3     C      82    Sally      17   Andrew
# 4     D      67      Ray      12   Mairin

Though what wide form you would want with ties is unclear.

虽然你想要的关系是什么样的广泛形式尚不清楚。

#2


8  

You can use which.min and which.max to get the first value.

您可以使用which.min和which.max来获取第一个值。

data %>% group_by(Group) %>%
  summarize(minAge = min(Age), minAgeName = Name[which.min(Age)], 
            maxAge = max(Age), maxAgeName = Name[which.max(Age)])

To get all values, use e.g. paste with an appropriate collapse argument.

要获取所有值,请使用例如使用适当的折叠参数粘贴。

data %>% group_by(Group) %>%
  summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "), 
            maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))

#3


3  

Here are some data.table approaches, the first one borrowed from @akrun:

以下是一些data.table方法,第一个从@akrun借来的方法:

setDT(data)

# show one, wide format
data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
   # Group min.Name min.Age max.Name max.Age
# 1:     A      Sam      22      Sam      22
# 2:     B    Sarah      31    James      58
# 3:     C   Andrew      17    Sally      82
# 4:     D   Mairin      12      Ray      67

# show all, long format
data[,{
  mina=min(Age)
  maxa=max(Age)
  rbind(
    data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
    data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
)},by=Group]
   # Group minmax Age   Name
# 1:     A    min  22    Sam
# 2:     A    max  22    Sam
# 3:     B    min  31  Sarah
# 4:     B    min  31    Jim
# 5:     B    max  58  James
# 6:     C    min  17 Andrew
# 7:     C    max  82  Sally
# 8:     D    min  12 Mairin
# 9:     D    max  67    Ray    

I think the long format is the best, since it allows you to filter with minmax, but the code is tortured and inefficient.

我认为长格式是最好的,因为它允许你用minmax过滤,但代码是折磨和低效。

Here are some arguably less good ways:

以下是一些不太好的方法:

# show all, wide format (with a list column)
data[,{
  mina=min(Age)
  maxa=max(Age)
  list(
    minAge=mina,
    maxAge=maxa,
    minNames=list(Name[Age==mina]),
    maxNames=list(Name[Age==maxa]))
},by=Group]
   # Group minAge maxAge  minNames maxNames
# 1:     A     22     22       Sam      Sam
# 2:     B     31     58 Sarah,Jim    James
# 3:     C     17     82    Andrew    Sally
# 4:     D     12     67    Mairin      Ray


# show all, wide format (with a string column)
# (just look at @shadow's answer)