I have the following data:
我有以下数据:
Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,12,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
data <- data.frame(Name, Age, Group)
And I'd like to use dplyr to
我想用dplyr来
(1) group the data by "Group" (2) show the min and max Age within each Group (3) show the Name of the person with the min and max ages
(1)按“组”分组数据(2)显示每组内的最小和最大年龄(3)显示最小和最大年龄的人的姓名
The following code does this:
以下代码执行此操作:
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Which works well:
哪个效果很好:
Group minAge minAgeName maxAge maxAgeName
1 A 22 Sam 22 Sam
2 B 12 Sarah 58 James
3 C 17 Andrew 82 Sally
4 D 12 Mairin 67 Ray
However, I have a problem if there are multiple min or max values:
但是,如果有多个最小值或最大值,我会遇到问题:
Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,31,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
data <- data.frame(Name, Age, Group)
> data %>% group_by(Group) %>%
+ summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
+ maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Error: expecting a single value
I'm looking for two solutions:
我正在寻找两种解决方案:
(1) where it doesn't matter which min or max name is shown, just that one is shown (i.e., the first value found) (2) where if there are "ties" all minimum values and maximum values are shown
(1)如果显示哪个最小或最大名称无关紧要,只显示一个(即找到的第一个值)(2)如果存在“关系”,则显示所有最小值和最大值
Please let me know if this isn't clear and thanks in advance!
如果不清楚请提前告知我们并提前致谢!
3 个解决方案
#1
8
I would actually recommend keeping your data in a "long" format. Here's how I would approach this:
我实际上建议您将数据保持为“长”格式。这是我如何处理这个问题:
library(dplyr)
Keeping all values when there are ties:
有关系时保留所有价值观:
data %>%
group_by(Group) %>%
arrange(Age) %>% ## optional
filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 Jim 31 B
# 4 James 58 B
# 5 Andrew 17 C
# 6 Sally 82 C
# 7 Mairin 12 D
# 8 Ray 67 D
Keeping only one value when there are ties:
有关系时只保留一个值:
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 James 58 B
# 4 Andrew 17 C
# 5 Sally 82 C
# 6 Mairin 12 D
# 7 Ray 67 D
If you really want a "wide" dataset, the basic concept would be to gather
and spread
the data, using "tidyr":
如果你真的想要一个“广泛”的数据集,那么基本的概念就是使用“tidyr”来收集和传播数据:
library(dplyr)
library(tidyr)
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(c(1, n())) %>%
mutate(minmax = c("min", "max")) %>%
gather(var, val, Name:Age) %>%
unite(key, minmax, var) %>%
spread(key, val)
# Source: local data frame [4 x 5]
#
# Group max_Age max_Name min_Age min_Name
# 1 A 22 Sam 22 Sam
# 2 B 58 James 31 Sarah
# 3 C 82 Sally 17 Andrew
# 4 D 67 Ray 12 Mairin
Though what wide form you would want with ties is unclear.
虽然你想要的关系是什么样的广泛形式尚不清楚。
#2
8
You can use which.min
and which.max
to get the first value.
您可以使用which.min和which.max来获取第一个值。
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = Name[which.min(Age)],
maxAge = max(Age), maxAgeName = Name[which.max(Age)])
To get all values, use e.g. paste with an appropriate collapse
argument.
要获取所有值,请使用例如使用适当的折叠参数粘贴。
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "),
maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))
#3
3
Here are some data.table
approaches, the first one borrowed from @akrun:
以下是一些data.table方法,第一个从@akrun借来的方法:
setDT(data)
# show one, wide format
data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
# Group min.Name min.Age max.Name max.Age
# 1: A Sam 22 Sam 22
# 2: B Sarah 31 James 58
# 3: C Andrew 17 Sally 82
# 4: D Mairin 12 Ray 67
# show all, long format
data[,{
mina=min(Age)
maxa=max(Age)
rbind(
data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
)},by=Group]
# Group minmax Age Name
# 1: A min 22 Sam
# 2: A max 22 Sam
# 3: B min 31 Sarah
# 4: B min 31 Jim
# 5: B max 58 James
# 6: C min 17 Andrew
# 7: C max 82 Sally
# 8: D min 12 Mairin
# 9: D max 67 Ray
I think the long format is the best, since it allows you to filter with minmax
, but the code is tortured and inefficient.
我认为长格式是最好的,因为它允许你用minmax过滤,但代码是折磨和低效。
Here are some arguably less good ways:
以下是一些不太好的方法:
# show all, wide format (with a list column)
data[,{
mina=min(Age)
maxa=max(Age)
list(
minAge=mina,
maxAge=maxa,
minNames=list(Name[Age==mina]),
maxNames=list(Name[Age==maxa]))
},by=Group]
# Group minAge maxAge minNames maxNames
# 1: A 22 22 Sam Sam
# 2: B 31 58 Sarah,Jim James
# 3: C 17 82 Andrew Sally
# 4: D 12 67 Mairin Ray
# show all, wide format (with a string column)
# (just look at @shadow's answer)
#1
8
I would actually recommend keeping your data in a "long" format. Here's how I would approach this:
我实际上建议您将数据保持为“长”格式。这是我如何处理这个问题:
library(dplyr)
Keeping all values when there are ties:
有关系时保留所有价值观:
data %>%
group_by(Group) %>%
arrange(Age) %>% ## optional
filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 Jim 31 B
# 4 James 58 B
# 5 Andrew 17 C
# 6 Sally 82 C
# 7 Mairin 12 D
# 8 Ray 67 D
Keeping only one value when there are ties:
有关系时只保留一个值:
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 James 58 B
# 4 Andrew 17 C
# 5 Sally 82 C
# 6 Mairin 12 D
# 7 Ray 67 D
If you really want a "wide" dataset, the basic concept would be to gather
and spread
the data, using "tidyr":
如果你真的想要一个“广泛”的数据集,那么基本的概念就是使用“tidyr”来收集和传播数据:
library(dplyr)
library(tidyr)
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(c(1, n())) %>%
mutate(minmax = c("min", "max")) %>%
gather(var, val, Name:Age) %>%
unite(key, minmax, var) %>%
spread(key, val)
# Source: local data frame [4 x 5]
#
# Group max_Age max_Name min_Age min_Name
# 1 A 22 Sam 22 Sam
# 2 B 58 James 31 Sarah
# 3 C 82 Sally 17 Andrew
# 4 D 67 Ray 12 Mairin
Though what wide form you would want with ties is unclear.
虽然你想要的关系是什么样的广泛形式尚不清楚。
#2
8
You can use which.min
and which.max
to get the first value.
您可以使用which.min和which.max来获取第一个值。
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = Name[which.min(Age)],
maxAge = max(Age), maxAgeName = Name[which.max(Age)])
To get all values, use e.g. paste with an appropriate collapse
argument.
要获取所有值,请使用例如使用适当的折叠参数粘贴。
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "),
maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))
#3
3
Here are some data.table
approaches, the first one borrowed from @akrun:
以下是一些data.table方法,第一个从@akrun借来的方法:
setDT(data)
# show one, wide format
data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
# Group min.Name min.Age max.Name max.Age
# 1: A Sam 22 Sam 22
# 2: B Sarah 31 James 58
# 3: C Andrew 17 Sally 82
# 4: D Mairin 12 Ray 67
# show all, long format
data[,{
mina=min(Age)
maxa=max(Age)
rbind(
data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
)},by=Group]
# Group minmax Age Name
# 1: A min 22 Sam
# 2: A max 22 Sam
# 3: B min 31 Sarah
# 4: B min 31 Jim
# 5: B max 58 James
# 6: C min 17 Andrew
# 7: C max 82 Sally
# 8: D min 12 Mairin
# 9: D max 67 Ray
I think the long format is the best, since it allows you to filter with minmax
, but the code is tortured and inefficient.
我认为长格式是最好的,因为它允许你用minmax过滤,但代码是折磨和低效。
Here are some arguably less good ways:
以下是一些不太好的方法:
# show all, wide format (with a list column)
data[,{
mina=min(Age)
maxa=max(Age)
list(
minAge=mina,
maxAge=maxa,
minNames=list(Name[Age==mina]),
maxNames=list(Name[Age==maxa]))
},by=Group]
# Group minAge maxAge minNames maxNames
# 1: A 22 22 Sam Sam
# 2: B 31 58 Sarah,Jim James
# 3: C 17 82 Andrew Sally
# 4: D 12 67 Mairin Ray
# show all, wide format (with a string column)
# (just look at @shadow's answer)