How to add a count of unique values by group to an R data.frame

Time: 2022-01-10 22:46:12

I wish to count the number of unique values by grouping of a second variable, and then add the count to the existing data.frame as a new column. For example, if the existing data frame looks like this:

   color  type
1  black chair
2  black chair
3  black  sofa
4  green  sofa
5  green  sofa
6    red  sofa
7    red plate
8   blue  sofa
9   blue plate
10  blue chair

I want to add for each color, the count of unique types that are present in the data:

   color  type unique_types
1  black chair            2
2  black chair            2
3  black  sofa            2
4  green  sofa            1
5  green  sofa            1
6    red  sofa            2
7    red plate            2
8   blue  sofa            3
9   blue plate            3
10  blue chair            3
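For anyone who wants to try the answers, here is a minimal sketch that rebuilds the example data. The name `df` matches what the answers use; the `expected` vector is a helper name introduced here for checking results, and storing both columns as character (rather than factor) is an assumption.

```r
# Example data from the question; stringsAsFactors = FALSE keeps both
# columns as character vectors (an assumption, the question does not say).
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# Values of the desired unique_types column, copied from the table above.
expected <- c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 3L, 3L, 3L)

str(df)
```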

I was hoping to use ave, but can't seem to find a straightforward method that doesn't require many lines. I have >100,000 rows, so am also not sure how important efficiency is.

It's somewhat similar to this issue: Count number of observations/rows per group and add result to data frame

3 solutions

#1


40  

Using ave (since you ask for it specifically):

within(df, { count <- ave(type, color, FUN=function(x) length(unique(x)))})

Make sure that type is a character vector and not a factor.
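A small sketch of why the column type matters, assuming the question's data rebuilt as `df`: ave() writes the results of FUN back into a copy of its first argument, so with a character `type` the counts come back as strings. The second call, which passes row indices to ave() instead, is a variation of my own rather than part of the answer; it should return integers and does not depend on `type` being character.

```r
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# As in the answer: the result has the same type as `type` (character here),
# so the counts are stored as strings.
chr_count <- ave(df$type, df$color, FUN = function(x) length(unique(x)))
class(chr_count)

# Variation (not from the answer): group the row indices and look up `type`
# inside FUN, so the result stays integer.
int_count <- ave(seq_along(df$type), df$color,
                 FUN = function(i) length(unique(df$type[i])))
class(int_count)
```

If the character result is good enough, wrapping it in as.integer() gives the same numbers.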


Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.

require(data.table)
setDT(df)[, count := uniqueN(type), by = color]  # v1.9.6+
# if you don't want df to be modified by reference
ans = as.data.table(df)[, count := uniqueN(type), by = color]

uniqueN was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). In addition it also works with data.frames/data.tables.


Other solutions:

Using plyr:

require(plyr)
ddply(df, .(color), mutate, count = length(unique(type)))

Using aggregate:

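agg <- aggregate(type ~ color, data = df, FUN = function(x) length(unique(x)))
# rename the aggregated column so merge() does not produce type.x / type.y
names(agg)[names(agg) == "type"] <- "count"
merge(df, agg, by = "color", all = TRUE)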
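One thing to watch with the aggregate approach: merge() sorts the result by the key, so the rows no longer come back in the original order. Below is a rough sketch, assuming the question's data rebuilt as `df`, that shows the reordering and then attaches the counts with match() instead, which keeps the original row order. The match() lookup is my own substitution for the merge step, not part of the answer.

```r
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# One row per color with the number of distinct types.
agg <- aggregate(type ~ color, data = df, FUN = function(x) length(unique(x)))
names(agg)[names(agg) == "type"] <- "count"

# merge() returns the rows sorted by color, not in the original order.
merged <- merge(df, agg, by = "color", all = TRUE)
merged$color

# Alternative: look up each row's color in agg, preserving row order.
df$count <- agg$count[match(df$color, agg$color)]
df
```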

#2


32  

Here's a solution with the dplyr package - it has n_distinct() as a wrapper for length(unique()).

library(dplyr)
df %>%
  group_by(color) %>%
  mutate(unique_types = n_distinct(type))

#3


4  

This can also be achieved in a vectorized way, without by-group operations, by combining unique with table or tabulate.

If df$color is factor, then

Either

table(unique(df)$color)[as.character(df$color)]
# black black black green green   red   red  blue  blue  blue 
#     2     2     2     1     1     2     2     3     3     3 

Or

tabulate(unique(df)$color)[as.integer(df$color)]
# [1] 2 2 2 1 1 2 2 3 3 3

If df$color is character then just

table(unique(df)$color)[df$color]

If df$color is an integer then just

tabulate(unique(df)$color)[df$color]
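A caveat that seems worth spelling out: unique(df) removes duplicates across all columns, so the calls above rely on df containing only color and type. The sketch below assumes a character color column and adds a hypothetical `id` column (not in the question) to show the difference; restricting unique() to the two relevant columns should give the intended counts.

```r
df <- data.frame(
  id    = 1:10,  # hypothetical extra column, only here to illustrate the caveat
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# With the extra column every row is unique, so this counts rows per color.
naive <- as.vector(table(unique(df)$color)[df$color])

# Deduplicate on color/type only, then count rows per color and
# index the table by each row's color.
pairs  <- unique(df[c("color", "type")])
counts <- as.vector(table(pairs$color)[df$color])

df$unique_types <- counts
df
```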
