Create a new column in a data.table by group

Time: 2021-05-18 23:22:09

I have no experience with data.table, so I don't know if there is a solution to my question (30 minutes on Google gave no answer at least), but here it goes.

With data.frame I often use the following command to check the number of observations of a unique value:

df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))  

Is there any corresponding method when working with data.table?

1 solution

#1


Yes, there is. Happily, you've asked about one of the newest features of data.table, added in v1.8.2:

:= by group is now implemented (FR#1491) and sub-assigning to a new column by reference now adds the column automatically (initialized with NA where the sub-assign doesn't touch) (FR#1997). := by group can be combined with all types of i, so := by group includes grouping by i as well as by by. Since := by group is by reference, it should be significantly faster than any method that (directly or indirectly) cbinds the grouped results to DT, since no copy of the (large) DT is made at all. It's a short and natural syntax that can be compounded with other queries.
DT[,newcol:=sum(colB),by=colA]
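
As a minimal sketch of the syntax quoted above: the toy table and its values are made up for illustration, and the flag column is a hypothetical name. Only colA, colB and newcol come from the NEWS entry.

```r
library(data.table)

# Hypothetical toy data; the column names follow the NEWS example above.
DT <- data.table(colA = c("a", "a", "b"), colB = c(1, 2, 10))

# Add newcol by reference: each row receives the sum of colB
# within its colA group. No copy of DT is made.
DT[, newcol := sum(colB), by = colA]

# Sub-assigning to a column that does not exist yet creates it,
# leaving NA in the rows that i does not select.
DT[colA == "b", flag := TRUE]

print(DT)
```

Because the assignment happens by reference, there is no need to write `DT <- DT[...]`; the table is modified in place.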

In your example, if I understand correctly, it should be something like:

DT[, Obs:=.N, by=ID-Date]

instead of:

df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
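
One caveat worth checking on your own data: .N counts the rows in each group, while length(unique(x)) counts the distinct values, so the two agree only when v1 has no repeated values within a group. The sketch below compares them on made-up data. It assumes that ID-Date stands for two separate columns, ID and Date, which the question does not confirm; the Rows column is a hypothetical name.

```r
library(data.table)

# Hypothetical data: ID, Date and v1 are assumed to be separate columns.
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  Date = c("d1", "d1", "d1", "d1", "d2"),
  v1   = c(5, 5, 7, 3, 3)
)

# data.frame approach from the question:
# number of distinct v1 values per ID/Date group.
df$Obs <- with(df, ave(v1, ID, Date, FUN = function(x) length(unique(x))))

DT <- as.data.table(df[, c("ID", "Date", "v1")])

# .N is the number of rows in each group.
DT[, Rows := .N, by = list(ID, Date)]

# length(unique(v1)) is the number of distinct v1 values in each group,
# which is what the ave() call above computes.
DT[, Obs := length(unique(v1)), by = list(ID, Date)]

print(DT)
```

If you want the row count, keep .N; if you want the distinct count that the ave() call returns, use the length(unique(v1)) form.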

Note that := by group scales well for large data sets (and for smaller data sets with a lot of small groups).

See ?":=" and search the data.table tag for "reference".
