data.table: why is it not always possible to pass column names directly?

Time: 2022-09-01 21:26:22

I'm getting started with the data.table package (author/maintainer: Matt Dowle). Great package. I love that I can write dt[, x1] instead of, say, dt[, dt$x1] or df["x1"], for a data.table dt, a column name x1, and a data.frame df. Being able to pass column names directly is an attractive feature of data.table. But dispensing with quotes around a column name (writing x1 instead of "x1") is not always feasible. Why?

Programming Question: Are there any reasons why it is not always possible to pass a vector of column names directly to a data.table or to the helper functions provided by the package? For instance, the subset, merge, and melt functions have been rewritten for the data.table package, but while subset can handle column names directly, merge and melt cannot (see below).

To clarify, my question is not when or how but why. There are excellent related discussions with very useful tips, e.g. "Select / assign to data.table variables which names are stored in a character vector" and "r - passing variables as data.table column names". With these answers and a bit of trial and error, I'm able to find my way around the quote/unquote distinctions. My question is why it is not currently possible to always dispense with quotes around column names: is there a design to it? Is it a transitional situation? Are there programming difficulties?

Below, I give some examples and number the examples for clarity.

# load the package
library("data.table") # because I cannot do install.packages(data.table)!!

(i)

# make a data.table
set.seed(1)
dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5, 1, 5), 0), key = "id")

I can define the data.table with either id = 1:5 or "id" = 1:5, but I must define the key with key = "id", as key = id does not work:

dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5, 1, 5), 0), key = id)
##Error in data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5,  : 
##  object 'id' not found

You'd think finding 'id' should be rather easy for a key if it were looking for it among the column names? Would it be programmatically sound to be allowed to drop the quotes on the RHS of key?
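For what it's worth, the package does offer both conventions for keys once the table exists. A minimal sketch, assuming a reasonably recent data.table: setkey() accepts bare column names, while setkeyv() (like the key argument above) expects a character vector. The variable names key_bare and key_chr are only there for illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5, 1, 5), 0))

# setkey() takes its column arguments unquoted
setkey(dt, id)
key_bare <- key(dt)

# setkeyv() evaluates its argument normally, so it needs a character vector
setkeyv(dt, c("x1", "x2"))
key_chr <- key(dt)

print(key_bare)
print(key_chr)
```

So the unquoted form is available for keys, just through a different entry point than the key argument of data.table().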

(ii)

I can subset with a vector of columns or with a vector of column names:

subset(dt, select = c(x1, x3))
##   x1 x3
##1:  1  2
##2:  2  2
##3:  3  3
##4:  4  5
##5:  5  2

subset(dt, select = c("x1", "x3"))
##   x1 x3
##1:  1  2
##2:  2  2
##3:  3  3
##4:  4  5
##5:  5  2

Nice and flexible.

(iii)

I can merge with a vector of column names:

merge(dt, dt, by = c("x1", "x2"))
##   id x1 x2 x3
##1:  1  1  5  2
##2:  2  2  4  2
##3:  3  3  3  3
##4:  4  4  2  5
##5:  5  5  1  2

(silly example that was!) but not with a vector of the columns:

merge(dt, dt, by = c(x1, x2))
##Error in merge.data.table(dt, dt, by = c(x1, x2)) : object 'x1' not found

Is there something about merge that prevents it from accepting a vector of columns the way subset does?
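One practical consequence of the quoted form is that the by argument is evaluated like any ordinary R expression, so the column names can live in a variable. A small sketch of that, where dt2 and join_cols are names made up for the illustration:

```r
library(data.table)

dt  <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1)
dt2 <- data.table(x1 = 1:5, x2 = 5:1, y = letters[1:5])  # hypothetical second table

# The join columns are plain data: they could come from a function
# argument or a configuration file.
join_cols <- c("x1", "x2")
merged <- merge(dt, dt2, by = join_cols)

print(merged)
```

With an unquoted interface, by = join_cols would be ambiguous: it could mean a column called join_cols or the contents of the variable.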

(iv)

Likewise, melt must take quoted column names (or integers corresponding to the column numbers).

The help description is specific that melt accepts "character vectors", while the help for merge simply states "vectors of column names," but clearly with merge as with melt character vectors are expected.
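For completeness, here is what the quoted form looks like with melt, as a minimal sketch on a cut-down version of the table above (the name long is just for the example):

```r
library(data.table)

dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1)

# id.vars and measure.vars are given as character vectors
long <- melt(dt, id.vars = "id", measure.vars = c("x1", "x2"))

print(long)
```

The result has one row per (id, variable) pair, with the default column names variable and value.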

(v)

In the case of the j argument, quoting variable names is not usually the correct approach:

# Good:
dt[, .(x1, x2)]
##   x1 x2
##1:  1  5
##2:  2  4
##3:  3  3
##4:  4  2
##5:  5  1

# Bad 
dt[, .("x1", "x2")]
##   V1 V2
##1: x1 x2
# This feature is well documented in the FAQs
# FAQ 2.3: "I'm using c() in the j and getting strange results."

Note for readers unfamiliar with data.table: .() is shorthand for list(), and dt[, c(x1, x2)] is unlikely to be the desired command here -- the j argument of dt[i, j] very much expects a list.
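To make the difference between c() and .() in j concrete, here is a small sketch on a reduced table (the names as_vector and as_table are illustrative only):

```r
library(data.table)

dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1)

# c() concatenates the two columns into one plain vector of length 10
as_vector <- dt[, c(x1, x2)]

# .() is list(), so each element becomes a column of a new data.table
as_table <- dt[, .(x1, x2)]

print(as_vector)
print(as_table)
```

In both cases j is evaluated as an expression with the columns visible as variables, which is why the unquoted names work there.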

(vi)

However, within the j argument of dt[i, j], the LHS of the "assignment by reference" operator := has a confusing convention.

If the LHS is a single column, it may be passed without quotes. But if it has multiple columns, they must be passed as a vector of quoted column names. The manual only says "a vector of column names", but experimentation suggests they must be quoted:

# Good:
dt[, c("x1", "x2") := NULL][]
##   id x3
##1:  1  2
##2:  2  2
##3:  3  3
##4:  4  5
##5:  5  2

# Bad:
dt[, c(x1, x2) := NULL]
##Error in eval(expr, envir, enclos) : object 'x1' not found

The error message is not particularly enlightening. But now I remember the FAQ's advice, "If 2 or more columns are required, use list() or .() instead." Silly me, c(x1, x2) couldn't work because there is no way to tell where x1 ends and x2 starts. However, .(x1, x2) could work, couldn't it?

# Bad:
dt[, .(x1, x2) := NULL]
##Error in eval(expr, envir, enclos) : object 'x1' not found

No, all things considered, the LHS of := expects a vector of quoted column names. The manual ought to be updated or, if feasible, data.table extended to accept lists of unquoted columns on the LHS.

Oh wait. To delete multiple column names, can I pass a list of quoted names to LHS? No. Lists are usually desirable, but not on the LHS of :=. The error message is clear:

# Bad:
dt[, .("x1", "x2") := NULL][]
##Error in `[.data.table`(dt, , `:=`(.("x1", "x2"), NULL)) : 
##  LHS of := must be a symbol, or an atomic vector (column names or positions).
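Putting the working forms of := side by side may help. This is a sketch on a reduced table, assuming a reasonably recent data.table; the new column names and drop_cols are invented for the example.

```r
library(data.table)

dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1)

# Single column: a bare symbol on the LHS
dt[, y := x1 + x2]

# Several columns: a character vector on the LHS, a list on the RHS
dt[, c("a", "b") := .(x1 * 2L, x2 * 2L)]

# Functional form: several columns with unquoted names
dt[, `:=`(c1 = x1 - x2, c2 = x1 * x2)]

# Names held in a variable: wrap the variable in parentheses
drop_cols <- c("a", "b")
dt[, (drop_cols) := NULL]

print(names(dt))
```

The functional form is the closest thing to an unquoted multi-column LHS, since the names appear as argument names rather than as strings.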

(vii)

The i argument of dt[i] is also designed to accept unquoted columns, i.e. an "expression of column names":

dt[.(x1, x2)]
##   id x1 x2 x3 V2
##1:  1  1  5  2  5
##2:  2  2  4  2  4
##3:  3  3  3  3  3
##4:  4  4  2  5  2
##5:  5  5  1  2  1

Note that if the idea was to subset the two columns x1 and x2, that ought to be done inside the j argument, i.e. dt[, .(x1, x2)].

dt[.("x1", "x2")]
##Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
##  typeof x.id (integer) != typeof i.V1 (character)

dt[c(x1, x2)]
##    id x1 x2 x3
## 1:  1  1  5  2
## 2:  2  2  4  2
## 3:  3  3  3  3
## 4:  4  4  2  5
## 5:  5  5  1  2
## 6:  5  5  1  2
## 7:  4  4  2  5
## 8:  3  3  3  3
## 9:  2  2  4  2
##10:  1  1  5  2

dt[c("x1", "x2")]
##Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
##  typeof x.id (integer) != typeof i.V1 (character)

I have shown here several situations where columns must be passed as x1 or as "x1", and situations where both forms work. These differences can confuse new users like me. I suspect there is more than one reason for these two approaches to coexist. I'd appreciate it if someone could clarify the matter, for some of my examples if not for all of them.

1 answer

#1


5  

(i), (iii) and (iv) sound like Feature Requests (FRs); see here (so, yes, it's partly due to data.table not having reached full maturity).

As to (v), you said "dt[, c(x1, x2)] is unlikely to be the desired command here", but in fact I have seen situations where that sort of use of c within j is what I'm after. Situations like (v) are what the with argument of [.data.table is for.
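A minimal sketch of the with argument, on a reduced table and with illustrative names (cols, picked): setting with = FALSE tells [.data.table to treat j as a vector of column names rather than as an expression to evaluate inside the table.

```r
library(data.table)

dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1)

# Column names held in a variable
cols <- c("x1", "x2")

# with = FALSE: j is read as column names, data.frame style
picked <- dt[, cols, with = FALSE]

print(picked)
```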

On (vi) and elsewhere, you suggest "The manual only says 'a vector of column names', but experimentation suggests they must be quoted"; but I think this is unambiguous. A vector of column names means a character vector, which c(x1, x2) is not, unless x1 and x2 are somewhere defined as character vectors themselves. You can also add an FR for documentation on GitHub.
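The distinction can be illustrated in base R without any data.table code. The two helpers below, pick_nse and pick_se, are hypothetical functions written for this sketch and are not how data.table implements anything; pick_nse merely imitates the way base R's subset() treats its select argument.

```r
df <- data.frame(id = 1:3, x1 = 4:6, x2 = 7:9)

# Non-standard evaluation: capture the expression unevaluated, then
# evaluate it where each column name stands for its position.
pick_nse <- function(df, select) {
  positions <- as.list(seq_along(df))
  names(positions) <- names(df)
  idx <- eval(substitute(select), positions, parent.frame())
  df[, idx, drop = FALSE]
}

# Standard evaluation: the argument is evaluated in the caller's
# environment, where x1 and x2 do not exist as variables.
pick_se <- function(df, select) {
  df[, select, drop = FALSE]
}

nse_bare   <- pick_nse(df, c(x1, x2))      # works: names resolve to positions
nse_quoted <- pick_nse(df, c("x1", "x2"))  # also works: strings pass through
se_quoted  <- pick_se(df, c("x1", "x2"))   # works
se_bare    <- try(pick_se(df, c(x1, x2)), silent = TRUE)  # object 'x1' not found
```

A function written in the first style can accept both forms, as subset does in (ii); one written in the second style only understands character vectors (or positions), as in (iii).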

I'm not sure what you're after in (vii), but in i, vectors of names are used for joins or keyed subsets (also a form of join); see the vignette on fast subsetting.
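As a rough illustration of what i does with a list, here is a keyed subset and an ad hoc join using on, sketched on a reduced table and assuming a data.table version that supports the on argument. The names by_key and by_on are illustrative.

```r
library(data.table)

dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1, key = "id")

# Keyed subset: the values in .() are matched against the key column id
by_key <- dt[.(c(2L, 4L))]

# Ad hoc join: match the value against x2 instead of the key
by_on <- dt[.(5L), on = "x2"]

print(by_key)
print(by_on)
```

In other words, what goes inside .() in i is values to look up, not columns to select, which is why dt[.("x1", "x2")] fails with a type mismatch against the integer key.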
