Designing high-dimensional data structures in R and MATLAB

Posted: 2020-12-15 21:37:48

Question

What is the right way to structure multivariate data with categorical labels accumulated over repeated trials for exploratory analysis in R? I don't want to slip back to MATLAB.

Explanation

I like R's analysis functions and syntax (and stunning plots) much better than MATLAB's, and have been working hard to refactor my stuff over. However, I keep getting hung up on the way data is organized in my work.

MATLAB

It's typical for me to work with multivariate time series repeated over many trials, which are stored in a big multidimensional array (a rank-3 tensor, if you like) of SERIESxSAMPLESxTRIALS. This lends itself to some nice linear algebra stuff occasionally, but is clumsy when it comes to another variable, namely CLASS. Typically class labels are stored in another vector of dimension 1xTRIALS.

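For concreteness, a bare-bones sketch of that layout carried straight over to R might look like this (the sizes and names here are made up purely for illustration):

# hypothetical sizes, just for illustration
n.series  <- 4
n.samples <- 100
n.trials  <- 20

# SERIES x SAMPLES x TRIALS, same shape as the MATLAB array
dat <- array(rnorm(n.series * n.samples * n.trials),
             dim = c(n.series, n.samples, n.trials))

# class labels kept in a separate vector, one label per trial
class.labs <- sample(c("A", "B"), n.trials, replace = TRUE)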

When it comes to analysis I basically plot as little as possible, because in MATLAB it takes so much work to put together a really good plot that teaches you a lot about the data. (I'm not the only one who feels this way).

R

In R I've been sticking as close as I can to the MATLAB structure, but things get annoyingly complex when trying to keep the class labeling separate; I'd have to keep passing the labels into functions even though I'm only using their attributes. So what I've done is separate the array into a list of arrays by CLASS. This adds complexity to all of my apply() functions, but seems to be worth it in terms of keeping things consistent (and keeping bugs out).

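Roughly, that workaround looks like the sketch below (continuing the toy objects from above, with illustrative names only):

# split the TRIALS dimension into a list of arrays, one per class
dat.by.class <- lapply(split(seq_len(n.trials), class.labs),
                       function(idx) dat[, , idx, drop = FALSE])

# every per-class computation now needs an extra lapply() layer,
# e.g. the mean of each series within each class
lapply(dat.by.class, function(a) apply(a, 1, mean))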

On the other hand, R just doesn't seem to be friendly with tensors/multidimensional arrays. Just to work with them, you need to grab the abind library. Documentation on multivariate analysis, like this example, seems to operate under the assumption that you have a huge 2-D table of data points (some long medieval scroll of a data frame), and doesn't mention how to get 'there' from where I am.

Once I get to plotting and classifying the processed data, it's not such a big problem, since by then I've worked my way down to data frame-friendly structures with shapes like TRIALSxFEATURES (melt has helped a lot with this). On the other hand, if I want to quickly generate a scatterplot matrix or latticist histogram set for the exploratory phase (i.e. statistical moments, separation, in/between-class variance, histograms, etc.), I have to stop and figure out how I'm going to apply() these huge multidimensional arrays into something those libraries understand.

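As a rough sketch of that step on the toy data above (the feature names are just placeholders):

library(reshape2)

# collapse each trial to a few summary features: TRIALS x FEATURES
features <- t(apply(dat, 3, function(trial) c(mean = mean(trial),
                                              sd   = sd(trial))))
feat.df <- data.frame(trial = seq_len(n.trials),
                      class = class.labs,
                      features)

# melt to the long format that ggplot2/lattice are happy with
feat.long <- melt(feat.df, id.vars = c("trial", "class"))
head(feat.long)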

If I keep pounding around in the jungle coming up with ad-hoc solutions for this, I'm either never going to get better or I'll end up with my own weird wizardly ways of doing it that don't make sense to anybody.

So what's the right way to structure multivariate data with categorical labels accumulated over repeated trials for exploratory analysis in R? Please, I don't want to slip back to MATLAB.

Bonus: I tend to repeat these analyses over identical data structures for multiple subjects. Is there a better general way than wrapping the code chunks into for loops?

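(To sketch what I mean, with made-up names: the same single-subject pipeline wrapped as a function and mapped over a named list of subjects.)

# hypothetical: one array + label vector per subject, identical shapes
subjects <- list(subj01 = list(dat = dat, class = class.labs),
                 subj02 = list(dat = dat, class = class.labs))

analyse.one <- function(s) {
  # ... the whole single-subject analysis would go here ...
  data.frame(grand.mean = mean(s$dat))
}

results <- lapply(subjects, analyse.one)
do.call(rbind, results)  # one row of results per subject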

2 Answers

#1


9  

Maybe dplyr::tbl_cube ?

Following on from @BrodieG's excellent answer (#2 below), I think you may find it useful to look at the new functionality available from dplyr::tbl_cube. This is essentially a multidimensional object that you can easily create from a list of arrays (as you're currently using), and it has some really good functions for subsetting, filtering and summarizing which (importantly, I think) are used consistently across the "cube" view and "tabular" view of the data.

require(dplyr)

Couple of caveats:

It's an early release: all the issues that go along with that
It's recommended for this version to unload plyr when dplyr is loaded

Loading arrays into cubes

Here's an example using arr as defined in the other answer:

# using arr from previous example
# we can convert it simply into a tbl_cube
arr.cube<-as.tbl_cube(arr)

arr.cube  
#Source: local array [24 x 3]  
#D: ser [chr, 3]  
#D: smp [chr, 2]  
#D: tr [chr, 4]  
#M: arr [dbl[3,2,4]]

So note that D means Dimensions and M Measures, and you can have as many as you like of each.

Easy conversion from multi-dimensional to flat

You can easily make the data tabular by returning it as a data.frame (which you can simply convert to a data.table if you need the functionality and performance benefits later).

head(as.data.frame(arr.cube))
#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 2 smp 1 tr 1 0.6181301
#3 ser 3 smp 1 tr 1 0.7335676
#4 ser 1 smp 2 tr 1 0.9444435
#5 ser 2 smp 2 tr 1 0.8977054
#6 ser 3 smp 2 tr 1 0.9361929

Subsetting

You could obviously flatten all data for every operation, but that has many implications for performance and utility. I think the real benefit of this package is that you can "pre-mine" the cube for the data that you require before converting it into a tabular format that is ggplot-friendly, e.g. simple filtering to return only series 1:

arr.cube.filtered<-filter(arr.cube,ser=="ser 1")
as.data.frame(arr.cube.filtered)
#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 1 smp 2 tr 1 0.9444435
#3 ser 1 smp 1 tr 2 0.4331116
#4 ser 1 smp 2 tr 2 0.3916376
#5 ser 1 smp 1 tr 3 0.4669228
#6 ser 1 smp 2 tr 3 0.8942300
#7 ser 1 smp 1 tr 4 0.2054326
#8 ser 1 smp 2 tr 4 0.1006973

tbl_cube currently works with the dplyr functions summarise(), select(), group_by() and filter(). Usefully you can chain these together with the %.% operator.

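(Note that %.% belongs to those early dplyr releases and was later replaced by the %>% pipe; in recent dplyr versions the tbl_cube class has, as far as I know, been moved out into the separate cubelyr package. Assuming a version where the tbl_cube methods are still available, the same kind of chain reads like this:)

# assuming filter()/as.data.frame() have tbl_cube methods available
arr.cube %>%
  filter(ser == "ser 1") %>%   # same subsetting as above
  as.data.frame() %>%          # flatten for inspection
  head()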

For the rest of the examples, I'm going to use the inbuilt nasa tbl_cube object, which has a bunch of meteorological data (and demonstrates multiple dimensions and measures):

Grouping and summary measures

nasa
#Source: local array [41,472 x 4]
#D: lat [dbl, 24]
#D: long [dbl, 24]
#D: month [int, 12]
#D: year [int, 6]
#M: cloudhigh [dbl[24,24,12,6]]
#M: cloudlow [dbl[24,24,12,6]]
#M: cloudmid [dbl[24,24,12,6]]
#M: ozone [dbl[24,24,12,6]]
#M: pressure [dbl[24,24,12,6]]
#M: surftemp [dbl[24,24,12,6]]
#M: temperature [dbl[24,24,12,6]]

So here is an example showing how easy it is to pull back a subset of modified data from the cube, and then flatten it so that it's appropriate for plotting:

plot_data<-as.data.frame(          # as.data.frame so we can see the data
filter(nasa,long<(-70)) %.%        # filter long < (-70) (arbitrary!)
group_by(lat,long) %.%             # group by lat/long combo
summarise(p.max=max(pressure),     # create summary measures for each group
          o.avg=mean(ozone),
          c.all=(cloudhigh+cloudlow+cloudmid)/3)
)

head(plot_data)

#       lat   long p.max    o.avg    c.all
#1 36.20000 -113.8   975 310.7778 22.66667
#2 33.70435 -113.8   975 307.0833 21.33333
#3 31.20870 -113.8   990 300.3056 19.50000
#4 28.71304 -113.8  1000 290.3056 16.00000
#5 26.21739 -113.8  1000 282.4167 14.66667
#6 23.72174 -113.8  1000 275.6111 15.83333

Consistent notation for n-d and 2-d data structures

Sadly the mutate() function isn't yet implemented for tbl_cube but looks like that will just be a matter of (not much) time. You can use it (and all the other functions that work on the cube) on the tabular result, though - with exactly the same notation. For example:

plot_data.mod<-filter(plot_data,lat>25) %.%    # filter out lat <=25
mutate(arb.meas=o.avg/p.max)                   # make a new column

head(plot_data.mod)

#       lat      long p.max    o.avg    c.all  arb.meas
#1 36.20000 -113.8000   975 310.7778 22.66667 0.3187464
#2 33.70435 -113.8000   975 307.0833 21.33333 0.3149573
#3 31.20870 -113.8000   990 300.3056 19.50000 0.3033389
#4 28.71304 -113.8000  1000 290.3056 16.00000 0.2903056
#5 26.21739 -113.8000  1000 282.4167 14.66667 0.2824167
#6 36.20000 -111.2957   930 313.9722 20.66667 0.3376045

Plotting - as an example of R functionality that "likes" flat data

Then you can plot with ggplot() using the benefits of flattened data:

# plot as you like:
ggplot(plot_data.mod) +
  geom_point(aes(lat,long,size=c.all,color=c.all,shape=cut(p.max,6))) +
  facet_grid( lat ~ long ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

(plot output omitted)

Using data.table on the resulting flat data

I'm not going to expand on the use of data.table here, as it's covered well in the other answer (#2 below). Obviously there are many good reasons to use data.table - for any situation here you can return one by a simple conversion of the data.frame:

data.table(as.data.frame(your_cube_name))

Working dynamically with your cube

Another thing I think is great is the ability to add measures (slices / scenarios / shifts, whatever you want to call them) to your cube. I think this will fit well with the method of analysis described in the question. Here's a simple example with arr.cube - adding an additional measure which is itself an (admittedly simple) function of the previous measure. You access/update measures through the syntax yourcube$mets[$...]

head(as.data.frame(arr.cube))

#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 2 smp 1 tr 1 0.6181301
#3 ser 3 smp 1 tr 1 0.7335676
#4 ser 1 smp 2 tr 1 0.9444435
#5 ser 2 smp 2 tr 1 0.8977054
#6 ser 3 smp 2 tr 1 0.9361929

arr.cube$mets$arr.bump<-arr.cube$mets$arr*1.1  #arb modification!

head(as.data.frame(arr.cube))

#    ser   smp   tr       arr  arr.bump
#1 ser 1 smp 1 tr 1 0.6656456 0.7322102
#2 ser 2 smp 1 tr 1 0.6181301 0.6799431
#3 ser 3 smp 1 tr 1 0.7335676 0.8069244
#4 ser 1 smp 2 tr 1 0.9444435 1.0388878
#5 ser 2 smp 2 tr 1 0.8977054 0.9874759
#6 ser 3 smp 2 tr 1 0.9361929 1.0298122

Dimensions - or not ...

I've played a little with trying to dynamically add entirely new dimensions (effectively scaling up an existing cube with additional dimensions and cloning or modifying the original data using yourcube$dims[$...]) but have found the behaviour to be a little inconsistent. Probably best to avoid this anyway, and structure your cube first before manipulating it. Will keep you posted if I get anywhere.

Persistence

Obviously one of the main issues with having interpreter access to a multidimensional database is the potential to accidentally bugger it with an ill-timed keystroke. So I guess just persist early and often:

tempfilename<-gsub("[ :-]","",paste0("DBX",(Sys.time()),".cub"))
# save:
save(arr.cube,file=tempfilename)
# load:
load(file=tempfilename)
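
An alternative sketch is saveRDS()/readRDS(), which store a single object and let you restore it under whatever name you like:

# single-object save/restore, no risk of clobbering other objects on load()
saveRDS(arr.cube, file = "arr_cube.rds")
arr.cube.restored <- readRDS("arr_cube.rds")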

Hope that helps!

#2


19  

As has been pointed out, many of the more powerful analytical and visualization tools rely on data in long format. Certainly for transformations that benefit from matrix algebra you should keep stuff in arrays, but as soon as you want to run parallel analysis on subsets of your data, or plot stuff by factors in your data, you really want to melt.

Here is an example to get you started with data.table and ggplot.

Array -> Data Table

First, let's make some data in your format:

series <- 3
samples <- 2
trials <- 4

trial.labs <- paste("tr", seq(len=trials))
trial.class <- sample(c("A", "B"), trials, rep=T)

arr <- array(
  runif(series * samples * trials), 
  dim=c(series, samples, trials),
  dimnames=list(
    ser=paste("ser", seq(len=series)), 
    smp=paste("smp", seq(len=samples)), 
    tr=trial.labs
  )
)
# , , tr = tr 1
#        smp
# ser         smp 1     smp 2
#   ser 1 0.9648542 0.4134501
#   ser 2 0.7285704 0.1393077
#   ser 3 0.3142587 0.1012979
#
# ... omitted 2 trials ...
# 
# , , tr = tr 4
#        smp
# ser         smp 1     smp 2
#   ser 1 0.5867905 0.5160964
#   ser 2 0.2432201 0.7702306
#   ser 3 0.2671743 0.8568685

Now we have a 3 dimensional array. Let's melt and turn it into a data.table (note melt operates on data.frames, which are basically data.tables sans bells & whistles, so we have to first melt, then convert to data.table):

library(reshape2)
library(data.table)

dt.raw <- data.table(melt(arr), key="tr")  # we'll get to what the `key` arg is doing later
#       ser   smp   tr      value
#  1: ser 1 smp 1 tr 1 0.53178276
#  2: ser 2 smp 1 tr 1 0.28574271
#  3: ser 3 smp 1 tr 1 0.62991366
#  4: ser 1 smp 2 tr 1 0.31073376
#  5: ser 2 smp 2 tr 1 0.36098971
# ---                            
# 20: ser 2 smp 1 tr 4 0.38049334
# 21: ser 3 smp 1 tr 4 0.14170226
# 22: ser 1 smp 2 tr 4 0.63719962
# 23: ser 2 smp 2 tr 4 0.07100314
# 24: ser 3 smp 2 tr 4 0.11864134

Notice how easy this was, with all our dimension labels trickling through to the long format. One of the bells & whistles of data.tables is the ability to do indexed merges between data.tables (much like MySQL indexed joins). So here, we will do that to bind the class to our data:

dt <- dt.raw[J(trial.labs, class=trial.class)]  # on the fly mapping of trials to class
#       tr   ser   smp     value class
#  1: tr 1 ser 1 smp 1 0.9648542     A
#  2: tr 1 ser 2 smp 1 0.7285704     A
#  3: tr 1 ser 3 smp 1 0.3142587     A
#  4: tr 1 ser 1 smp 2 0.4134501     A
#  5: tr 1 ser 2 smp 2 0.1393077     A
# ---                                 
# 20: tr 4 ser 2 smp 1 0.2432201     A
# 21: tr 4 ser 3 smp 1 0.2671743     A
# 22: tr 4 ser 1 smp 2 0.5160964     A
# 23: tr 4 ser 2 smp 2 0.7702306     A
# 24: tr 4 ser 3 smp 2 0.8568685     A

A few things to understand:

  1. J creates a data.table from vectors
  2. attempting to subset the rows of one data.table with another data.table (i.e. using a data.table as the first argument after the brace in [.data.table) causes data.table to left join (in MySQL parlance) the outer table (dt in this case) to the inner table (the one created on the fly by J). The join is done on the key column(s) of the outer data.table, which as you may have noticed we defined in the melt/data.table conversion step earlier.

You'll have to read the documentation to fully understand what's going on, but think of J(trial.labs, class=trial.class) as being effectively equivalent to creating a data.table with data.table(trial.labs, class=trial.class), except that J only works when used inside [.data.table.

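(In more recent data.table versions you can also skip the key bookkeeping and state the join column explicitly with the on= argument; a rough equivalent of the step above:)

# same class mapping, written with an explicit join column instead of keys
class.map <- data.table(tr = trial.labs, class = trial.class)
dt.alt <- dt.raw[class.map, on = "tr"]  # attach class by joining on tr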

So now, in one easy step we have our class data attached to the values. Again, if you need matrix algebra, operate on your array first, and then in two or three easy commands switch back to the long format. As noted in the comments, you probably don't want to be going back and forth from the long to array formats unless you have a really good reason to be doing so.

Once things are in data.table, you can group/aggregate your data (similar to the concept of split-apply-combine style) quite easily. Suppose we want to get summary statistics for each class-sample combination:

dt[, as.list(summary(value)), by=list(class, smp)]

#    class   smp    Min. 1st Qu. Median   Mean 3rd Qu.   Max.
# 1:     A smp 1 0.08324  0.2537 0.3143 0.4708  0.7286 0.9649
# 2:     A smp 2 0.10130  0.1609 0.5161 0.4749  0.6894 0.8569
# 3:     B smp 1 0.14050  0.3089 0.4773 0.5049  0.6872 0.8970
# 4:     B smp 2 0.08294  0.1196 0.1562 0.3818  0.5313 0.9063

Here, we just give data.table an expression (as.list(summary(value))) to evaluate for every class, smp subset of the data (as specified in the by expression). We need as.list so that the results are re-assembled by data.table as columns.

You could just as easily have calculated moments (e.g. list(mean(value), var(value), (value - mean(value))^3)) for any combination of the class/sample/trial/series variables.

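For example, a per-class, per-sample version of those moments (collapsing the cubed deviations with a mean so each group gives one row) could look like:

dt[, list(mean = mean(value),
          var  = var(value),
          m3   = mean((value - mean(value))^3)),  # third central moment
   by = list(class, smp)]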

If you want to do simple transformations to the data it is very easy with data.table:

dt[, value:=value * 10]  # modify in place with `:=`, very efficient
dt[1:2]                  # see, `value` now 10x    
#       tr   ser   smp    value class
# 1: tr 1 ser 1 smp 1 9.648542     A
# 2: tr 1 ser 2 smp 1 7.285704     A

This is an in-place transformation, so there are no memory copies, which makes it fast. Generally data.table tries to use memory as efficiently as possible and as such is one of the fastest ways to do this type of analysis.

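The := assignment also works per group, so derived columns (here an illustrative within-class z-score) can be added the same way, still without copying:

dt[, value.z := (value - mean(value)) / sd(value), by = class]  # grouped, in place
head(dt, 3)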

Plotting From Long Format

ggplot is fantastic for plotting data in long format. I won't get into the details of what's happening, but hopefully the images will give you an idea of what you can do.

library(ggplot2)
ggplot(data=dt, aes(x=ser, y=smp, color=class, size=value)) + 
  geom_point() +
  facet_wrap( ~ tr)

(plot output omitted)

ggplot(data=dt, aes(x=tr, y=value, fill=class)) + 
  geom_bar(stat="identity") +
  facet_grid(smp ~ ser)

(plot output omitted)

ggplot(data=dt, aes(x=tr, y=paste(ser, smp))) + 
  geom_tile(aes(fill=value)) + 
  geom_point(aes(shape=class), size=5) + 
  scale_fill_gradient2(low="yellow", high="blue", midpoint=median(dt$value))

(plot output omitted)

Data Table -> Array -> Data Table

First we need to acast (from package reshape2) our data table back to an array:

arr.2 <- acast(dt, ser ~ smp ~ tr, value.var="value")
dimnames(arr.2) <- dimnames(arr)  # unfortunately `acast` doesn't preserve dimnames properly
# , , tr = tr 1
#        smp
# ser        smp 1    smp 2
#   ser 1 9.648542 4.134501
#   ser 2 7.285704 1.393077
#   ser 3 3.142587 1.012979
# ... omitted 3 trials ...

At this point, arr.2 looks just like arr did, except with values multiplied by 10. Note we had to drop the class column. Now, let's do some trivial matrix algebra:

shuff.mat <- matrix(c(0, 1, 1, 0), nrow=2) # re-order columns
for(i in 1:dim(arr.2)[3]) arr.2[, , i] <- arr.2[, , i] %*% shuff.mat

Now, let's go back to long format with melt. Note the key argument:

dt.2 <- data.table(melt(arr.2, value.name="new.value"), key=c("tr", "ser", "smp"))

Finally, let's join back dt and dt.2. Here you need to be careful. The behavior of data.table is that the inner table will be joined to the outer table based on all the keys of the inner table if the outer table has no keys. If the inner table has keys, data.table will join key to key. This is a problem here because our intended outer table, dt already has a key on only tr from earlier, so our join will happen on that column only. Because of that, we need to either drop the key on the outer table, or reset the key (we chose the latter here):

setkey(dt, tr, ser, smp)
dt[dt.2]
#       tr   ser   smp    value class new.value
#  1: tr 1 ser 1 smp 1 9.648542     A  4.134501
#  2: tr 1 ser 1 smp 2 4.134501     A  9.648542
#  3: tr 1 ser 2 smp 1 7.285704     A  1.393077
#  4: tr 1 ser 2 smp 2 1.393077     A  7.285704
#  5: tr 1 ser 3 smp 1 3.142587     A  1.012979
# ---                                          
# 20: tr 4 ser 1 smp 2 5.160964     A  5.867905
# 21: tr 4 ser 2 smp 1 2.432201     A  7.702306
# 22: tr 4 ser 2 smp 2 7.702306     A  2.432201
# 23: tr 4 ser 3 smp 1 2.671743     A  8.568685
# 24: tr 4 ser 3 smp 2 8.568685     A  2.671743

Note that data.table carries out joins by matching key columns, that is - by matching the first key column of the outer table to the first column/key of the inner table, the second to the second, and so on, not considering column names (there's a feature request to change this). If your tables / keys are not in the same order (as was the case here, if you noticed), you either need to re-order your columns or make sure that both tables have keys on the columns you want in the same order (which is what we did here). The reason the columns were not in the correct order is because of the first join we did to add the class in, which joined on tr and caused that column to become the first one in the data.table.
