What am I doing wrong in this long-to-wide reshape?

Time: 2022-09-16 11:26:09

the problem

A function I wrote to widen a long table of repeated multivariate time series data for input to classifier functions seems to cause erroneous results even for easy test data, but I can't locate the issue.

background

I am keeping a bunch of repeated trials of multivariate time series in a long data.table format like this, for speed and ease of use with most R idioms:

> this.data
              Time Trial Class Channel      Value
     1: -100.00000     1    -1      V1  0.4551513
     2:  -96.07843     2    -1      V1  0.8241555
     3:  -92.15686     3    -1      V1  0.7667328
     4:  -88.23529     4    -1      V1  0.7475106
     5:  -84.31373     5    -1      V1  0.9810273
    ---                                          
204796:  884.31373   196     1      V4 50.2642220
204797:  888.23529   197     1      V4 50.5747661
204798:  892.15686   198     1      V4 50.5749421
204799:  896.07843   199     1      V4 50.1988299
204800:  900.00000   200     1      V4 50.7756015

Specifically, the above data has a Time column with 256 unique values from -100 to 900, repeated for each Channel and each Trial. Similarly, each Channel is one of V1, V2, V3, V4, repeated for each Time sample and each Trial. In other words, any combination of Time, Trial, Channel uniquely specifies a Value. To keep things simple, all Trials under 100 have Class -1, and all above 99 have Class 1. (For testing purposes, all Values in Class 1 have a mean of 50, while those in Class -1 have a mean of 0. This data can be generated and tweaked using the dummy.plug() function included in a gist I made.)
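
That uniqueness claim is cheap to verify before reshaping. A minimal sketch, assuming a data.table with the same key columns; the toy table below is a hypothetical stand-in, not the gist's data:

```r
library(data.table)

# Hypothetical toy long table: 3 time samples x 2 trials x 2 channels
toy <- CJ(Time = c(-100, 0, 100), Trial = 1:2, Channel = c("V1", "V2"))
toy[, Value := seq_len(.N)]

# Count rows per key combination; any count above 1 breaks the
# "Time, Trial, Channel uniquely specifies a Value" assumption
dup.counts <- toy[, .N, by = .(Time, Trial, Channel)]
any.duplicates <- dup.counts[, any(N > 1)]
```

On the real table, the same two lines applied to this.data should leave any.duplicates as FALSE if the assumption holds.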

In order to process the data using different machine learning classification algorithms, it seems to be necessary to reshape the data to something a little bit wider, so that each of the time series has its own column, while the others remain as ids. (For example, the stepwise classifier stepclass from klaR needs the features in different columns, so it can choose which ones to drop or add to its model as it trains.) Since there are repeated trials, I have not had success making existing functions like the cast family work, and so I wrote my own:

##### converting from long table form to channel-split wide form #####
# for multivariate repeated time series
channel.form <- function(input.table,
                         value.col = "Voltage",
                         split.col = "Channel",
                         class.col = "Class",
                         time.col = "Time",
                         trial.col = "Trial") {
# Converts long table format to slightly wider format split by channels.
# For epoched datasets.

  setkeyv(input.table, class.col)

  chan.split <- split(input.table,input.table[,get(split.col)])

  chan.d <- cbind(lapply(chan.split, function(x){
    x[,value.col,with=FALSE]}))

  chan.d <- as.data.table(matrix(unlist(chan.d),
                            ncol = input.table[,length(unique(get(split.col)))], 
                            byrow=TRUE))

  # reintroduce class labels
  # since the split is over identical sections for each channel, we can just use
  # the first split's labels
  chan.d <- chan.d[,c(class.col):= chan.split[[1]][,get(class.col)]]
  chan.d[,c(class.col):=as.factor(get(class.col))]

  # similarly with time and trial labels
  chan.d <- chan.d[,Time:= chan.split[[1]][,get(time.col)]]
  chan.d <- chan.d[,Trial:= chan.split[[1]][,get(trial.col)]]

  return(chan.d) 
}

Using this function, I take some multivariate trials that I have prepared into a long data.table like the one at the top, and reshape them into a wider one that looks like this:

> this.data.training.channel
              V1        V2        V3        V4 Class       Time Trial
    1: -50.58389 -50.56397 -50.74251 -50.86700    -1 -100.00000     1
    2: -50.92713 -50.28009 -50.15078 -50.70161    -1  -96.07843     2
    3: -50.84276 -50.02456 -50.20015 -50.45228    -1  -76.47059     7
    4: -50.68679 -50.05475 -50.04270 -50.83900    -1  -72.54902     8
    5: -50.55954 -50.88998 -50.01273 -50.86856    -1  -68.62745     9
   ---                                                               
35836:  49.52361  49.37465  49.73997  49.10543     1  876.47059   194
35837:  49.93162  49.38352  49.62406  49.16854     1  888.23529   197
35838:  49.67510  49.63853  49.54259  49.81198     1  892.15686   198
35839:  49.26295  49.98449  49.60437  49.03918     1  896.07843   199
35840:  49.05030  49.42035  49.48546  49.73438     1  900.00000   200

At this point, I take the widened table and give it to a classifier like lda(), then test it on a separate random portion of the same data:

lda.model <- lda(Class ~ . -Trial, this.data.training.channel)
lda.pred <- predict(lda.model, this.data.testing.channel)

symptoms

However, even if I generate obscenely separated dummy data (see picture below), I am getting near-chance results with existing reasonable libraries. (I know the libraries are probably not at fault, because if I allow the algorithm to use the trial index as a training feature, it correctly classifies every input.)

[Picture: class averages of the dummy data, faceted by Channel and colored by Class]

> table(predicted = lda.pred$class, data = this.data.testing.channel[,Class])
         data
predicted   -1    1
       -1 2119 1878
       1  5817 5546

> 1-sum(lda.pred$class != this.data.testing.channel[,Class])/length(lda.pred$class)
[1] 0.4984375

> table(predicted = sda.pred$class, data = this.data.testing.channel[,Class])
         data
predicted   -1    1
       -1 3705 3969
       1  3719 3967

> 1-sum(sda.pred$class != this.data.testing.channel[,Class])/length(sda.pred$class)
[1] 0.4994792

The accuracy is basically a coin flip, despite the values from class 1 being about 50 times the values from class -1. I must be making some huge mistake (which I think is a programming one, otherwise I would be over on Cross Validated), but I have spent days prodding it and rewriting code with no improvement. (As an example, note that I get the same result whether or not I scale the input values so that they have mean 0, variance 1.)
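
The 1 - sum(...)/length(...) figure is the fraction classified correctly, so it can be cross-checked from a confusion table alone. A minimal sketch using the sda counts printed above (base R only; conf.tab is a name introduced here for illustration):

```r
# Confusion table for sda, copied from the output above
# (rows: predicted class, columns: true class)
conf.tab <- matrix(c(3705, 3719, 3969, 3967), nrow = 2,
                   dimnames = list(predicted = c("-1", "1"),
                                   data = c("-1", "1")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf.tab)) / sum(conf.tab)
```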

reproducing the problem

A complete gist that can be run to reproduce the problem is available here.

possible problems I considered, what I tried

(see previous revisions of the question for the full list, due to length considerations)

I wrote a function (included in the gist) to generate easily separable dummy data, and wrote another function to average each of the two classes, faceted by Channel and colored by Class, like the plot above. Playing with each of the parameters (difference in population means, channel count, etc.) seems to produce expected output, as well as peeking at appropriate subsets using calls like this.data[Trial==1,unique(Time),by=Subject].

what do I need to solve this?

I would greatly appreciate any advice on fixing this. I just can't see what I'm doing wrong.

If someone either diagnosed/located the issue, or was able to illustrate, using a different approach, a reshaped table from the data that worked with these (popular) classifier functions, I wouldn't just accept, I would award a bounty (after testing, of course).

session info

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  grid      stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] doMC_1.3.2              iterators_1.0.6         AUC_0.3.0              
 [4] LiblineaR_1.80-7        RcppRoll_0.1.0          RcppArmadillo_0.4.300.0
 [7] Rcpp_0.11.1             foreach_1.4.1           cvTools_0.3.2          
[10] robustbase_0.90-2       latticist_0.9-44        vcd_1.3-1              
[13] latticeExtra_0.6-26     lattice_0.20-29         pheatmap_0.7.7         
[16] RColorBrewer_1.0-5      klaR_0.6-10             MASS_7.3-29            
[19] ggplot2_0.9.3.1         reshape2_1.2.2          data.table_1.9.2       
[22] sda_1.3.3               fdrtool_1.2.12          corpcor_1.6.6          
[25] entropy_1.2.0           zoo_1.7-11              testthat_0.8           

loaded via a namespace (and not attached):
 [1] codetools_0.2-8  colorspace_1.2-4 combinat_0.0-8   compiler_3.0.2   DEoptimR_1.0-1  
 [6] dichromat_2.0-0  digest_0.6.4     gtable_0.1.2     gWidgets_0.0-52  labeling_0.2    
[11] munsell_0.4.2    plyr_1.8         proto_0.3-10     scales_0.2.3     stringr_0.6.2   
[16] tools_3.0.2   

2 answers

#1


2  

I could not reproduce your error and I found some problems with dummy.plug(). I generated data with

library(data.table)
library(reshape2)
library("MASS")

set.seed(115)
pp<-dummy.plug(trial.count = 200,
    chan.count = 4,
    mean.diff = 100,
    value.name = "Value")

And I don't care for data.table, so I just converted it to a basic data.frame.

dd<-as.data.frame(pp)

Now you say that Time, Trial, and Channel should uniquely identify a value, but that does not seem to be the case in the dummy data. I see that

subset(dd, Time==-100 & Trial==1 & Channel=="V1")

#       Time Trial Class Channel      Value
# 1     -100     1    -1      V1 0.73642916
# 6401  -100     1    -1      V1 0.17648939
# 12801 -100     1    -1      V1 0.41366964
# 19201 -100     1    -1      V1 0.07044473
# 25601 -100     1    -1      V1 0.86583284
# 32001 -100     1    -1      V1 0.24255411
# 38401 -100     1    -1      V1 0.92473225
# 44801 -100     1    -1      V1 0.69989600

So there are clearly multiple values for each combination. So to proceed, I decided just to take the mean of the observed values. I had no problems using dcast with

xx<-dcast(dd, Class+Time+Trial~Channel, fun.aggregate=mean)

Then I split up the training/test datasets

train.trials = sample(unique(dd$Trial), 140)
train.data = subset(xx, Trial %in% train.trials)
test.data = subset(xx, !Trial %in% train.trials)

Then I ran lda as above

lda.model <- lda(Class ~ . -Trial, train.data)
lda.pred <- predict(lda.model, test.data)

And I checked out how I did

table(lda.pred$class, test.data$Class)
#        -1    1
#   -1  704    0
#   1     0 1216

And I appear to do much better than you did.

Unless something bad happened when I converted the data.table to a data.frame, there seem to be problems with your test data. Perhaps there is also a problem with your non-cast reshape function. Seeing as dcast works just fine, you may want to check that your function produces the same result.
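
One quick way to check a hand-rolled reshape is to run its core step on a table small enough to read by eye. As a minimal sketch, assuming the reshape flattens per-channel columns with unlist() and rebuilds them with matrix() (the toy values and names below are hypothetical, not taken from the gist), the byrow argument decides whether each column ends up holding a single channel:

```r
# Two hypothetical channels with three samples each
chan.cols <- list(V1 = c(1, 2, 3), V2 = c(10, 20, 30))

# unlist() concatenates channel by channel: all of V1, then all of V2
flat <- unlist(chan.cols)

# byrow = TRUE fills each row with consecutive values of the flat vector,
# so a row mixes neighbouring samples instead of holding one sample per channel
wide.byrow <- matrix(flat, ncol = 2, byrow = TRUE)

# The default (byrow = FALSE) fills column by column: one channel per column
wide.bycol <- matrix(flat, ncol = 2)
```

Comparing a few rows of such a result against dcast on the same input would show quickly whether the two agree.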

#2


1  

MrFlick was right on both counts. For the sake of completeness, here is a data.table answer with some extra explanation.

bad dummy data function

The dummy function in the above gist was indeed bad; the crucial lines are these:

  dummy.data <- data.table(matrix(runif(length(time.vector)*trial.count*chan.count),
                                  ncol=chan.count),
                           Time=rep(time.vector,times = trial.count))
  setkey(dummy.data,Time)
  dummy.data <- dummy.data[,Trial:=seq_len(trial.count)]

Since Trial is recycled down the table when it is assigned, the rows have to be ordered beforehand so that the recycled Trial values line up with the other columns: each run of trial.count rows must share a single Time value. A fast way to do this is sorting by Time, which is one of the effects of setkey(). Once this is done, each combination of Time, Trial, and Channel does identify exactly one row:
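
The effect of the sort can be seen on a table small enough to count by hand. This is a minimal sketch with hypothetical sizes (4 time samples, 6 trials), not the gist's code; rep() is used to make the recycling explicit:

```r
library(data.table)

time.vector <- c(-100, 0, 100, 200)  # hypothetical: 4 time samples
trial.count <- 6                     # hypothetical: 6 trials

# Without sorting, Time cycles every 4 rows while Trial cycles every 6,
# so some (Time, Trial) pairs repeat and others never occur
unsorted <- data.table(Time = rep(time.vector, times = trial.count))
unsorted[, Trial := rep(seq_len(trial.count), length.out = .N)]
dup.unsorted <- unsorted[, .N, by = .(Time, Trial)][, max(N)]

# Sorting by Time first gives contiguous blocks of trial.count rows per
# Time value, so the recycled Trial labels line up with them
sorted <- data.table(Time = rep(time.vector, times = trial.count))
setkey(sorted, Time)
sorted[, Trial := rep(seq_len(trial.count), length.out = .N)]
dup.sorted <- sorted[, .N, by = .(Time, Trial)][, max(N)]
```

Here dup.unsorted comes out above 1 (duplicated pairs), while dup.sorted is 1 (every pair unique).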

# load dummy data
set.seed(115)
this.data <- dummy.plug(trial.count = 200,
                       chan.count = 4,
                       mean.diff = 50,
                       value.name = "Value")
> this.data[(Trial==1 & Channel=="V1" & Time == -100),]
   Time Trial Class Channel     Value
1: -100     1    -1      V1 0.7364292

dcast works now

Now that the uniqueness criterion is satisfied, dcast works on the data table:

> this.data.channel <- dcast.data.table(this.data,
+                                                Class+Time+Trial~Channel,
+                                                fun.aggregate=identity)
Using 'Value' as value column. Use 'value.var' to override
> this.data.channel
       Class Time Trial           V1         V2         V3          V4
    1:    -1 -100     1 7.364292e-01  0.8889176  0.4638730  0.61258621
    2:    -1 -100     2 9.030099e-02  0.1435559  0.1596734  0.88577669
    3:    -1 -100     3 6.685920e-01  0.1013146  0.7156151  0.51144831
    4:    -1 -100     4 9.154142e-04  0.2429634  0.3169072  0.05810808
    5:    -1 -100     5 7.383397e-01  0.3668977  0.3779892  0.34938949
   ---                                                                
51196:     1  900   196 5.028103e+01 50.2810276 50.2810276 50.28102761
51197:     1  900   197 5.080229e+01 50.8022872 50.8022872 50.80228716
51198:     1  900   198 5.084255e+01 50.8425466 50.8425466 50.84254662
51199:     1  900   199 5.096859e+01 50.9685913 50.9685913 50.96859133
51200:     1  900   200 5.034459e+01 50.3445878 50.3445878 50.34458784
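
For reference, a minimal sketch of the same kind of cast on a toy table, naming the value column explicitly through value.var so that it does not have to be guessed. The toy table and its values are hypothetical, and the sketch assumes a data.table version that provides a dcast() method for data.tables:

```r
library(data.table)

# Hypothetical toy long table: 2 trials x 2 time samples x 2 channels
toy <- CJ(Trial = 1:2, Time = c(-100, 0), Channel = c("V1", "V2"))
toy[, Class := ifelse(Trial < 2, -1, 1)]
toy[, Value := seq_len(.N)]

# One row per Class/Time/Trial, one column per Channel
wide <- dcast(toy, Class + Time + Trial ~ Channel, value.var = "Value")
```

Because every Class, Time, Trial, Channel combination occurs once in toy, no aggregation function is needed.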

You can quickly spot check to see that this worked properly:

> this.data.channel[,unique(Trial),by=Class]
     Class  V1
  1:    -1   1
  2:    -1   2
  3:    -1   3
  4:    -1   4
  5:    -1   5
 ---          
196:     1 196
197:     1 197
198:     1 198
199:     1 199
200:     1 200

classification check

The remainder of the gist works, as does MrFlick's snippet.

> lda.model <- lda(Class ~ . -Trial, this.data.training.channel)
> lda.pred <- predict(lda.model, this.data.testing.channel)
> table(predicted = lda.pred$class, data = this.data.testing.channel[,Class])
         data
predicted   -1    1
       -1 5888    0
       1     0 9472
> 1-sum(lda.pred$class != this.data.testing.channel[,Class])/length(lda.pred$class)
[1] 1

Why I couldn't get dcast working before is something I will have to dig into an old revision to look at. I suspect a permutation problem (during import instead of generation) like the above contributed to it.
