Time series - data splitting and model evaluation

Date: 2023-01-03 16:55:28

I've been trying to use machine learning to make predictions based on time-series data. A related question (createTimeSlices function in CARET package in R) gives an example of using createTimeSlices for cross-validation during model training and parameter tuning:

    library(caret)
    library(ggplot2)
    library(pls)
    data(economics)
    myTimeControl <- trainControl(method = "timeslice",
                                  initialWindow = 36,
                                  horizon = 12,
                                  fixedWindow = TRUE)

    plsFitTime <- train(unemploy ~ pce + pop + psavert,
                        data = economics,
                        method = "pls",
                        preProc = c("center", "scale"),
                        trControl = myTimeControl)

My understanding is:

  1. I need to split my data into a training set and a test set.
  2. Use the training set for parameter tuning.
  3. Evaluate the obtained model on the test set (using R2, RMSE, etc.)

Because my data is a time series, I suppose that I cannot use bootstrapping to split the data into training and test sets. So, my questions are: Am I right? And if so, how do I use createTimeSlices for model evaluation?

3 answers

#1


28  

Note that the code in your original question already takes care of the time slicing, so you don't have to create the time slices by hand.

However, here is how to use createTimeSlices to split the data and then use the slices for training and testing a model.

Step 0: Setting up the data and trainControl (from your question):

library(caret)
library(ggplot2)
library(pls)

data(economics)

Step 1: Creating the timeSlices for the index of the data:

timeSlices <- createTimeSlices(1:nrow(economics), 
                   initialWindow = 36, horizon = 12, fixedWindow = TRUE)

This creates a list of training and testing timeSlices.

> str(timeSlices,max.level = 1)
## List of 2
## $ train:List of 431
##   .. [list output truncated]
## $ test :List of 431
##   .. [list output truncated]

For ease of understanding, I am saving them in separate variables:

trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]

Step 2: Training on the first of the trainSlices:

plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = economics[trainSlices[[1]],],
                    method = "pls",
                    preProc = c("center", "scale"))

Step 3: Testing on the first of the testSlices:

pred <- predict(plsFitTime,economics[testSlices[[1]],])

Step 4: Plotting:

true <- economics$unemploy[testSlices[[1]]]

plot(true, col = "red", ylab = "true (red) , pred (blue)", ylim = range(c(pred,true)))
points(pred, col = "blue") 
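
Since the question asks about evaluating with R2 and RMSE, the held-out predictions can also be scored numerically rather than only plotted. The following is a minimal base-R sketch: `score_slice` is a hypothetical helper name, and the two vectors are made-up stand-ins for `pred` and `true` from the steps above. caret's own `postResample(pred, obs)` reports similar statistics.

```r
# Hypothetical helper: score the predictions for one held-out slice.
score_slice <- function(pred, obs) {
  rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
  r2   <- cor(pred, obs)^2            # squared correlation, one common R2 definition
  c(RMSE = rmse, Rsquared = r2)
}

# Made-up values standing in for `pred` and `true` from Steps 3 and 4.
obs_demo  <- c(10, 12, 15, 11)
pred_demo <- c(11, 12, 13, 12)

scores_demo <- score_slice(pred_demo, obs_demo)
print(scores_demo)
```

With the real objects this would be called as `score_slice(pred, true)`.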

You can then do this for all the slices:

for(i in 1:length(trainSlices)){
  plsFitTime <- train(unemploy ~ pce + pop + psavert,
                      data = economics[trainSlices[[i]],],
                      method = "pls",
                      preProc = c("center", "scale"))
  pred <- predict(plsFitTime,economics[testSlices[[i]],])


  true <- economics$unemploy[testSlices[[i]]]
  plot(true, col = "red", ylab = "true (red) , pred (blue)", 
            main = i, ylim = range(c(pred,true)))
  points(pred, col = "blue") 
}
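
Plotting 431 slices one by one is hard to summarise, so it may help to collect one error value per slice instead. The sketch below shows the idea under loud assumptions: it uses simulated data and `lm` in place of `pls` so that it runs with base R only, and the window sizes are arbitrary. It is not what `train` does internally.

```r
set.seed(1)

# Simulated stand-in for a time-ordered data set (not the economics data).
n <- 60
demo <- data.frame(x = cumsum(rnorm(n)))
demo$y <- 2 * demo$x + rnorm(n)

initial_window <- 24  # training rows per slice (assumed value)
horizon <- 6          # test rows per slice (assumed value)

# Index of the last training row for each fixed-width rolling slice.
last_train <- seq(initial_window, n - horizon)

# Fit on each training window, predict the following `horizon` rows,
# and keep the RMSE of that slice.
rmse_by_slice <- vapply(last_train, function(end_idx) {
  train_idx <- (end_idx - initial_window + 1):end_idx
  test_idx  <- (end_idx + 1):(end_idx + horizon)
  fit  <- lm(y ~ x, data = demo[train_idx, ])
  pred <- predict(fit, newdata = demo[test_idx, ])
  sqrt(mean((pred - demo$y[test_idx])^2))
}, numeric(1))

summary(rmse_by_slice)
```

With the economics data, the same pattern would index `trainSlices[[i]]` and `testSlices[[i]]` from above instead of computing the windows by hand.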

As mentioned earlier, this sort of timeSlicing is done by your original function in one step:

> myTimeControl <- trainControl(method = "timeslice",
+                               initialWindow = 36,
+                               horizon = 12,
+                               fixedWindow = TRUE)
> 
> plsFitTime <- train(unemploy ~ pce + pop + psavert,
+                     data = economics,
+                     method = "pls",
+                     preProc = c("center", "scale"),
+                     trControl = myTimeControl)
> plsFitTime
Partial Least Squares 

478 samples
  5 predictors

Pre-processing: centered, scaled 
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window) 

Summary of sample sizes: 36, 36, 36, 36, 36, 36, ... 

Resampling results across tuning parameters:

  ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
  1      1080  0.443     796      0.297      
  2      1090  0.43      845      0.295      

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 1. 

Hope this helps!!

#2


4  

Shambho's answer provides a decent example of how to use the caret package with time slices; however, it can be misleading in terms of modelling technique. So as not to mislead future readers who want to use the caret package for predictive modelling on time series (and here I do not mean autoregressive models), I want to highlight a few things.

The problem with time-series data is that look-ahead bias is easy to introduce if one is not careful. In this case, the economics data set aligns the data at their economic reporting dates and not their release dates, which is never the case in real-life applications (economic data points have different time stamps). Unemployment data may be two months behind the other indicators in terms of release date, which would then introduce a model bias in Shambho's example.

Next, this example is only descriptive statistics and not predictive (forecasting), because the data we want to forecast (unemploy) is not lagged correctly. It merely trains a model to best explain the variation in unemployment (which in this case is also a non-stationary time series, creating all sorts of issues in the modelling process) based on predictor variables at the same economic report dates.
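
One way to read "lagged correctly" is that the target at time t should only be paired with predictor values that were already available at that time. The sketch below illustrates this with simulated data; the column names and the two-month delay are assumptions for illustration, and the real delay would have to come from the actual release calendar.

```r
set.seed(42)

# Simulated monthly data; column names are illustrative only.
n <- 24
demo <- data.frame(month = seq_len(n), predictor = rnorm(n))
demo$target <- rnorm(n)

lag_months <- 2  # assumed publication delay

# Shift the predictor so that row t holds the value observed at t - lag_months.
demo$predictor_lagged <- c(rep(NA, lag_months),
                           head(demo$predictor, -lag_months))

# The first rows have no lagged value and cannot be used for fitting.
model_data <- demo[complete.cases(demo), ]
fit <- lm(target ~ predictor_lagged, data = model_data)

nrow(model_data)
```

The lagged data frame can then be passed to the time-slice resampling in place of the unlagged one.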

Lastly, the 12-month horizon in this example is not true multi-period forecasting as Hyndman does it in his examples.

Hyndman on cross-validation for time-series

#3


2  

Actually, you can!

First, let me give you a scholarly article on the topic.

In R:

In the caret package, createResample can be used to make simple bootstrap samples, and createFolds can be used to generate balanced cross-validation groupings from a set of data. So you'll probably want to use createResample. Here's an example of its usage:

data(oil)
createDataPartition(oilType, 2)

x <- rgamma(50, 3, .5)
inA <- createDataPartition(x, list = FALSE)

plot(density(x[inA]))
rug(x[inA])

points(density(x[-inA]), type = "l", col = 4)
rug(x[-inA], col = 4)

createResample(oilType, 2)

createFolds(oilType, 10)
createFolds(oilType, 5, FALSE)

createFolds(rnorm(21))

createTimeSlices(1:9, 5, 1, fixedWindow = FALSE)
createTimeSlices(1:9, 5, 1, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = FALSE)

The arguments passed to createResample are the data and the number of resamples to create, in this case 2. You can additionally specify whether the results are returned as a list (list = TRUE) or as a matrix (list = FALSE).
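
To make concrete what a simple bootstrap sample of a time-ordered series looks like, here is a base-R sketch. `simple_bootstrap` is a hypothetical helper written for illustration, not caret's implementation. Note that the rows that are never drawn end up scattered through time, so a model may be fitted on observations that come after some of the held-out ones; whether that is acceptable depends on the model and the data.

```r
set.seed(123)

y <- 1:12  # stand-in for a time-ordered series

# Hypothetical helper: draw `times` bootstrap samples of row indices.
simple_bootstrap <- function(y, times) {
  lapply(seq_len(times), function(i) {
    sort(sample(seq_along(y), replace = TRUE))
  })
}

boot_idx <- simple_bootstrap(y, times = 2)

# Rows that were never drawn in a sample can serve as its held-out set.
held_out <- lapply(boot_idx, function(idx) setdiff(seq_along(y), idx))

boot_idx[[1]]  # indices used for fitting (with repeats)
held_out[[1]]  # indices left out of that sample
```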

Additionally, caret contains a function called createTimeSlices that can create the indices for this type of splitting.

The three parameters for this type of splitting are:

  • initialWindow: the initial number of consecutive values in each training set sample
  • horizon: the number of consecutive values in each test set sample
  • fixedWindow: a logical; if FALSE, the training set always starts at the first sample and the training set size varies over data splits
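
The effect of these three parameters can be illustrated with a small base-R sketch. `make_time_slices` is a hypothetical helper that follows the parameter descriptions above; it is not caret's source code and ignores any other options createTimeSlices may have.

```r
# Hypothetical helper mimicking the three parameters described above.
make_time_slices <- function(n, initial_window, horizon = 1, fixed_window = TRUE) {
  # Last training index of each slice.
  last_train <- seq(initial_window, n - horizon)
  # Fixed window: the start slides forward; otherwise it stays at 1.
  first_train <- if (fixed_window) {
    last_train - initial_window + 1
  } else {
    rep(1, length(last_train))
  }
  list(
    train = Map(seq, first_train, last_train),
    test  = Map(seq, last_train + 1, last_train + horizon)
  )
}

fixed   <- make_time_slices(9, initial_window = 5, horizon = 3, fixed_window = TRUE)
growing <- make_time_slices(9, initial_window = 5, horizon = 3, fixed_window = FALSE)

fixed$train[[2]]    # 2 3 4 5 6
growing$train[[2]]  # 1 2 3 4 5 6
fixed$test[[2]]     # 7 8 9
```

With n = 478, initial_window = 36 and horizon = 12, the same arithmetic gives 431 slices, which matches the list lengths shown in the first answer.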

Usage:

createDataPartition(y, 
                    times = 1,
                    p = 0.5,
                    list = TRUE,
                    groups = min(5, length(y)))
createResample(y, times = 10, list = TRUE)
createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
createMultiFolds(y, k = 10, times = 5)
createTimeSlices(y, initialWindow, horizon = 1, fixedWindow = TRUE)

Sources:

http://caret.r-forge.r-project.org/splitting.html

http://eranraviv.com/blog/bootstrapping-time-series-r-code/

http://rgm3.lab.nig.ac.jp/RGM/R_rdfile?f=caret/man/createDataPartition.Rd&d=R_CC

CARET. Relationship between data splitting and trainControl
