R: Setting outliers to NA in a time series that already contains NAs

Date: 2021-10-01 05:54:28

I have a time series containing NAs and some sudden jumps like this:

input <- c(1:5, NA, 6:7, 0, 9:12)

Here, 7, 0, 9 would be considered a jump, and the 0 should be replaced by NA.

I would like to find the very first value at which a sudden jump occurs (with a configurable threshold for what qualifies as a jump; in this example, a change > 1) and set it to NA.

The output for the example should look like this:

output <- c(1:5, NA, 6:7, NA, 9:12)

I only want to set the outliers to NA; I do not want to overwrite the remaining values. The jump can be either negative or positive.

Problems I encountered:

  1. The value after an existing NA being counted as a jump
  2. The "jump back" after the outlier being counted as a jump

Both of these resulted in more NAs than necessary; I want to keep as much of the original data as possible.

Any ideas? I have been stuck for a while. Thanks in advance!

2 solutions

#1

There are three similar situations that differ in how many exceptions have to be handled:

Situation 1

If the series always returns to increasing by 1, with only a few interruptions, I would create vector_check, which resembles the perfect vector. Everything in input that deviates from it is set to NA:

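# Expected value at each position: the first value plus the number of
# observed (non-NA) values so far, so an existing NA does not consume a step
vector_check <- input[1] + cumsum(!is.na(input)) - 1
inds         <- which(vector_check != input)   # which() ignores the NA comparison
input[inds]  <- NA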

Situation 2

If the pattern is less predictable and you basically want to look for 'irregular' values, things get more complicated. One possible solution is a while loop that checks which increments are larger than 2 (or whatever value seems sensible) and then replaces the problematic location in bump_ind with NA. Here I assume that an outlier creates two large increments: one because the value suddenly drops (or rises) and one because it rises back up (or drops back down) to its old level. The loop runs until no problematic locations remain:

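input    <- c(1:5, NA, 6:7, 0, 9:12)
bump_ind <- rep(0, 3)  # dummy start value so the loop is entered

while (length(bump_ind) > 1) {
  # positions where the absolute change to the next value exceeds 2
  bump_ind <- which(abs(diff(input)) > 2)
  # an outlier produces two adjacent large increments; the second one starts at the outlier
  if (length(bump_ind) > 1) input[bump_ind[2]] <- NA
}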

input
# [1]  1  2  3  4  5 NA  6  7 NA  9 10 11 12
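
As a loop-free variant of the same idea, here is a minimal sketch (not part of the original answer). It assumes an outlier is a single isolated value, so the change into it and the change out of it both exceed the threshold and point in opposite directions. The name remove_spikes and its threshold argument are hypothetical. Existing NAs are skipped before differencing, so the value after an NA is compared with the last observed value rather than with the NA itself:

```r
# Hypothetical helper: set isolated spikes to NA, leave everything else untouched.
remove_spikes <- function(x, threshold = 1) {
  ok   <- which(!is.na(x))   # positions of observed values
  vals <- x[ok]
  if (length(vals) < 3) return(x)

  d     <- diff(vals)        # changes between consecutive observed values
  into  <- d[-length(d)]     # change arriving at each interior value
  out   <- d[-1]             # change leaving each interior value

  # a spike: large change in, large change out, opposite directions
  spike <- abs(into) > threshold & abs(out) > threshold & sign(into) != sign(out)

  x[ok[which(spike) + 1]] <- NA
  x
}

input <- c(1:5, NA, 6:7, 0, 9:12)
remove_spikes(input)
# [1]  1  2  3  4  5 NA  6  7 NA  9 10 11 12
```

Because the "jump back" is paired with the jump itself, the 9 is kept and only the 0 becomes NA. Two or more outliers in a row, or a lasting level shift, would not be caught by this sketch.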

Situation 3

A third option, based on your real sensor data, which shows that the data does not have to jump back to the previous level:

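input    <- c(20.2, 20.2, 20.2, 20.2, 20.1, 20.2, 20.2, 20.1, 20.2, 20.2, 20.2, 20.2, 17.7,
              18.9, 19.3, 19.4, 19.4, 19.4, 19.5, 19.5, 19.5)
bump_ind <- rep(0, 3)  # dummy start value so the loop is entered

while (length(bump_ind) > 1) {
  bump_ind <- which(abs(diff(input)) > 2)
  if (length(bump_ind) == 0) break   # no large increments at all
  if (length(bump_ind) > 2) {
    bump_ind <- bump_ind[1:2]        # handle the first two increments per pass
  }
  if (length(bump_ind) == 1) {
    input[bump_ind[1] + 1] <- NA     # single jump without a jump back
  } else if (diff(bump_ind) > 1) {
    input[bump_ind[1] + 1] <- NA     # two separate jumps: fix the first one
  } else {
    input[bump_ind[2]] <- NA         # adjacent increments: a spike
  }
}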

input
# [1] 20.2 20.2 20.2 20.2 20.1 20.2 20.2 20.1 20.2 20.2 20.2 20.2   NA 18.9 19.3
# [16] 19.4 19.4 19.4 19.5 19.5 19.5
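
A condensed sketch of the same idea, assuming the value to discard is always the one immediately after a large increment. The name drop_after_jump is hypothetical, and the default threshold of 2 is simply carried over from above:

```r
# Hypothetical helper: set the value right after each large increment to NA.
drop_after_jump <- function(x, threshold = 2) {
  repeat {
    jump <- which(abs(diff(x)) > threshold)  # NA differences are skipped by which()
    if (length(jump) == 0) break             # nothing left to fix
    x[jump[1] + 1] <- NA                     # discard the value that follows the first jump
  }
  x
}

sensor <- c(20.2, 20.2, 20.2, 20.2, 20.1, 20.2, 20.2, 20.1, 20.2, 20.2, 20.2, 20.2, 17.7,
            18.9, 19.3, 19.4, 19.4, 19.4, 19.5, 19.5, 19.5)
which(is.na(drop_after_jump(sensor)))
# [1] 13

drop_after_jump(c(1:5, NA, 6:7, 0, 9:12))
# [1]  1  2  3  4  5 NA  6  7 NA  9 10 11 12
```

Setting a value to NA turns the differences on both sides of it into NA, so the "jump back" is never counted and the loop always terminates.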

#2

This may be a more robust solution, since you can modify the linear model of your data below if necessary:

Your data:

input <- c(1:5, NA, 6:7, 0, 9:12)

A sequence of index numbers, one per observation:

x <- seq_len(length(input))

Select some threshold value for the residuals of a linear model:

threshold <- 2

Fit a linear model to your data, compute the residuals, and select the outliers:

select <- abs(predict(lm(input ~ x), newdata = data.frame(x = x)) - input) >= threshold

Replace the outliers with NA:

input[select] <- NA
input
# [1]  1  2  3  4  5 NA  6  7 NA  9 10 11 12
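
A variation on the same approach, as a sketch rather than part of the original answer: with na.action = na.exclude, residuals() returns a vector padded with NA at the positions that were dropped from the fit, so it lines up with input without calling predict(). The names fit, res, outliers and cleaned are only illustrative:

```r
input <- c(1:5, NA, 6:7, 0, 9:12)
x     <- seq_along(input)

# na.exclude drops the NA for fitting but keeps its slot in the residuals
fit <- lm(input ~ x, na.action = na.exclude)
res <- residuals(fit)

outliers <- which(abs(res) >= 2)          # which() skips the NA residual
cleaned  <- replace(input, outliers, NA)  # the original input stays untouched
cleaned
# [1]  1  2  3  4  5 NA  6  7 NA  9 10 11 12
```

Keeping the fitted model in fit also makes it easier to swap in a different formula later, which is the flexibility mentioned at the start of this answer.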

EDIT: With your data:

input=c(20.2, 20.2, 20.2, 20.2,
        20.1, 20.2, 20.2, 20.1,
        20.2, 20.2, 20.2, 20.2,
        17.7, 18.9, 19.3, 19.4,
        19.4, 19.4, 19.5, 19.5,
        19.5)

x <- seq_len(length(input))
threshold <- 0.7
select <- abs(predict(lm(input ~ x), newdata = data.frame(x = x)) - input) >= threshold

inputnew <- input
inputnew[select] <- NA

input
# [1] 20.2 20.2 20.2 20.2 20.1 20.2 20.2 20.1 20.2 20.2 20.2 20.2 17.7 18.9 19.3
# [16] 19.4 19.4 19.4 19.5 19.5 19.5

inputnew
# [1] 20.2 20.2 20.2 20.2 20.1 20.2 20.2 20.1 20.2 20.2 20.2 20.2   NA 18.9 19.3
# [16] 19.4 19.4 19.4 19.5 19.5 19.5
