Python - How to normalize time-series data

Date: 2021-10-11 16:55:21

I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples; however, I do not want to take into account differences due to scaling (i.e. I want to look at similarities in the shape of the time series, not their absolute values). So, to this end, I need a way of normalizing the data, that is, making all of the time-series examples fall within a certain range, e.g. [0, 100]. Can anyone tell me how this can be done in Python?


4 Solutions

#1 (7 votes)

Assuming that your timeseries is an array, try something like this:


(timeseries - timeseries.min()) / (timeseries.max() - timeseries.min())

This will confine your values between 0 and 1.

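For reference, here is a minimal sketch of the same idea, rescaled to the [0, 100] range mentioned in the question; the sample data is made up and the series is assumed to be a 1-D NumPy array:

    import numpy as np

    # made-up example series; any 1-D NumPy array works
    timeseries = np.array([3.0, 7.5, 1.2, 9.9, 4.4])

    # min-max scaling to [0, 100] (assumes max != min)
    scaled = 100 * (timeseries - timeseries.min()) / (timeseries.max() - timeseries.min())
    print(scaled)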

#2 (4 votes)

The solutions given are good for series that are neither trending up nor down (i.e. stationary). In a financial time series (or any other series with a bias/trend), the formula given is not right: the series should first be detrended, or the scaling should be based on the latest 100-200 samples.
And if the time series does not come from a normal distribution (as is the case in finance), it is advisable to apply a non-linear function (a standard CDF function, for example) to compress the outliers.
The book by Aronson and Masters (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks):


V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50


Where:
X: data point
F50: median (50th percentile) of the latest 200 points
F75: 75th percentile of the latest 200 points
F25: 25th percentile of the latest 200 points
N: standard normal CDF

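A minimal sketch of that formula in Python could look like the following; the function name and the trailing-window handling are illustrative assumptions, not part of the original answer:

    import numpy as np
    from scipy.stats import norm

    def normalize_trailing(x, window=200):
        """Map each point to roughly [-50, 50] using trailing-window quartiles."""
        x = np.asarray(x, dtype=float)
        out = np.full(len(x), np.nan)  # the first `window` points have no trailing stats
        for i in range(window, len(x)):
            chunk = x[i - window:i]
            f25, f50, f75 = np.percentile(chunk, [25, 50, 75])
            out[i] = 100.0 * norm.cdf(0.5 * (x[i] - f50) / (f75 - f25)) - 50.0
        return out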

#3 (2 votes)

I'm not going to give the Python code, but the definition of normalizing is that for every value (data point) you calculate (value - mean) / stdev. Your values will not fall between 0 and 1 (or 0 and 100), but I don't think that's what you want. You want to compare the variation, which is what you are left with if you do this.

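As a quick sketch, this z-score normalization might look like the following (the sample data is made up):

    import numpy as np

    timeseries = np.array([3.0, 7.5, 1.2, 9.9, 4.4])  # made-up example series
    zscored = (timeseries - timeseries.mean()) / timeseries.std()  # mean 0, stdev 1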

#4 (2 votes)

Following my previous comment, here is a (not optimized) Python function that does scaling and/or normalization. It needs a pandas DataFrame as input, and it doesn't check for that, so it raises errors if supplied with another object type. If you need to use a list or a numpy.array you need to modify it, but you could also convert those objects to a pandas.DataFrame() first.


This function is slow, so it is advisable to run it just once and store the results.


    import numpy as np
    import pandas as pd
    from scipy.stats import norm

    def get_NormArray(df, n, mode='total', linear=False):
        '''
        Computes the normalized value on the stats of the trailing n values
        (modes: 'total' or 'scale') using the formulas from the book
        "Statistically Sound Machine Learning..." (Aronson and Masters),
        but the decision to apply a non-linear scaling is left to the user.
        It is modified to fit the data from -1 to 1 instead of -100 to 100.
        df is an input DataFrame; it returns a DataFrame, but it could return a list.
        n defines the number of data points used to get the median and the
        quartiles for the normalization.
        Modes -- scale: scale without centering; total: center and scale.
        '''
        temp = []

        for i in range(len(df))[::-1]:

            if i >= n:  # there is a traveling norm until we reach the initial n values;
                        # those values are normalized using the last computed F50, F75 and F25
                F50 = df[i-n:i].quantile(0.5)
                F75 = df[i-n:i].quantile(0.75)
                F25 = df[i-n:i].quantile(0.25)

            if linear and mode == 'total':
                v = 0.5 * ((df.iloc[i] - F50) / (F75 - F25)) - 0.5
            elif linear and mode == 'scale':
                v = 0.25 * df.iloc[i] / (F75 - F25) - 0.5
            elif not linear and mode == 'scale':
                v = 0.5 * norm.cdf(0.25 * df.iloc[i] / (F75 - F25)) - 0.5
            else:  # even with unexpected arguments, default to full normalization with compression
                v = norm.cdf(0.5 * (df.iloc[i] - F50) / (F75 - F25)) - 0.5

            # v is a Series (linear modes) or an ndarray (CDF modes); take the first column's value
            temp.append(float(np.asarray(v).ravel()[0]))
        return pd.DataFrame(temp[::-1])
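An illustrative usage sketch (the data here is made up; only the call signature is taken from the function above):

    import numpy as np
    import pandas as pd

    prices = pd.DataFrame(np.random.randn(500).cumsum())  # made-up single-column series
    normalized = get_NormArray(prices, n=200, mode='total', linear=False)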
