Python time series analysis

    Good afternoon, dear readers.
    In today's article I will try to describe the process of analyzing time series in Python with the statsmodels module. This module provides a wide range of tools and methods for statistical analysis and econometrics. I will walk through the main stages of analyzing such a series, and in conclusion we will build an ARIMA model.
    As an example, I took real data on the turnover of one of the warehouse complexes in the Moscow Region.

    Loading and preprocessing the data


    To get started, let's load the data and take a look at it:

    # numpy and matplotlib are needed further below (np.vstack, plt.figure)
    import numpy as np
    import matplotlib.pyplot as plt
    from pandas import read_csv, DataFrame
    import statsmodels.api as sm
    from statsmodels.iolib.table import SimpleTable
    from sklearn.metrics import r2_score
    import ml_metrics as metrics

    dataset = read_csv('tovar_moving.csv', sep=';', index_col=['date_oper'],
                       parse_dates=['date_oper'], dayfirst=True)
    dataset.head()
    

                Otgruzka  priemka
    date_oper
    2009-09-01    179667   276712
    2009-09-02    177670   164999
    2009-09-03    152112   189181
    2009-09-04    142938   254581
    2009-09-05    130741   192486


    As you can see, in addition to the parameters that specify the columns to use and the index, the read_csv() call here takes a few more parameters for working with dates. Let us dwell on them in more detail.
    parse_dates specifies the names of the columns to be converted to the DateTime type. Note that if such a column contains empty values, parsing will fail and a column of type object will be returned. To avoid this, add the keep_default_na=False parameter.
    The final parameter, dayfirst, tells the parser that the day comes first in the date string, not the other way around. If this parameter is not set, the function may convert dates incorrectly and swap the day and month. For example, without it the string 02.01.2013 is parsed as February 1, 2013, whereas with dayfirst=True it is parsed as January 2, 2013, as the sketch below shows.
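    Here is a minimal sketch (not from the original article) illustrating the effect of dayfirst with pandas' to_datetime(), which uses the same parsing logic as read_csv():

    from pandas import to_datetime

    # Without dayfirst, the month is assumed to come first in ambiguous dates
    print(to_datetime('02.01.2013'))                 # 2013-02-01 (February 1)
    # With dayfirst=True, the day comes first
    print(to_datetime('02.01.2013', dayfirst=True))  # 2013-01-02 (January 2)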
    Let's extract the series of shipment values as a separate time series:

    otg = dataset.Otgruzka
    otg.head()
    

    date_oper
    2009-09-01    179667
    2009-09-02    177670
    2009-09-03    152112
    2009-09-04    142938
    2009-09-05    130741
    Name: Otgruzka, dtype: int64

    So we now have a time series and we can proceed to its analysis.

    Time series analysis


    To begin, let's plot our series:

    otg.plot(figsize=(12,6))
    


    The graph shows that our series has a number of outliers that affect its spread. Besides, analyzing shipments for every single day is not quite correct because, for example, at the end or the beginning of a week there will be days on which considerably more goods are shipped than on the others. Therefore it makes sense to switch to a weekly interval and the average value of shipments on it; this will rid us of the outliers and reduce the fluctuations of our series. pandas has a convenient function for this, resample(); it takes the resampling period and an aggregate function as parameters:

    # in older pandas this was written otg.resample('W', how='mean'); the 'how' argument is now removed
    otg = otg.resample('W').mean()
    otg.plot(figsize=(12,6))
    


    As you can see, the new chart has no pronounced outliers and has a clear trend. From this we can conclude that the series is not stationary [1].

    itog = otg.describe()
    otg.hist()
    itog
    

    count         225
    mean       270858.285365
    std        118371.082975
    min           872.857143
    25%        180263.428571
    50%        277898.714286
    75%        355587.285714
    max        552485.142857
    dtype: float64



    As can be seen from these characteristics and the histogram, our series is more or less homogeneous and has a relatively small spread, as indicated by the coefficient of variation V = σ/x̄, where σ is the standard deviation and x̄ is the arithmetic mean of the sample. In our case it is equal to:

    print('V = %f' % (itog['std']/itog['mean']))
    

    V = 0.437022

    Let us carry out the Jarque-Bera test for normality of the distribution to confirm the homogeneity assumption. For this there is the jarque_bera() function, which returns the values of this statistic:

    row = [u'JB', u'p-value', u'skew', u'kurtosis']
    jb_test = sm.stats.stattools.jarque_bera(otg)
    a = np.vstack([jb_test])
    itog = SimpleTable(a, row)
    print(itog)
    



    The value of this statistic indicates that the null hypothesis of normality of the distribution is not rejected (p-value > 0.05), so our series most likely has a normal distribution.
    The SimpleTable() function is used to format the output. In our case it takes an array of values (at most two-dimensional) and a list with the names of the columns or rows; a tiny usage sketch follows.
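    For illustration, here is a minimal, self-contained sketch of SimpleTable (the values are made up):

    from statsmodels.iolib.table import SimpleTable

    # a 1x2 array of made-up values with two column headers
    example = SimpleTable([[3.14, 0.05]], ['statistic', 'p-value'])
    print(example)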
    Many methods and models are based on the assumption that the series is stationary, but as noted earlier, our series is most likely not. Therefore, to check for stationarity, let us run the augmented Dickey-Fuller test for the presence of unit roots. The statsmodels module provides the adfuller() function for this:

    test = sm.tsa.adfuller(otg)
    print('adf: ', test[0])
    print('p-value: ', test[1])
    print('Critical values: ', test[4])
    if test[0] > test[4]['5%']:
        print('there are unit roots, the series is not stationary')
    else:
        print('no unit roots, the series is stationary')
    

    adf: -1.38835541357
    p-value: 0.58784577297
    Critical values: {'5%': -2.8753374677799957, '1%': -3.4617274344627398, '10%': -2.5741240890815571}
    there are unit roots, the series is not stationary


    The test did not confirm the assumption of stationarity of the series. In many cases, taking differences of the series makes it stationary. If, for example, the first differences of a series are stationary, it is called an integrated series of order one.
    So, let's determine the order of integration of our series:

    otg1diff = otg.diff(periods=1).dropna()
    

    In the code above, the diff() function computes the differences of the original series with the specified lag, which is passed via the periods parameter. Since the first value of the differenced series is undefined (NaN), we get rid of it with the dropna() method. As a sanity check, the sketch below shows that diff() is equivalent to subtracting a shifted copy of the series.
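    A small sketch (not from the original article) confirming the equivalence:

    # diff(periods=1) is the same as subtracting the series shifted by one step
    manual_diff = (otg - otg.shift(1)).dropna()
    print((manual_diff == otg1diff).all())  # expected: True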
    Let's check the resulting series for stationarity:

    test = sm.tsa.adfuller(otg1diff)
    print('adf: ', test[0])
    print('p-value: ', test[1])
    print('Critical values: ', test[4])
    if test[0] > test[4]['5%']:
        print('there are unit roots, the series is not stationary')
    else:
        print('no unit roots, the series is stationary')
    

    adf: -5.95204224907
    p-value: 2.13583392404e-07
    Critical values: {'5%': -2.8755379867788462, '1%': -3.4621857592784546, '10%': -2.574231080806213}
    no unit roots, the series is stationary

    As you can see from the output above, the series of first differences is close to stationary. To be sure, let's split it into two intervals and verify that the mathematical expectations (means) are equal on them:

    m = otg1diff.index[len(otg1diff.index) // 2 + 1]
    r1 = sm.stats.DescrStatsW(otg1diff[m:])
    r2 = sm.stats.DescrStatsW(otg1diff[:m])
    print('p-value: ', sm.stats.CompareMeans(r1, r2).ttest_ind()[1])
    

    p-value: 0.693072039563

    The high p-value allows us to assert that the null hypothesis of equal means is not rejected, which indicates the stationarity of the series. It remains to make sure there is no trend; for this, let's plot our new series:

    otg1diff.plot(figsize=(12,6))


    The trend is indeed absent, so the series of first differences is stationary, and our original series is an integrated series of order one.

    Building a time series model


    For modeling we will use the ARIMA model, built for the series of first differences.
    To build the model we need to know its order, which consists of three parameters:
    1. p is the order of the AR component
    2. d is the order of integration
    3. q is the order of the MA component


    The parameter d is already known: it equals 1. It remains to determine p and q. To determine them, we need to study the autocorrelation (ACF) and partial autocorrelation (PACF) functions of the series of first differences.
    ACF will help us determine q, because from its correlogram we can determine the number of autocorrelation coefficients that differ significantly from 0 in the MA model.
    PACF will help us determine p, because from its correlogram we can determine the maximum coefficient number that differs significantly from 0 in the AR model.
    To build the corresponding correlograms, the statsmodels package provides the functions plot_acf() and plot_pacf(). They display the ACF and PACF graphs, with lag numbers plotted along the X axis and the values of the corresponding functions along the Y axis. Note that the lags parameter of these functions determines the number of lags displayed. So, our functions look like this:

    fig = plt.figure(figsize=(12,8))
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(otg1diff.values.squeeze(), lags=25, ax=ax1)
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(otg1diff, lags=25, ax=ax2)


    After studying the PACF correlogram, we can conclude that p = 1, because only lag 1 on it differs significantly from zero. From the ACF correlogram we can see that q = 1, because after lag 1 the values of the function drop sharply.
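    As a cross-check (a sketch that is not part of the original article), p and q could also be selected automatically by minimizing the Akaike information criterion over a small grid of candidate orders:

    # brute-force order selection by AIC; assumes the old sm.tsa.ARIMA API used in this article
    best_aic, best_order = float('inf'), None
    for p in range(3):
        for q in range(3):
            try:
                res = sm.tsa.ARIMA(otg, order=(p, 1, q), freq='W').fit(disp=0)
                if res.aic < best_aic:
                    best_aic, best_order = res.aic, (p, 1, q)
            except Exception:
                continue  # some orders may fail to converge
    print('best order by AIC:', best_order)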
    So, now that all the parameters are known, we can build the model. However, we will not take all the data for its construction, but only a part; the data that does not go into the model will be left to verify the accuracy of our model's forecast:

    src_data_model = otg[:'2013-05-26']
    # note: sm.tsa.ARIMA is the old API; in modern statsmodels use statsmodels.tsa.arima.model.ARIMA
    model = sm.tsa.ARIMA(src_data_model, order=(1,1,1), freq='W').fit(full_output=False, disp=0)

    The trend parameter of fit() controls the presence of a constant in the model; it is not passed here, so the default (with a constant) is used. A minimal sketch of switching the constant off is shown below.
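    This sketch is not part of the original article and assumes the old statsmodels arima_model API, in which fit() accepts trend='c' (include a constant, the default) or trend='nc' (no constant):

    # the same model, but without the constant term (old-API keyword, shown for illustration)
    model_nc = sm.tsa.ARIMA(src_data_model, order=(1,1,1), freq='W').fit(trend='nc', disp=0)

    Let us now print the information about the resulting model: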

    print(model.summary())



    As can be seen from this information, all the coefficients in our model are significant, and we can proceed to its assessment.

    Analysis and assessment of the model


    We will check the residuals of this model for compliance with "white noise" and also analyze their correlogram, since this can help us identify regression components that are important to include for forecasting.
    So, the first thing we will do is conduct the Ljung-Box Q-test to test the hypothesis that the residuals are random, i.e. that they are "white noise". This test is carried out on the residuals of the ARIMA model. Thus, we first need to obtain the model residuals and build the ACF for them, and then apply the test to the resulting coefficients. With statsmodels this can be done like this:

    # the resid attribute holds the model residuals; qstat=True applies the Ljung-Box test to the coefficients
    q_test = sm.tsa.stattools.acf(model.resid, qstat=True)
    print(DataFrame({'Q-stat': q_test[1], 'p-value': q_test[2]}))

    Result

           Q-stat   p-value
    0    0.531426  0.466008
    1    3.073217  0.215109
    2    3.644229  0.302532
    3    3.906326  0.418832
    4    4.701433  0.453393
    5    5.433745  0.489500
    6    5.444254  0.605916
    7    5.445309  0.709091
    8    5.900762  0.749808
    9    6.004928  0.814849
    10   6.155966  0.862758
    11   6.299958  0.900213
    12  12.731542  0.468755
    13  14.707894  0.398410
    14  20.720607  0.145996
    15  23.197433  0.108558
    16  23.949801  0.120805
    17  24.119236  0.151160
    18  25.616184  0.141243
    19  26.035165  0.164654
    20  28.969880  0.114727
    21  28.973660  0.145614
    22  29.017716  0.179723
    23  32.114006  0.124191
    24  32.284805  0.149936
    25  33.123395  0.158548
    26  33.129059  0.192844
    27  33.760488  0.208870
    28  38.421053  0.113255
    29  38.724226  0.132028
    30  38.973426  0.153863
    31  38.978172  0.184613
    32  39.318954  0.207819
    33  39.382472  0.241623
    34  39.423763  0.278615
    35  40.083689  0.293860
    36  43.849515  0.203755
    37  45.704476  0.182576
    38  47.132911  0.174117
    39  47.365305  0.197305

    The values of this statistic and the p-values indicate that the hypothesis of the randomness of the residuals is not rejected, and most likely this process represents "white noise". For a visual check, a sketch of the residual correlogram mentioned earlier is shown below.
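    A short sketch (not in the original article) plotting the ACF of the residuals directly:

    # visual check of the residual correlogram
    fig = plt.figure(figsize=(12, 4))
    ax = fig.add_subplot(111)
    fig = sm.graphics.tsa.plot_acf(model.resid.values.squeeze(), lags=25, ax=ax)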
    Now let's calculate the coefficient of determination R^2 to understand what proportion of the variation in the observations this model explains:

    pred = model.predict('2013-05-26','2014-12-31', typ='levels')
    trn = otg['2013-05-26':]
    r2 = r2_score(trn, pred[1:32])
    print('R^2: %1.2f' % r2)

    R^2: -0.03

    The root-mean-square error (RMSE) [2] of our model:

    metrics.rmse(trn,pred[1:32])

    80919.057367642512

    The mean absolute error (MAE) [2] of the forecast:

    metrics.mae(trn,pred[1:32])

    63092.763277651895
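    As a side note, if the ml_metrics package is unavailable, both metrics can be computed with numpy directly (a sketch, equivalent to the calls above):

    # RMSE and MAE computed by hand with numpy
    err = trn.values - pred[1:32].values
    print('RMSE:', np.sqrt(np.mean(err ** 2)))
    print('MAE: ', np.mean(np.abs(err)))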

    It remains to draw our forecast on the chart:

    otg.plot(figsize=(12,6))
    pred.plot(style='r--')



    Conclusion


    As you can see from the graph, our model does not make a very good forecast. This is partly due to the outliers in the source data, which we did not remove completely, and partly due to the ARIMA module of the statsmodels package, which is quite new. The article is aimed more at showing how exactly one can analyze time series in Python. I would also like to note that the package examined today implements various methods of regression analysis very fully (I will try to show this in further articles).
    In general, the statsmodels package is quite suitable for small studies, but for serious scientific work it is still somewhat raw, and some tests and statistics are missing.

    References


    1. I.I. Eliseeva. Econometrics
    2. Comparison of time series models
