Time Series Analysis Part 5 – Oxford Temperature
Over the past four articles the focus has been on applying time series techniques to generated data. The last article walked through the step-by-step construction of a multiple linear regression model incorporating time lags and trend as features, resulting in a model with an MSE comparable to the SARIMA models we’ve been working with.
Here, for the first time in this series, we’ll be working with real-life data: specifically, the monthly maximum temperature from the Oxford Temperature Data Since 1853 (from the Met Office).
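The Met Office publishes its historic station data as whitespace-separated text files. The column layout sketched below (year, month, tmax, tmin, air-frost days, rain, sun) follows the standard Met Office station-data format, but the exact file contents and values here are illustrative assumptions, not taken from the article; in practice you would download the Oxford station file from the Met Office website.

```python
import io
import pandas as pd

# A few illustrative rows in the Met Office station-data layout (assumed:
# year, month, tmax, tmin, air-frost days, rain, sun). The real analysis
# would read the downloaded Oxford station file instead of this string.
sample = """\
1853    1     8.4    2.7    4   62.8   ---
1853    2     3.2   -1.8   19   29.3   ---
1853    3     7.7   -0.6   20   25.9   ---
"""

# Whitespace-separated columns; '---' marks missing values in these files.
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",
    names=["year", "month", "tmax", "tmin", "af_days", "rain", "sun"],
    na_values=["---"],
)

tmax = df["tmax"]  # the monthly maximum temperature series analysed below
print(tmax.tolist())
```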
This is what the data looks like:
It’s quite difficult to see seasonality here, or anything else we can relate to, because we are visualising too much information at once. Here’s the same data restricted to the last 5 years:
There appears to be a 12-month seasonality here. Over the past 5 years the maximum temperature has ranged from about 7 to 25 degrees.
The data has a total of 2032 data points, which we split into a Train set (1828) and a Test set (204).
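Because the data is ordered in time, the split must be chronological rather than random: the most recent 204 months form the Test set. A minimal sketch, using a synthetic stand-in for the real series:

```python
import numpy as np

# Synthetic stand-in for the 2032 monthly observations; the real series
# would come from the parsed Met Office file.
series = np.arange(2032, dtype=float)

# Chronological split: the last 204 points are held out as the Test set,
# so the model is always evaluated on months after those it trained on.
n_test = 204
train, test = series[:-n_test], series[-n_test:]
print(len(train), len(test))
```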
SARIMA
If we look at the ACF plot (autocorrelogram) we can see a seasonality of 12 months:
Differencing the data and removing the seasonality, then plotting the ACF and PACF, we see evidence of an ARMA(1,2) process:
The seasonal component appears to be ARMA(1,3), based on the seasonal ACF and PACF plots below:
Performing a stepwise search over the SARIMA parameters, scoring each candidate by its MSE on a hold-out set and starting from SARIMA(1,0,3)(1,0,3,12), we arrive at the best-performing model, SARIMA(3,1,1)(1,0,1,12). Note that the hold-out set here is not the Test set of 204 data points we set aside at the beginning:
This model has an average MSE of 2.8 under 10-fold time series cross-validation (the same approach as in Time Series Analysis Part 3 – Assessing Model Fit). In each fold the model predicts on data it has not seen before (i.e. the original Test set we set aside). The fit on the Test set looks like this:
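The cross-validation scheme described above can be sketched with scikit-learn's TimeSeriesSplit: each fold trains on an expanding window and scores on the data immediately after it. To keep the example fast, a seasonal-naive forecast (repeat the value from 12 months earlier) stands in for the fitted SARIMA model:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Synthetic monthly stand-in for the Oxford series.
rng = np.random.default_rng(3)
months = np.arange(480)
series = 16 + 9 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, 480)

# 10-fold time series CV: folds never score on data from before the
# training window, preserving temporal order.
fold_mses = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=10).split(series):
    preds = series[test_idx - 12]  # seasonal-naive: value one year before
    fold_mses.append(mean_squared_error(series[test_idx], preds))

print(np.mean(fold_mses))
```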
Linear Regression
Here we loop through different numbers of lags and seasonal components in a regression and identify the combination of features that results in the lowest MSE on a hold-out set. Doing this, we find that the optimal parameters are a lag of 2 and a seasonal lag of 1:
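The search loop can be sketched as follows; the `make_features` helper and the search ranges are illustrative assumptions, and a synthetic series again stands in for the Oxford data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
months = np.arange(480)
series = 16 + 9 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, 480)

def make_features(y, n_lags, n_seasonal):
    """Design matrix of recent lags plus 12-month seasonal lags."""
    max_lag = max(n_lags, 12 * n_seasonal)
    cols = [y[max_lag - k : len(y) - k] for k in range(1, n_lags + 1)]
    cols += [y[max_lag - 12 * k : len(y) - 12 * k]
             for k in range(1, n_seasonal + 1)]
    return np.column_stack(cols), y[max_lag:]

# Grid-search lag counts, scoring each regression on a hold-out window.
best = None
for n_lags in range(1, 4):
    for n_seasonal in range(1, 3):
        X, y = make_features(series, n_lags, n_seasonal)
        split = len(y) - 60  # last 60 points as the hold-out set
        model = LinearRegression().fit(X[:split], y[:split])
        mse = mean_squared_error(y[split:], model.predict(X[split:]))
        if best is None or mse < best[0]:
            best = (mse, n_lags, n_seasonal)

print(best)  # (hold-out MSE, n_lags, n_seasonal)
```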
The regression output is shown below (only significant coefficients are retained, to improve generalisability):
The fit of this model on the Test set doesn’t look too bad:
However, the 10-fold time series cross-validated average MSE over the Test set is 4.3, a little higher than the 2.8 obtained using SARIMA.
Next up, we continue the journey by entering the realm of Neural Networks…