Time Series Analysis Part 5 – Oxford Temperature
Over the past four articles the focus has been on applying time series techniques to generated data. The last article walked through the step-by-step construction of a multiple linear regression model incorporating time lags and trend as features, resulting in a model with an MSE comparable to the SARIMA models we’ve been working with.
Here, for the first time in this series, we’ll be working with real-life data: specifically, the monthly maximum temperature from the Oxford Temperature Data Since 1853 (from the Met Office).
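The Met Office publishes its historic station data as whitespace-separated text files. The column layout sketched below (year, month, tmax, tmin, air-frost days, rain, sun) follows the standard Met Office station-data format, but the exact file contents and values here are illustrative assumptions, not taken from the article; in practice you would download the Oxford station file from the Met Office website.

```python
import io
import pandas as pd

# A few illustrative rows in the Met Office station-data layout (assumed:
# year, month, tmax, tmin, air-frost days, rain, sun). The real analysis
# would read the downloaded Oxford station file instead of this string.
sample = """\
1853    1     8.4    2.7    4   62.8   ---
1853    2     3.2   -1.8   19   29.3   ---
1853    3     7.7   -0.6   20   25.9   ---
"""

# Whitespace-separated columns; '---' marks missing values in these files.
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",
    names=["year", "month", "tmax", "tmin", "af_days", "rain", "sun"],
    na_values=["---"],
)

tmax = df["tmax"]  # the monthly maximum temperature series analysed below
print(tmax.tolist())
```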
This is what the data looks like:
It’s quite difficult to see seasonality here, or anything else we can relate to, because we are visualising too much information at once. Here’s the same data restricted to the last 5 years:
There appears to be a 12-month seasonality here. Over the past 5 years the maximum temperature has ranged from about 7 to 25 degrees.
The data has a total of 2032 data points, which we split into a Train set (1828) and a Test set (204).
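Because the data is ordered in time, the split must be chronological rather than random: the most recent 204 months form the Test set. A minimal sketch, using a synthetic stand-in for the real series:

```python
import numpy as np

# Synthetic stand-in for the 2032 monthly observations; the real series
# would come from the parsed Met Office file.
series = np.arange(2032, dtype=float)

# Chronological split: the last 204 points are held out as the Test set,
# so the model is always evaluated on months after those it trained on.
n_test = 204
train, test = series[:-n_test], series[-n_test:]
print(len(train), len(test))
```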
SARIMA
If we look at the ACF plot (autocorrelogram) we can see a seasonality of 12 months:
Differencing the data and removing the seasonality, then plotting the ACF and PACF, we see evidence of an ARMA(1,2) process:
The seasonal component appears to be ARMA(1,3), based on the seasonal ACF and PACF plots below:
Performing a stepwise search over the SARIMA parameters, scoring each candidate by its MSE on a hold-out set and starting from SARIMA(1,0,3)(1,0,3,12), we arrive at the best-performing model, SARIMA(3,1,1)(1,0,1,12). Note that the hold-out set here is not the Test set of 204 data points we set aside at the beginning:
This model has an average MSE of 2.8 under 10-fold time series cross-validation (the same approach as in Time Series Analysis Part 3 – Assessing Model Fit). In each fold the model predicts on data it has not seen before (i.e. the original Test set we set aside). The fit on the Test set looks like this:
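The cross-validation scheme described above can be sketched with scikit-learn's TimeSeriesSplit: each fold trains on an expanding window and scores on the data immediately after it. To keep the example fast, a seasonal-naive forecast (repeat the value from 12 months earlier) stands in for the fitted SARIMA model:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Synthetic monthly stand-in for the Oxford series.
rng = np.random.default_rng(3)
months = np.arange(480)
series = 16 + 9 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, 480)

# 10-fold time series CV: folds never score on data from before the
# training window, preserving temporal order.
fold_mses = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=10).split(series):
    preds = series[test_idx - 12]  # seasonal-naive: value one year before
    fold_mses.append(mean_squared_error(series[test_idx], preds))

print(np.mean(fold_mses))
```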
Linear Regression
Here we loop through different numbers of lags and seasonal components in a regression and identify the combination of features that results in the lowest MSE on a hold-out set. Doing this, we find that the optimal parameters are a lag of 2 and a seasonal lag of 1:
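The search loop can be sketched as follows; the `make_features` helper and the search ranges are illustrative assumptions, and a synthetic series again stands in for the Oxford data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
months = np.arange(480)
series = 16 + 9 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, 480)

def make_features(y, n_lags, n_seasonal):
    """Design matrix of recent lags plus 12-month seasonal lags."""
    max_lag = max(n_lags, 12 * n_seasonal)
    cols = [y[max_lag - k : len(y) - k] for k in range(1, n_lags + 1)]
    cols += [y[max_lag - 12 * k : len(y) - 12 * k]
             for k in range(1, n_seasonal + 1)]
    return np.column_stack(cols), y[max_lag:]

# Grid-search lag counts, scoring each regression on a hold-out window.
best = None
for n_lags in range(1, 4):
    for n_seasonal in range(1, 3):
        X, y = make_features(series, n_lags, n_seasonal)
        split = len(y) - 60  # last 60 points as the hold-out set
        model = LinearRegression().fit(X[:split], y[:split])
        mse = mean_squared_error(y[split:], model.predict(X[split:]))
        if best is None or mse < best[0]:
            best = (mse, n_lags, n_seasonal)

print(best)  # (hold-out MSE, n_lags, n_seasonal)
```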
The regression output is shown below (only significant coefficients are retained, to improve generalisability):
The fit of this model on the Test set doesn’t look too bad:
However, the 10-fold time series cross-validated average MSE over the Test set is 4.3, a little higher than the 2.8 obtained using SARIMA.
Next up, we continue the journey by entering the realm of Neural Networks…