Residual Analysis and Diagnostic Tests
Residual analysis is how you check whether your time series model is actually doing its job. After fitting a model, the residuals (the differences between what you observed and what the model predicted) should look like random noise. If they don't, your model is missing something. This section covers the main plots and statistical tests you'll use to diagnose problems and decide how to fix them.
Importance of Residual Analysis
A well-specified time series model should extract all the systematic patterns from the data, leaving behind only unpredictable noise. Residual analysis tests whether that's actually happened.
- Residuals are the differences between observed values and model-fitted values: $e_t = y_t - \hat{y}_t$, where $y_t$ is the observed value and $\hat{y}_t$ is the fitted value at time $t$
- If the model is good, residuals should behave like white noise: no patterns, no autocorrelation, roughly constant variance, and an approximately normal distribution
- Residual analysis also helps you spot outliers or influential observations that may be distorting your model's estimates
Think of it this way: if you can still find a pattern in the residuals, that pattern is information your model failed to capture, and you should go back and improve the model.
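As a minimal sketch of this idea (the `residuals` helper and the toy series are illustrative, not from any particular library):

```python
import numpy as np

def residuals(observed, fitted):
    """Residuals e_t = y_t - yhat_t: the part of the data the model missed."""
    return np.asarray(observed, dtype=float) - np.asarray(fitted, dtype=float)

# Toy illustration: a model that always under-predicts by 0.5 leaves
# residuals with a nonzero mean -- a systematic pattern, not white noise.
y = np.array([2.0, 4.0, 6.0, 8.0])
yhat = y - 0.5
e = residuals(y, yhat)
print(e, e.mean())   # every residual is 0.5, so the mean is 0.5, not 0
```

A constant offset like this is exactly the kind of leftover structure the diagnostics below are designed to detect.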

Interpretation of Residual Plots
Residual plots are your first line of defense. Before running any formal tests, visually inspect these graphs:
- Residuals vs. fitted values: checks for heteroscedasticity (non-constant variance) and non-linearity
  - Ideal: residuals scattered randomly around zero with no fan shape or curve
- Residuals vs. time: checks for trends, seasonal patterns, or shifts in variance over time
  - Ideal: residuals fluctuate randomly around zero with no systematic drift
- ACF and PACF of residuals: checks for leftover autocorrelation the model didn't capture
  - Ideal: nearly all spikes fall within the 95% confidence bands
When you spot a problem in these plots, it points toward a specific fix:
| Pattern You See | What It Suggests | Possible Fix |
|---|---|---|
| Fan or funnel shape in residuals vs. fitted | Non-constant variance (heteroscedasticity) | Variance-stabilizing transformation (e.g., log) or a GARCH-type model |
| Curved pattern in residuals vs. fitted | Non-linearity | Add non-linear terms or apply a transformation |
| Trend or drift in residuals vs. time | Non-stationarity not fully addressed | Additional differencing or trend terms |
| Significant spikes in ACF/PACF | Residual autocorrelation | Add AR or MA terms to the model |
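The ACF check in the table can be sketched numerically: compute the sample autocorrelations and compare them against the approximate 95% band of $\pm 1.96/\sqrt{n}$ (the `sample_acf` helper is illustrative; in practice you would use a library's ACF plot):

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelations rho_hat_k of a series for lags k = 1..nlags."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    return np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in range(1, nlags + 1)])

# White-noise "residuals": almost every spike should fall inside the band.
rng = np.random.default_rng(0)
e = rng.normal(size=500)
acf = sample_acf(e, 10)
band = 1.96 / np.sqrt(len(e))       # approximate 95% confidence band
print(np.sum(np.abs(acf) > band))   # typically 0 or 1 spikes exceed it by chance
```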

Diagnostic Tests for Autocorrelation
Visual inspection of ACF/PACF plots is useful but subjective. Formal tests give you a more rigorous answer about whether residual autocorrelation is present.
Ljung-Box Test
This is the most commonly used test for residual autocorrelation in time series. It checks multiple lags simultaneously rather than testing one lag at a time.
- Null hypothesis ($H_0$): Residuals are independently distributed (no autocorrelation up to lag $h$)
- Alternative hypothesis ($H_1$): Residuals exhibit autocorrelation at one or more lags
The test statistic is:

$$Q = n(n+2) \sum_{k=1}^{h} \frac{\hat{\rho}_k^2}{n-k}$$

where $n$ is the sample size, $h$ is the number of lags tested, and $\hat{\rho}_k$ is the sample autocorrelation of the residuals at lag $k$.
How to use it:
- Choose the number of lags $h$ to test (a common rule of thumb is $h = 10$ for non-seasonal data, or $h = 2m$ where $m$ is the seasonal period)
- Compute the $Q$ statistic from the residual autocorrelations
- Compare $Q$ to the critical value from a chi-square distribution with $h - p - q$ degrees of freedom, where $p$ and $q$ are the number of AR and MA parameters in your model
- If $Q$ exceeds the critical value (or equivalently, the p-value is below your significance level), reject $H_0$ and conclude that significant autocorrelation remains
Note on degrees of freedom: Some textbooks use $h$ degrees of freedom directly, but when testing residuals from an ARMA(p,q) model, you should subtract the number of estimated parameters ($p + q$) to account for the fact that the residuals aren't truly independent observations.
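The steps above can be sketched as a direct transcription of the Ljung-Box formula (illustrative only; for real work use a tested implementation such as statsmodels' `acorr_ljungbox`):

```python
import numpy as np
from scipy import stats

def ljung_box(resid, h, p=0, q=0):
    """Ljung-Box Q statistic and p-value for residual autocorrelation.

    h is the number of lags tested; p and q are the AR and MA orders of
    the fitted model, subtracted from the degrees of freedom (df = h - p - q).
    """
    e = np.asarray(resid, dtype=float)
    n = len(e)
    ec = e - e.mean()
    denom = np.sum(ec ** 2)
    # Sample autocorrelations rho_hat_k for k = 1..h
    rho = np.array([np.sum(ec[k:] * ec[:-k]) / denom for k in range(1, h + 1)])
    Q = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, h + 1)))
    return Q, stats.chi2.sf(Q, h - p - q)

# White-noise residuals: a small p-value here would signal leftover autocorrelation.
rng = np.random.default_rng(1)
Q, pval = ljung_box(rng.normal(size=300), h=10)
print(Q, pval)
```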
Other autocorrelation tests:
- Durbin-Watson test: tests specifically for first-order (lag-1) autocorrelation; limited because it only checks one lag
- Breusch-Godfrey test: more flexible than Durbin-Watson, can test for autocorrelation at multiple lags and works even when lagged dependent variables are present
If any of these tests detect significant autocorrelation, your model is misspecified. The typical remedy is adding AR or MA terms to capture the remaining dependence.
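The Durbin-Watson statistic is simple enough to compute by hand, which makes its lag-1 focus easy to see (a sketch, not a library routine):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: near 2 means no lag-1 autocorrelation;
    values toward 0 suggest positive, toward 4 negative, autocorrelation."""
    e = np.asarray(resid, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)
dw = durbin_watson(rng.normal(size=1000))
print(dw)   # close to 2 for white noise
```

Because it only looks at adjacent residuals, autocorrelation at lag 2 or beyond can slip past it entirely, which is why the Ljung-Box or Breusch-Godfrey tests are preferred for time series models.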
Assessment of Residual Normality
Many time series inference procedures (confidence intervals, prediction intervals, hypothesis tests on coefficients) assume the errors are normally distributed. Residual normality checks tell you whether those results are trustworthy.
Start with a visual check: a histogram of residuals and a Q-Q plot (quantile-quantile plot comparing residual quantiles to theoretical normal quantiles). If the Q-Q plot shows points roughly along a straight line, normality is reasonable.
Jarque-Bera Test
This test checks normality by looking at whether the residuals have the skewness and kurtosis you'd expect from a normal distribution (skewness = 0, kurtosis = 3).
- Null hypothesis ($H_0$): Residuals are normally distributed
- Alternative hypothesis ($H_1$): Residuals are not normally distributed
The test statistic is:

$$JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$$

where $n$ is the sample size, $S$ is the sample skewness, and $K$ is the sample kurtosis. Under $H_0$, this statistic follows a chi-square distribution with 2 degrees of freedom.
A large $JB$ value (small p-value) means the residuals deviate significantly from normality, either through asymmetry (skewness) or heavy/light tails (kurtosis).
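A minimal Jarque-Bera sketch using the moment-based skewness and kurtosis defined above (illustrative; `scipy.stats.jarque_bera` provides a tested implementation):

```python
import numpy as np
from scipy import stats

def jarque_bera(x):
    """JB = (n/6) * (S^2 + (K - 3)^2 / 4); chi-square(2) under normality."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std()   # standardize (ddof=0 matches the moment formulas)
    S = np.mean(z ** 3)            # sample skewness
    K = np.mean(z ** 4)            # sample kurtosis (3 for a normal distribution)
    JB = n / 6.0 * (S ** 2 + (K - 3.0) ** 2 / 4.0)
    return JB, stats.chi2.sf(JB, 2)

rng = np.random.default_rng(3)
jb_n, p_n = jarque_bera(rng.normal(size=2000))            # normal residuals
jb_t, p_t = jarque_bera(rng.standard_t(df=3, size=2000))  # heavy-tailed residuals
print(p_n, p_t)   # heavy tails inflate JB, driving its p-value toward zero
```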
Other normality tests include the Shapiro-Wilk test (generally more powerful for small samples) and the Anderson-Darling test.
If normality is violated, you have a few options:
- Transform the data (e.g., log or Box-Cox transformation) before fitting the model, which often stabilizes variance and improves normality simultaneously
- Use robust estimation methods that are less sensitive to non-normal errors
- Use bootstrap or distribution-free methods for inference instead of relying on normal-theory confidence intervals
Keep in mind that mild departures from normality are usually not a serious problem, especially with larger samples, because many estimators are still consistent. Severe skewness or heavy tails are more concerning and worth addressing.