The problem with Backtesting

April 25, 2020 Risk

Backtesting is the magical technique that tells us how well a forecast model works. Test the model on history, and we have an objective way to evaluate how good the model is. But does it really work in practice?

How well do riskometers perform?

Suppose you are a junior analyst in a bank and you go to your boss with your new fancy riskometer.

Boss: "Hey Junior analyst, how good is your riskometer?"

You: "Hey, boss. Wait five years, we will observe how it works operationally and then we will know."

I think the next thing you will see is a pink slip.

So instead, we resort to backtesting. If you want a more technical discourse, hop over to my lecture slides on backtesting from my book Financial Risk Forecasting.

Imagine you have daily data on a stock from 1 January 2000, and you are running a riskometer that needs 1000 observations, or about four years of data. You then use the 1000 days from 1 January 2000 until 31 December 2003 to forecast risk for 1 January 2004. You can now compare the forecast with what actually happened that day.

You then move up by one day, to forecast risk for 2 January 2004, and repeat until today, giving you 4330 days of risk forecasts and prices.
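The rolling procedure can be sketched in a few lines of Python. This is only an illustration, not the method from the book: the riskometer here is assumed to be plain historical simulation, the returns are simulated rather than real stock data, and the names hs_var and rolling_backtest are made up for this example.

```python
import random

def hs_var(window, p=0.01):
    """Historical-simulation VaR (an assumed riskometer): the loss at the
    p-quantile of the returns in the estimation window."""
    ordered = sorted(window)
    k = max(round(len(ordered) * p) - 1, 0)  # e.g. 10th smallest of 1000 at p = 1%
    return -ordered[k]                       # report VaR as a positive loss

def rolling_backtest(returns, window_size=1000, p=0.01):
    """For each day t, forecast VaR from the previous window_size returns only,
    then record whether the realised return on day t breached the forecast."""
    forecasts, violations = [], []
    for t in range(window_size, len(returns)):
        var_t = hs_var(returns[t - window_size:t], p)
        forecasts.append(var_t)
        violations.append(returns[t] < -var_t)  # loss larger than VaR
    return forecasts, violations

# Simulated daily returns (an assumption; a real backtest would use market data).
rng = random.Random(42)
returns = [rng.gauss(0.0, 0.01) for _ in range(1000 + 4330)]

forecasts, violations = rolling_backtest(returns)
print(len(forecasts), "forecasts,", sum(violations), "violations")
```

Each forecast uses only data available before the day being forecast, which is the whole point of the exercise. The estimation window moves up one day at a time, so 5330 returns give 4330 forecasts.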

That gives you a large sample to test how good your riskometer is. Perhaps use Peter Christoffersen's coverage test if you are using Value-at-Risk, and simply count the number of violations and compare them to what you expect. If you don't like the coverage test, there are many others to choose from.

With 99% VaR, 4330 observations imply about 43 expected violations, plenty for a robust test.
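To make the counting concrete, here is a minimal sketch of an unconditional coverage test written as a likelihood ratio, the form usually attributed to Kupiec and used as the coverage component in Christoffersen's framework. The function names are made up for this example, and the violation counts of 43 and 80 are hypothetical inputs, not results from any real backtest.

```python
import math

def expected_violations(n_days, p):
    """Number of VaR violations we expect if the riskometer is correct."""
    return n_days * p

def coverage_lr(n_days, n_violations, p):
    """Likelihood-ratio statistic for unconditional coverage.
    Under the null (true violation probability equals p) it is
    asymptotically chi-squared with one degree of freedom."""
    t, v = n_days, n_violations

    def loglik(q):
        # Bernoulli log-likelihood, skipping terms that would be 0 * log(0)
        ll = 0.0
        if t - v > 0:
            ll += (t - v) * math.log(1.0 - q)
        if v > 0:
            ll += v * math.log(q)
        return ll

    p_hat = v / t  # observed violation frequency
    return -2.0 * (loglik(p) - loglik(p_hat))

CRITICAL_5PCT = 3.841  # chi-squared(1) critical value at the 5% level

print(expected_violations(4330, 0.01))            # about 43
print(coverage_lr(4330, 43, 0.01) < CRITICAL_5PCT)  # hypothetical count: not rejected
print(coverage_lr(4330, 80, 0.01) > CRITICAL_5PCT)  # hypothetical count: rejected
```

Note that the statistic only looks at how many violations there were, not when they happened or how large they were.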

So all is good? No. There are two problems:

1. Backtests look at how a riskometer forecasts history when the operator knows the history and has a vested interest in either passing or rejecting a riskometer.
2. We can backtest a riskometer relative to a benchmark. However, we have to assume the benchmark, and it has to be relevant for our objective.

The first problem is that anybody who designs a riskometer knows what happened. If I'm testing it over the 4330 days since 1 January 2004, I know that 2004, 2005 and 2006 were low-volatility, low-risk years, and that a financial crisis hit in the autumn of 2008. I can hand-pick a riskometer that will perform particularly well over this testing period. I don't have to run it to find out; any experienced designer of riskometers will just know which to pick.

And that automatically biases the backtest.

It's a problem that would be familiar to historians in the Soviet Union, who became experts in forecasting history, because history justified decisions and the survival of the historian depended on how well their history supported the Kremlin.

The second problem is more complex. We have to test the riskometer relative to a benchmark, but what is to say the benchmark is relevant to us? If it is a coverage test, we count violations, but maybe what matters more is the clustering of violations or magnitude or something else.
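A toy example shows why the choice of benchmark matters. The two loss series below are invented for illustration, as is the summarize helper. Both breach a VaR of 2.0 exactly three times, so a test that only counts violations cannot tell them apart, yet they describe very different risks.

```python
def summarize(losses, var):
    """Return (violation count, back-to-back violations, mean loss beyond VaR)."""
    hits = [loss > var for loss in losses]
    count = sum(hits)
    # Pairs of consecutive days that are both violations: a crude clustering measure
    clustered = sum(1 for a, b in zip(hits, hits[1:]) if a and b)
    excess = [loss - var for loss in losses if loss > var]
    mean_excess = sum(excess) / len(excess) if excess else 0.0
    return count, clustered, mean_excess

var = 2.0  # hypothetical VaR forecast, held constant for simplicity

# Invented daily losses: same number of violations, different timing and size
spread = [0.5, 2.5, 0.3, 0.1, 2.5, 0.4, 0.2, 2.5, 0.6, 0.1]
bunched = [0.5, 0.3, 0.1, 0.4, 6.0, 8.0, 7.0, 0.2, 0.6, 0.1]

print(summarize(spread, var))   # (3, 0, 0.5)
print(summarize(bunched, var))  # (3, 2, 5.0)
```

Which of the three numbers is the right benchmark depends on what you are using the riskometer for, and that is an assumption the backtest cannot make for you.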

Ultimately, backtests aren't scientific, as I discussed recently.

That doesn't mean they aren't useful. Just take them with a grain of salt, and please don't base your market risk regulations on them.
