After completing this reading you should be able to:
 Define backtesting and exceptions and explain the importance of backtesting VaR models.
 Explain the significant difficulties in backtesting a VaR model.
 Verify a model based on exceptions or failure rates.
 Define and identify Type I and Type II errors.
 Explain the need to consider conditional coverage in the backtesting framework.
 Describe the Basel rules for backtesting.
In this chapter, the accuracy of VaR models is verified by backtesting techniques. Backtesting is a formal statistical framework that verifies that actual losses are in line with the projected losses. This is achieved by systematically comparing the history of VaR forecasts with their associated portfolio returns. VaR risk managers and users find these procedures, also called reality checks, essential for checking that their VaR forecasts are well calibrated.
What is Backtesting?
Backtesting is the process of comparing losses predicted by a value at risk (VaR) model to those actually experienced over the testing period. It is done to ensure that VaR models are reasonably accurate. Risk managers systematically check the validity of the underlying valuation and risk models by comparing actual to predicted levels of losses.
The overall goal of backtesting is to ensure that actual losses do not exceed the projected losses more often than the chosen confidence level implies. Exceptions are the observations in which the actual loss exceeds the VaR estimate. In the context of VaR, the proportion of exceptions should not exceed one minus the confidence level. For instance, exceptions should occur about 1% of the time if the level of confidence is 99%. Exceptions are also called exceedances.
For a model that is perfectly calibrated, the observed proportion of exceptions should be approximately equal to the VaR significance level (one minus the confidence level).
 When too many exceptions are observed, the model involved is not calibrated properly and underestimates risk. The regulator may respond by imposing fines and/or additional capital requirements so as to shield risk-taking units from possible financial strain.
 When too few exceptions are observed, this can also be a serious problem. It would normally imply that the model overestimates risk and that the bank is not allocating funds to risk-taking units in an efficient manner. Such a scenario puts managers on a collision course with shareholders/investors.
Examples
 Over 100 days, a good 95.0% VaR model will produce approximately 5.0% * 100 days = 5 exceptions
 Over 1,000 days, a good 99.0% VaR model will produce approximately 1.0% * 1,000 days = 10 exceptions
 Assuming there are 252 trading days in a year, a 95.0% daily VaR should be exceeded about 13 days per year; 5% * 252 days = 12.6 days
 Assuming there are 252 trading days in a year, a 99.0% daily VaR should be exceeded about 10 days over a four-year period; 1% * 4 * 252 days = 10.08 days
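The arithmetic in these examples is just the sample size multiplied by one minus the confidence level. A minimal Python sketch reproduces the four figures; the function name `expected_exceptions` is our own label, not something defined in the reading.

```python
def expected_exceptions(confidence_level: float, n_days: int) -> float:
    """Expected number of VaR exceptions if the model is correctly calibrated."""
    return (1.0 - confidence_level) * n_days


# (confidence level, number of trading days) for the four examples above
cases = [(0.95, 100), (0.99, 1000), (0.95, 252), (0.99, 4 * 252)]

for conf, days in cases:
    print(f"{conf:.0%} VaR over {days} days: "
          f"{expected_exceptions(conf, days):.2f} expected exceptions")
```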
Backtesting is also important for the following reasons:
 Backtesting gives a “reality check” on whether VaR forecasts are properly calibrated or accurate. Too many exceptions should prompt a recalibration of the model and a thorough re-examination of assumptions, parameters, and the entire modeling process.
 Backtesting provides the Basel Committee with a critical evaluation technique to test the adequacy of internal VaR models.
 Backtesting helps bank regulators to verify risk models used by subject banks and identify banks using models that may potentially underestimate risk, thereby endangering the financial health of not just the bank but the industry in general. (This is particularly true for systemically important banks, SIBs.)
It is not uncommon to find banks with excessive exceptions (more than four exceptions in a sample size of 250) being penalized with higher capital requirements.
Difficulties in Backtesting a VaR Model
There are several things that make backtesting a difficult task for risk managers.
First, VaR models are based on static portfolios, but in reality, actual portfolio compositions are in a constant state of change to reflect daily gains/losses, expenses, and buy/sell decisions. For this reason, the risk manager should track both the actual portfolio return and the hypothetical (static) return. In some instances, it may also make sense to carry out backtesting using a “clean return” instead of the actual return. Clean return is actual return minus all non-mark-to-market items like fees, commissions, and net interest income.
Second, the sample backtested may not be representative of the true underlying risk. Since the backtesting period is just a limited sample, it would be a stretch of reality to expect the predicted number of exceptions in every sample. At the end of the day, backtesting remains a statistical process with accept/reject decisions.
Verifying a Model Based on Exceptions or Failure Rates
For a model to be completely accurate, the proportion of exceptions would have to be the same as the VaR significance level, where significance is one minus the confidence level. We have already established that the backtesting period constitutes a limited sample, which means it would be unrealistic to expect to find the model-predicted number of exceptions in every sample. In other words, there are instances where the observed number of exceptions will not be the same as that predicted by the model, but that does not necessarily mean that the model is flawed. As such, we must establish the level (point) at which we reject the model.
We verify a model by recording the failure rate, which represents the proportion of times VaR is exceeded in a given sample. Under the null hypothesis of a correctly calibrated model (Null \(\text H_0\): correct model), the number of exceptions (x) follows a binomial probability distribution:
$$ \text f(\text x) ={}^{\text T} \text C_{\text x}\, {\text p}^{\text x} (1-\text p)^{\text T-\text x} $$
Where T is the sample size and p is the probability of exception (p = 1 – confidence level).
The expected value of x is p*T and its variance is \(\sigma^2(\text x) = \text p*(1-\text p)*{\text T}\).
The inherent assumption here is that exceptions (failures) are independent and identically distributed (i.i.d.) random variables.
If we use N to represent the number of exceptions, the failure rate is given by N/T.
Example 1: Computing the probability of exceptions
What is the probability of observing x = 0 exceptions out of a sample of T = 250 observations when the true probability (p) is 1%?
Solution
$$ \begin{align*} \text P(\text X=\text x) &= {}^{\text T} \text C_{\text x}\, \text p^{\text x} (1-\text p)^{\text T-\text x} \\ \text P(\text X=0) & ={}^{250} \text C_0 \times 0.01^0 \times (1-0.01)^{250-0}=0.08106 {\text{ or }} 8.1\% \\ \end{align*} $$
What this means is that we would expect to observe 8.1% of samples with zero exceptions under the null hypothesis. Of course we can repeat this calculation with different values for x. For example, the probability of observing x = 5 exceptions is 6.7%:
$$ \text P(\text X=5) ={}^{250} \text C_5 \times 0.01^5 \times (1-0.01)^{250-5}=0.06663 {\text { or }} 6.7\% $$
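These binomial probabilities are easy to check numerically. The sketch below uses only the Python standard library (`math.comb`); the function name `exception_probability` is a hypothetical label chosen for illustration.

```python
from math import comb


def exception_probability(x: int, T: int, p: float) -> float:
    """Binomial probability of observing exactly x exceptions in T days,
    assuming independent exceptions with probability p each day."""
    return comb(T, x) * p**x * (1 - p) ** (T - x)


# Example 1: T = 250 observations, true exception probability p = 1%
prob_zero = exception_probability(0, 250, 0.01)
prob_five = exception_probability(5, 250, 0.01)

print(f"P(X = 0) = {prob_zero:.5f}")
print(f"P(X = 5) = {prob_five:.5f}")
```

The two printed values should agree with the 8.1% and 6.7% figures above after rounding.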
Example 2: Computing the probability of exceptions
Suppose a VaR of $100 million is calculated at a 99% confidence level. What is an acceptable probability of exception for exceeding this VaR amount?
Solution
We expect to have exceptions (losses exceeding $100m) 1% of the time (1 – 99%).
If exceptions are found to occur with greater frequency, we may be underestimating the actual risk. If exceptions are found to occur less frequently, we may be overestimating risk.
Example 3: Computing the number of exceptions
Based on a 90% confidence level, how many exceptions in backtesting a VaR would be expected over a 250-day trading year?
Solution
The expected number of exceptions is T*p, where T is the sample size and p is the probability of exception (p = 1 – confidence level).
Expected exceptions = 250 * (1 – 0.90) = 25
In other words, we expect to have exceedances (losses exceeding the 90% VaR) 10% of the time (1 – 90%). That’s 0.1 * 250 = 25 days.
Model Calibration
To test whether a model is correctly calibrated (Null \(\text H_0\): correct model), we need to calculate the z-statistic. This statistic is then compared to the tabulated critical value at the preferred level of confidence (e.g., critical value = 1.96 at 95% confidence level).
$$ \text z=\cfrac {\text x-\text{pT}}{\sqrt {\text p(1-\text p)\text T}} $$
Example 4: Model calibration
Over a 252-day period, daily returns fell below a predetermined VaR level (at the 95% confidence level) on 25 occasions. Is this sample unbiased (that is, is the model correctly calibrated)?
Solution
Null \(\text H_0\): model is unbiased
Alternative \(\text H_1\): model is not unbiased
The z-statistic is:
$$ \text z=\cfrac {\text x-\text{pT}}{\sqrt {\text p(1-\text p)\text T}}=\cfrac {25-0.05(252)}{\sqrt {0.05(1-0.05)252}}=3.5841 $$
Our statistic of 3.5841 lies outside the non-rejection region between -1.96 and +1.96 (the lower and upper \(2 \frac{1}{2}\%\) points of the normal distribution).
Therefore, we would reject the null hypothesis that the VaR model is unbiased and conclude that the maximum number of exceptions has been exceeded.
Note that the level of evidence against the null hypothesis is so strong that we would still reject the null even if we used a 99% level of confidence at which the critical value is 2.5758.
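The calibration test in Example 4 can be sketched in a few lines of Python. The function name `backtest_z` and the hard-coded two-sided critical values are our own choices for illustration; the formula is the z-statistic given above.

```python
from math import sqrt


def backtest_z(x: int, T: int, p: float) -> float:
    """z-statistic for x observed exceptions in T days when the
    expected exception probability is p (normal approximation)."""
    return (x - p * T) / sqrt(p * (1 - p) * T)


# Example 4: 25 exceptions in 252 days, VaR at the 95% confidence level
z = backtest_z(x=25, T=252, p=0.05)

# Two-sided critical values of the standard normal distribution
for test_confidence, critical in [(0.95, 1.96), (0.99, 2.5758)]:
    decision = "reject" if abs(z) > critical else "fail to reject"
    print(f"z = {z:.4f}: {decision} the null at {test_confidence:.0%} test confidence")
```

Because the test is two-sided, the absolute value of z is compared with the critical value, so both too many and too few exceptions can lead to rejection.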
Example 5: Model calibration
A trader in the capital markets estimates the one-day VaR, at the 95% confidence level, to be USD 50 million. Over the past 250 days, the USD 50 million loss mark has been breached 11 times. Is the model unbiased?
Solution
Null \(\text H_0\): model is unbiased
Alternative \(\text H_1\): model is not unbiased
$$ \text{The z-statistic is}: \text z=\cfrac {\text x-\text {pT}}{\sqrt {\text p(1-\text p)\text T}}=\cfrac {11-0.05(250)}{\sqrt {0.05(1-0.05)250} }=-0.4353 $$
Our statistic of -0.4353 lies inside the non-rejection region between -1.96 and +1.96 (the lower and upper \(2 \frac{1}{2}\%\) points of the normal distribution). Therefore, we have insufficient evidence against the null hypothesis and cannot reject the hypothesis that the model is unbiased.
It is important to note a few things:
 For purposes of backtesting, the risk manager should choose a confidence level, c, that is not too high. For instance, suppose we pick c = 99.99%. On average, we would expect one exceedance in 10,000 trading days, or around 40 years. This would make it practically impossible to verify whether the true coverage of the VaR is indeed 99.99%. The usual recommendation, therefore, is to pick a confidence level of 95% or 99%.
 The confidence level at which we choose to reject or fail to reject a model is in no way related to the confidence level at which the VaR was calculated. For instance, we could have a VaR computed at 95% confidence and choose to reject or fail to reject the model at 90% confidence.
Type I and Type II errors
Too many exceptions indicate that either the model is understating VaR or the trader is unlucky. Similarly, too few exceptions indicate that either the model is overstating VaR or the trader is lucky. This raises the question: How do we decide which explanation is more likely?
It follows that any statistical testing framework must account for two types of errors:
Type I error: The probability of rejecting a correct model due to bad luck. In other words, the analyst mistakenly rejects the null.
The probability of a Type I error is represented by \(\alpha\), the level of significance.
Type II error: The probability of not rejecting a model that is false. The analyst mistakenly fails to reject the null.
Type II error is denoted \(\beta\)
The power of a test is the probability of rejecting the null hypothesis when it is false, so that the power of the test = 1 – \(\beta\)
$$ \begin{array}{c} { \text{Null Hypothesis: Model is correctly calibrated}} \\ \end{array} $$
$$ \begin{array}{ccc} { \text{Decision} } & {\text{Null is correct}} & {\text{Null is incorrect}} \\ \hline {\text{Fail to reject}} & {\text{Good decision}} & {\text{Type II error}} \\ \hline {\text{Reject}} & {\text{Type I error}} & {\text{Good decision}} \\ \end{array} $$
The model verification test involves a tradeoff between Type I and Type II errors. Ideally, the risk manager sets a low Type I error rate and then uses a test that also has a very low Type II error rate, in which case the test is said to be powerful.
It is important to select a significance level that takes account of the likelihood of these errors (and, in theory, their costs as well) and strikes an appropriate balance between them. In practice, however, we usually select some arbitrary significance level such as 5% and apply it in all our tests. Why 5%? This level gives the model a reasonable benefit of the doubt: the model is rejected only if the evidence against it is reasonably strong.
The decision to fail to reject the null hypothesis following an analysis of backtest results comes with the risk of a type II error because it remains statistically possible for a bad VaR model to produce an unusually low number of exceptions.
The decision to reject the null hypothesis following an analysis of backtest results comes with the risk of a type I error because it remains statistically possible for a good VaR model to produce an unusually high number of exceptions.
A test can be said to be reliable if it is likely to avoid both types of error when used with an appropriate significance level.
The figures below illustrate the two types of errors. Consider an example where daily VaR is computed at a 99% confidence level over a 250-day horizon. Assuming that the model is correct, the expected number of days when losses exceed VaR estimates is 250 ∗ 0.01 = 2.5. If we set the cutoff for rejecting the model at five or more exceptions, the probability of committing a Type I error is 10.8%.
On the other hand, if the model has an incorrect coverage of 97%, the expected number of exceptions is 250 ∗ 0.03 =7.5.
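Both error probabilities in this example follow from the binomial distribution. The sketch below assumes the rejection rule is "reject when there are five or more exceptions"; the helper `binom_cdf` and the variable names are our own. The Type II figure is computed here for the 97%-coverage alternative and is not quoted in the text above.

```python
from math import comb


def binom_cdf(k: int, T: int, p: float) -> float:
    """P(X <= k) for a binomial variable with T trials and probability p."""
    return sum(comb(T, i) * p**i * (1 - p) ** (T - i) for i in range(k + 1))


T = 250      # backtesting window in days
cutoff = 5   # assumed rule: reject the model if exceptions >= cutoff

# Type I error: the model is correct (p = 1%) but we see 5 or more exceptions
type_1 = 1 - binom_cdf(cutoff - 1, T, 0.01)

# Type II error: true coverage is only 97% (p = 3%) but we see 4 or fewer
type_2 = binom_cdf(cutoff - 1, T, 0.03)

print(f"Type I error probability:  {type_1:.1%}")
print(f"Type II error probability: {type_2:.1%}")
```

Under these assumptions the Type I probability matches the 10.8% quoted above, and the Type II probability comes out at roughly 12.8%. Raising the cutoff lowers the first number and raises the second, which is the tradeoff described in this section.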
The Need for Conditional Coverage in the Backtesting Framework
Up to this point, we have looked at backtesting based on unconditional coverage, in which the timing of exceptions has not been considered. Unconditional coverage falls in line with the “independence” assumption, where we treat the probability of tomorrow’s exception as independent of today’s exception.
In reality, however, there could be time variation in the way the exceptions are observed. Conditional coverage allows us to take account of factors that unconditional coverage ignores. Actual exceptions could cluster or bunch closely in time such that if we take the 95% VaR, for instance, the 13 expected exceptions over a 250-day period could all occur within a single month. A bunching of exceptions may be indicative of a change in market correlations or the alteration of trading positions. Consequently, it is important to have a framework that helps us determine whether the bunching is purely random or caused by one of these events.
P.F. Christoffersen developed a measure of conditional coverage that allows for potential time variation of the data. It is basically an extension of the unconditional coverage test statistic, \(\text {LR}_{\text{UC}}\). The overall log-likelihood test statistic for conditional coverage is computed as:
$$ \text {LR}_{\text{CC}} = \text {LR}_{\text{UC}} + \text {LR}_{\text{ind}} $$
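Under the null hypothesis, each component is asymptotically chi-squared distributed with one degree of freedom, and the two are independent, so their sum \(\text {LR}_{\text{CC}}\) is chi-squared distributed with two degrees of freedom.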
At a confidence level of 95%, for instance, we would reject the model if \(\text {LR}_{\text{CC}} > 5.99\). We would reject the independence term alone if \(\text {LR}_{\text{ind}} > 3.84\). If exceptions are found to be serially dependent, the model should be re-examined to account for the correlations in the data.
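The independence test is built from counts of day-to-day transitions between "no exception" (0) and "exception" (1). The sketch below shows how those counts, and the two conditional exception rates, could be extracted from a 0/1 exception series. The series `hits` is made-up illustrative data, and the names `transition_counts`, `pi_0` and `pi_1` are our own labels.

```python
def transition_counts(hits: list[int]) -> list[list[int]]:
    """Return counts[i][j]: number of days in state j that follow a day
    in state i, where state 1 means a VaR exception occurred."""
    counts = [[0, 0], [0, 0]]
    for previous, current in zip(hits, hits[1:]):
        counts[previous][current] += 1
    return counts


# Illustrative (made-up) exception series with a cluster of three exceptions
hits = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
counts = transition_counts(hits)

# Exception rate after a quiet day versus after an exception day
pi_0 = counts[0][1] / (counts[0][0] + counts[0][1])
pi_1 = counts[1][1] / (counts[1][0] + counts[1][1])

print("transition counts:", counts)
print(f"P(exception | no exception yesterday) = {pi_0:.3f}")
print(f"P(exception | exception yesterday)    = {pi_1:.3f}")
```

If exceptions were independent over time, the two conditional rates would be close to each other. A much higher rate after an exception day is the clustering that conditional coverage is designed to detect.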
Basel Rules for Backtesting
In a bid to make banks more observant and adherent to the highest level of risk management, the Basel Committee continually releases guidelines on a range of issues. In line with its mandate, the committee has put in place a framework based on the daily backtesting of VAR.
Current guidelines require banks to record daily exceptions to the 99.0% VaR over the previous year. For 250 observations, the expected number of exceptions is 0.01 * 250 = 2.5.
Basel rules put the number of exceptions into three categories:
 The green zone is defined as having up to four exceptions – an acceptable number. The green zone is effectively a “green light” for the bank to continue using the existing model in risk management. Any bank within this zone does not attract penalties imposed in form of a multiplicative (“scaling”) factor (k).
 The yellow zone is defined as having between five and nine exceptions. Unlike the green zone, the yellow zone may attract a penalty and is effectively a call to action to go back to the drawing board and develop a better model. The specific scaling factor to be imposed is left to the discretion of supervisors and to a large extent depends on the reason for the exceptions.
 The red zone is defined as having 10 or more exceptions. Unlike the yellow zone, it comes with an automatic penalty, where the scaling factor is increased from 3 to 4. The penalty is imposed in a non-discretionary manner. The red zone requires the bank to undertake an extensive review of its model, since it would be extremely unlikely to observe 10 or more exceptions if the model were indeed correct. In some cases, the bank may be forced to abandon the model altogether.
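The three zones amount to a simple lookup on the exception count for a 250-day window. A minimal sketch, with `basel_zone` as a hypothetical function name, is shown below. It encodes only the thresholds stated above and does not attempt to model the supervisor's discretionary penalty in the yellow zone.

```python
def basel_zone(exceptions: int) -> str:
    """Classify a 250-day count of 99% VaR exceptions into a Basel zone."""
    if exceptions < 0:
        raise ValueError("the number of exceptions cannot be negative")
    if exceptions <= 4:
        return "green"    # model accepted, no penalty
    if exceptions <= 9:
        return "yellow"   # penalty at the supervisor's discretion
    return "red"          # automatic penalty: scaling factor raised from 3 to 4


for n in (2, 4, 5, 9, 10):
    print(f"{n} exceptions -> {basel_zone(n)} zone")
```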
As noted earlier, the supervisor enjoys some discretion in the application of penalties for exceptions falling within the yellow zone. The Basel Committee uses these categories:

Basic integrity of the model: This implies a bank’s systems are poor at capturing the risks of the various positions taken. For instance, correlations may have been misspecified.
Committee’s guidance: This is a very serious flaw. The scaling factor penalty should apply, and immediate corrective action should follow. For instance, the supervisor may be required to authorize a substantial review of the model and take action to ensure that this occurs.
Deficient model accuracy: This implies that the model does not measure risk exposure of some instruments with enough precision.
Committee’s guidance: A lack of model precision is a fairly common flaw that occurs in most risk measurement models. Indeed, no single model is fully immune from some kind of imprecision. If there’s reason to believe that a bank’s model accuracy is significantly wanting when compared to other banks, the supervisor should impose the plus factor and set in motion any other necessary corrective action. 
Intraday trading: The exceptions occurred due to trading activity that took place within a 24-hour period. It could be a large (money-losing) trading event that happened between the end of the first day and the end of the second day.
Committee’s guidance: If the exception disappears with the hypothetical return, the problem is not in the bank’s VaR model. Nonetheless, a penalty “should be considered.”
Bad luck: Either the markets moved more than the model predicted was likely, or the markets did not move together as expected.
Committee’s guidance: Markets may move in an unanticipated fashion from time to time. These types of exceptions “should be expected to occur at least some of the time.” Even among “accurate” models, a 100% market movement prediction rate is nearly impossible. No VaR model is immune from bad luck.
Question 1
A risk manager observed the following pattern of exceptions over a particular year: 23 exceptions in 252 days, giving an unconditional exception fraction \(\pi \). Seven of these exceptions occurred following an exception the previous day, while the other 16 occurred when there was no exception the previous day. Write an expression for the relevant test statistic \({ LR }_{ ind }\).
 A. \({ LR }_{ ind }=-2\ln\left\{ { \left( 1-0.091 \right) }^{ \left( { T }_{ 00 }+{ T }_{ 10 } \right) }{ 0.091 }^{ \left( { T }_{ 01 }+{ T }_{ 11 } \right) } \right\} +2\ln\left\{ { \left( 1-0.0635 \right) }^{ { T }_{ 00 } }{ 0.0635 }^{ { T }_{ 01 } }{ \left( 1-0.304 \right) }^{ { T }_{ 10 } }{ 0.304 }^{ { T }_{ 11 } } \right\} \)
 B. \({ LR }_{ ind }=-2\ln\left\{ { \left( 1-0.091 \right) }^{ \left( { T }_{ 00 }+{ T }_{ 10 } \right) }{ 0.091 }^{ \left( { T }_{ 01 }+{ T }_{ 11 } \right) } \right\} +2\ln\left\{ { \left( 1-0.0699 \right) }^{ { T }_{ 00 } }{ 0.0699 }^{ { T }_{ 01 } }{ \left( 1-0.304 \right) }^{ { T }_{ 10 } }{ 0.304 }^{ { T }_{ 11 } } \right\} \)
 C. \({ LR }_{ ind }=-2\ln\left\{ { \left( 1-0.091 \right) }^{ \left( { T }_{ 00 }+{ T }_{ 10 } \right) }{ 0.091 }^{ \left( { T }_{ 01 }+{ T }_{ 11 } \right) } \right\} +2\ln\left\{ { \left( 1-0.0635 \right) }^{ { T }_{ 00 } }{ 0.0635 }^{ { T }_{ 01 } }{ \left( 1-0.438 \right) }^{ { T }_{ 10 } }{ 0.438 }^{ { T }_{ 11 } } \right\} \)
 D. \({ LR }_{ ind }=-2\ln\left\{ { \left( 1-0.091 \right) }^{ \left( { T }_{ 00 }+{ T }_{ 10 } \right) }{ 0.091 }^{ \left( { T }_{ 01 }+{ T }_{ 11 } \right) } \right\} +2\ln\left\{ { \left( 1-0.0699 \right) }^{ { T }_{ 00 } }{ 0.0699 }^{ { T }_{ 01 } }{ \left( 1-0.438 \right) }^{ { T }_{ 10 } }{ 0.438 }^{ { T }_{ 11 } } \right\} \)
The correct answer is B.
Solution
Recall that the relevant test statistic is:
$$ { LR }_{ ind }=-2\ln\left[ { \left( 1-\pi \right) }^{ \left( { T }_{ 00 }+{ T }_{ 10 } \right) }{ \pi }^{ \left( { T }_{ 01 }+{ T }_{ 11 } \right) } \right] +2\ln\left[ { \left( 1-{ \pi }_{ 0 } \right) }^{ { T }_{ 00 } }{ \pi }_{ 0 }^{ { T }_{ 01 } }{ \left( 1-{ \pi }_{ 1 } \right) }^{ { T }_{ 10 } }{ \pi }_{ 1 }^{ { T }_{ 11 } } \right] $$
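Here \({ T }_{ ij }\) is the number of days in state j that follow a day in state i, where state 1 denotes an exception and state 0 denotes no exception. Under the null hypothesis of independence,
$$ \pi ={ \pi }_{ 0 }={ \pi }_{ 1 }={ \left( { T }_{ 01 }+{ T }_{ 11 } \right) }/{ T } $$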
Therefore:
$$ \pi ={ \left( { T }_{ 01 }+{ T }_{ 11 } \right) }/{ T }={ \left( 16+7 \right) }/{ 252 }=0.091 $$
And:
\( { \pi }_{ 0 }={ 16 }/{ 229 }=0.0699 \) which is \(6.99\) percent,
\({ \pi }_{ 1 }={ 7 }/{ 23 }=0.304 \) which is \(30.4\) percent.
\( \Rightarrow { LR }_{ ind }=-2\ln\left\{ { \left( 1-0.091 \right) }^{ \left( { T }_{ 00 }+{ T }_{ 10 } \right) }{ 0.091 }^{ \left( { T }_{ 01 }+{ T }_{ 11 } \right) } \right\} +2\ln\left\{ { \left( 1-0.0699 \right) }^{ { T }_{ 00 } }{ 0.0699 }^{ { T }_{ 01 } }{ \left( 1-0.304 \right) }^{ { T }_{ 10 } }{ 0.304 }^{ { T }_{ 11 } } \right\} \)
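To see what this expression evaluates to, the sketch below computes the independence statistic from the four transition counts. The counts \({ T }_{ 00 }=213\) and \({ T }_{ 10 }=16\) are not stated in the question; they are inferred here from the denominators 229 and 23 used in the solution, so treat them as an assumption. The function name `lr_independence` is our own.

```python
from math import log


def lr_independence(t00: int, t01: int, t10: int, t11: int) -> float:
    """Likelihood-ratio statistic for independence of exceptions,
    computed from the day-to-day transition counts T_ij."""
    total = t00 + t01 + t10 + t11
    pi = (t01 + t11) / total      # unconditional exception rate
    pi_0 = t01 / (t00 + t01)      # exception rate after a non-exception day
    pi_1 = t11 / (t10 + t11)      # exception rate after an exception day

    # Log-likelihood under the null hypothesis of independence
    log_l_null = (t00 + t10) * log(1 - pi) + (t01 + t11) * log(pi)
    # Log-likelihood allowing the rate to depend on the previous day
    log_l_alt = (t00 * log(1 - pi_0) + t01 * log(pi_0)
                 + t10 * log(1 - pi_1) + t11 * log(pi_1))
    return -2 * log_l_null + 2 * log_l_alt


# Counts assumed from the solution: 229 quiet days (16 followed by an
# exception) and 23 exception days (7 followed by another exception)
lr_ind = lr_independence(t00=213, t01=16, t10=16, t11=7)
print(f"LR_ind = {lr_ind:.2f}")
```

With these assumed counts the statistic comes out close to 9.7, which is above 3.84, the 95% critical value of a chi-squared distribution with one degree of freedom. On that basis independence would be rejected, which is consistent with the visible clustering (a 30.4% exception rate after an exception day against 6.99% otherwise).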