Hypothesis Testing and Confidence Intervals

After completing this reading, you should be able to:

  • Calculate and interpret the sample mean and sample variance.
  • Construct and interpret a confidence interval.
  • Construct an appropriate null and alternative hypothesis, and calculate an appropriate test statistic.
  • Differentiate between a one-tailed and a two-tailed test and identify when to use each test.
  • Interpret the results of hypothesis tests with a specific level of confidence.
  • Demonstrate the process of backtesting VaR by calculating the number of exceedances.

Sample Mean

Consider a scenario where the output of a computer's standard random number generator is multiplied by one thousand. The resulting data generating process (DGP) is a uniform random variable that ranges between 0 and 1,000, with a mean of 500.

However, if two hundred values are generated from the DGP and their sample mean is computed, the sample mean will almost certainly not be exactly 500.

One fact that candidates should be aware of is that the sample mean is itself a random variable. If the process of generating 200 points and computing their sample mean is repeated, each repetition will yield a different sample mean. On average, however, the sample mean equals the true mean. That is:

$$ E\left[ \hat { \mu } \right] =\mu $$

If more data points are generated, say 700 instead of the original 200, the sample mean will tend to be closer to the true mean. This is because a single outlier has a much smaller effect on a large pool of data points than on a pool of only 200.

Equivalently, the sample mean has a lower standard deviation when more data points are used than when fewer, say 200, are used.

Moreover, the variance of the sample mean does not simply fall as the sample size grows; it falls in a predictable way, in inverse proportion to the sample size.

Suppose we have a sample of \(n\) \(i.i.d\) observations drawn from a distribution whose true variance is \({ \sigma }^{ 2 }\). Then the variance of the sample mean is given by the following formula:

$$ { \sigma }_{ \hat { \mu } }^{ 2 }=\frac { { \sigma }^{ 2 } }{ n } $$

Therefore, the sample mean will have a standard deviation that decreases with the square root of \(n\). In order to reduce the standard deviation of the mean by a factor of 5, we’d need 25 times as many data points. In order to reduce the standard deviation of the mean by a factor of 8, we’d need 64 times as many data points. All this makes use of the square root rule for \(i.i.d\) variables.
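The square-root rule can be checked with a short simulation. The sketch below assumes the DGP described earlier (a uniform variable between 0 and 1,000) and uses only Python's standard library; the function names, the seed, and the number of repetitions are illustrative choices, not part of the reading.

```python
import math
import random
import statistics


def theoretical_std_of_mean(sigma, n):
    """Standard deviation of the mean of n i.i.d. draws: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def simulate_sample_means(n, repetitions, seed=42):
    """Draw `repetitions` samples of size n from Uniform(0, 1000); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.uniform(0.0, 1000.0) for _ in range(n))
        for _ in range(repetitions)
    ]


# A Uniform(0, 1000) variable has standard deviation 1000 / sqrt(12).
SIGMA = 1000.0 / math.sqrt(12.0)

means_200 = simulate_sample_means(n=200, repetitions=2000)
print("average of the sample means:", round(statistics.fmean(means_200), 2))
print("simulated std of the mean:  ", round(statistics.stdev(means_200), 2))
print("sigma / sqrt(n):            ", round(theoretical_std_of_mean(SIGMA, 200), 2))
```

The simulated standard deviation of the 2,000 sample means should land close to the theoretical value of roughly 20.4, and the average of the sample means close to 500.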

We can also rewrite the sample mean formula as:

$$ \hat { \mu } =\frac { 1 }{ n } \sum _{ i=1 }^{ n }{ { x }_{ i } } =\sum _{ i=1 }^{ n }{ \frac { 1 }{ n } { x }_{ i } } $$

If each term \( \left(\frac{ 1 }{ n }\right) { x }_{ i }\) is treated as a random variable in its own right, then the sample mean is the sum of \(n\) \(i.i.d\) random variables, each with a mean of \(\frac { \mu }{ n } \) and a standard deviation of \(\frac { \sigma }{ n } \).

By the central limit theorem, the distribution of the sample mean therefore converges to a normal distribution as \(n\) grows.

Sample Variance

The sample variance is also a random variable, just like the sample mean. As with the sample mean, if samples are drawn from the DGP repeatedly and the sample variance is computed each time, the expected value of the sample variance equals the true variance. The variance of the sample variance estimator is:

$$ E\left[ { \left( { \hat { \sigma } }^{ 2 }-{ \sigma }^{ 2 } \right) }^{ 2 } \right] ={ \sigma }^{ 4 }\left( \frac { 2 }{ n-1 } +\frac { { K }_{ ex } }{ n } \right) $$

Here \(n\) is the size of the sample and \({ K }_{ ex }\) is the excess kurtosis. As far as the shape of the distribution is concerned, if the DGP is normally distributed, the scaled estimator follows a chi-squared distribution with \(\left( n-1 \right) \) degrees of freedom:

$$ \left( n-1 \right) \frac { { \hat { \sigma } }^{ 2 } }{ { \sigma }^{ 2 } } \sim { \chi }_{ n-1 }^{ 2 } $$

The sample variance is \({ \hat { \sigma } }^{ 2 }\), the population variance is \({ \sigma }^{ 2 }\), and \(n\) is the number of sample points. The above expression holds only if the DGP is normally distributed.
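These properties can be illustrated numerically. The following sketch assumes normally distributed data (so the excess kurtosis is zero and the variance formula reduces to \(2{ \sigma }^{ 4 }/\left( n-1 \right)\)); the sample size, seed, and function name are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_sample_variances(n, repetitions, sigma=1.0, seed=7):
    """Return sample variances (n - 1 denominator) of normal samples of size n."""
    rng = random.Random(seed)
    return [
        statistics.variance([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(repetitions)
    ]


n, sigma = 10, 1.0
variances = simulate_sample_variances(n, repetitions=5000, sigma=sigma)

# Unbiasedness: the average sample variance should be close to sigma^2.
mean_of_variances = statistics.fmean(variances)

# Variance of the estimator: with zero excess kurtosis the formula
# reduces to 2 * sigma^4 / (n - 1).
variance_of_variances = statistics.variance(variances)
theoretical = 2.0 * sigma**4 / (n - 1)

# Chi-squared check: (n - 1) * s^2 / sigma^2 should average n - 1.
scaled = [(n - 1) * v / sigma**2 for v in variances]

print("mean of sample variances:", round(mean_of_variances, 3))
print("variance of sample variances:", round(variance_of_variances, 3),
      "vs theory", round(theoretical, 3))
print("mean of scaled statistic:", round(statistics.fmean(scaled), 3))
```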

Confidence Intervals

Suppose we first apply the sample standard deviation to standardize the sample mean estimate. This will yield a random variable with \(\left( n-1 \right) \) degrees of freedom that follows the student’s \(t\) distribution:

$$ t=\frac { \hat { \mu } -\mu }{ { \hat { \sigma } }/{ \sqrt { n } } } $$

If both the numerator and the denominator are divided by the population standard deviation, the numerator becomes a standard normal variable and the denominator becomes the square root of a chi-squared variable scaled by an appropriate constant. Such a ratio is known to follow the Student’s \(t\) distribution. This standardized version of the sample mean is called the \(t\)–statistic.

Strictly, this result requires the underlying data to be normally distributed. For large samples, the \(t\)–statistic still converges to a \(t\) distribution even when the data are non-normal. If the sample is small and the data are non-normal, however, the \(t\)–statistic may not be well approximated by the \(t\) distribution.

We can determine the probability that the \(t\)–statistic falls within a given range by looking up the appropriate values of the \(t\) distribution:

$$ P\left[ { x }_{ L }\le \frac { \hat { \mu } -\mu }{ { \hat { \sigma } }/{ \sqrt { n } } } \le { x }_{ U } \right] =\gamma $$

The constants \({ x }_{ L }\) and \({ x }_{ U }\) define the lower and upper bounds of the range, and \(\gamma\) is the probability that the \(t\)-statistic falls within that range. \(\gamma\) is called the confidence level. In practice, the quantity \(\left( 1-\gamma \right) \), called the significance level and denoted \(\alpha\), is often quoted instead of \(\gamma\). Since \(\alpha =1-\gamma\), a higher confidence level means a lower significance level.

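Rearranging the inequality inside the probability to isolate \(\mu\) gives:

$$ P\left[ \hat { \mu } -\frac { { x }_{ U }\hat { \sigma } }{ \sqrt { n } } \le \mu \le \hat { \mu } -\frac { { x }_{ L }\hat { \sigma } }{ \sqrt { n } } \right] =\gamma $$

Because the \(t\) distribution is symmetric about zero, a symmetric range has \({ x }_{ L }=-{ x }_{ U }\), so the bounds become \(\hat { \mu } \pm { x }_{ U }\hat { \sigma } /\sqrt { n } \). This range is known as the confidence interval for the population mean.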

For example, we may have a 95% confidence interval of the mean μ given as [15.5, 16.2]. This implies that, in the long run, 95% of the realizations of such intervals will include μ and 5% of the realizations will not include μ.

Confidence intervals are usually constructed by adding or subtracting an appropriate value from the point estimate. In general, confidence intervals take on the following form:

Point Estimate ± Reliability factor × Standard error

The Z-test

The reliability factor is either a z-score or a t-score, depending on whether or not the population variance is known. If the population has a normal distribution with a known variance, a confidence interval for the population mean can be calculated as:

X̄ ± Z𝛼/2 × σ/√n

The reliability factor here is simply the z-score that leaves a probability of α/2 in each tail of the normal distribution.

[Figure: normal distribution with both tails shaded]

Each of the two shaded tails has a probability equal to the alpha level divided by two (α/2). Luckily, you don’t have to look up the z-scores every time:

Confidence level Alpha Alpha/2 Z alpha/2
90% 10% 5% 1.645
95% 5% 2.50% 1.96
99% 1% 0.50% 2.576
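The z-scores in the table can be reproduced with the standard normal quantile function. The sketch below uses `statistics.NormalDist` from Python's standard library; the helper names and the inputs to the interval example (a sample mean of 50, a known σ of 10, and n = 100) are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def z_reliability_factor(confidence_level):
    """z-score that leaves alpha/2 in each tail of the standard normal distribution."""
    alpha = 1.0 - confidence_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def z_confidence_interval(sample_mean, sigma, n, confidence_level):
    """Point estimate +/- reliability factor * standard error, with known sigma."""
    half_width = z_reliability_factor(confidence_level) * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_reliability_factor(level):.3f}")

# Hypothetical inputs: sample mean 50, known sigma 10, n = 100.
lower, upper = z_confidence_interval(50.0, 10.0, 100, 0.95)
print(f"95% confidence interval: [{lower:.2f}, {upper:.2f}]")
```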

The t-test

If the population has a normal distribution with an unknown variance, a confidence interval for the population mean can be calculated as:

X̄ ± t𝛼/2 × s/√n

where s is the sample standard deviation.

The reliability factor here is simply the t-score that leaves a probability of α/2 in each tail of the t-distribution. Furthermore, when looking up t-scores in statistical tables, the degrees of freedom (n-1) must be taken into account.

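Summary: Criteria for Selecting Appropriate Reliability Factor

Reliability factor when sampling from a:

  • Normal distribution with known variance: z-statistic for a small sample (n < 30); z-statistic for a large sample (n ≥ 30).
  • Normal distribution with unknown variance: t-statistic for a small sample (n < 30); t-statistic for a large sample (n ≥ 30).
  • Nonnormal distribution with known variance: not available for a small sample (n < 30); z-statistic for a large sample (n ≥ 30).
  • Nonnormal distribution with unknown variance: not available for a small sample (n < 30); t-statistic for a large sample (n ≥ 30).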

Note: When the sample size is less than 30, the underlying variable must be assumed to follow a normal distribution; otherwise, a confidence interval cannot be constructed with these reliability factors.

Hypothesis Testing

Hypothesis testing comes into play when we want to assess whether a population mean exceeds a certain value, say, \(y\). Traditionally, the question is framed as a null hypothesis.

Suppose we wish to establish whether a portfolio manager’s expected return exceeds 20%. The null hypothesis, denoted by \({ H }_{ 0 }\), can be written as:

$$ { H }_{ 0 }:{ \mu }_{ r }>20\% $$

For the purposes of the test, the population mean is assumed to be 20%. The appropriate \(t\)-statistic is then:

$$ t=\frac { \hat { \mu } -20\% }{ { \hat { \sigma } }/{ \sqrt { n } } } $$

Given that the true population mean is 20%, the probability of observing a particular sample mean can be looked up from the \(t\) distribution.
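As a small numerical illustration of the test statistic, the sketch below computes the \(t\)-statistic for a hypothetical manager. The inputs (16 annual returns with a sample mean of 25% and a sample standard deviation of 10%) are invented for the example, and the function name is an arbitrary choice.

```python
import math


def t_statistic(sample_mean, hypothesized_mean, sample_std, n):
    """(sample mean - hypothesized mean) / (sample std / sqrt(n))."""
    standard_error = sample_std / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Hypothetical inputs: 16 annual returns averaging 25% with a 10% sample std.
n_obs = 16
t = t_statistic(sample_mean=0.25, hypothesized_mean=0.20, sample_std=0.10, n=n_obs)
print(f"t-statistic with {n_obs - 1} degrees of freedom: {t:.2f}")
```

The resulting value would then be compared with the critical value of the \(t\) distribution with \(n-1\) degrees of freedom at the chosen significance level.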

Furthermore, an alternative hypothesis could be offered where the expected return could be less than or equal to 20%:

$$ { H }_{ 1 }:{ \mu }_{ r }\le 20\% $$

In principle, any number of hypotheses could be tested. In practice, when the alternative is simply the complement of the null, it is common to state only the null hypothesis.

Which Way to Test?

Most practitioners construct the null hypothesis so that the outcome they hope to demonstrate is assumed to be false. It is the hypothesis the researcher wants to reject, and it is generally a simple statement that describes the “status quo.” Typical statements include:

$$ { H }_{ 0 }:\mu ={ \mu }_{ 0 } $$

$$ { H }_{ 0 }:\mu \le { \mu }_{ 0 } $$

$$ { H }_{ 0 }:\mu \ge{ \mu }_{ 0 } $$

The alternative hypothesis, denoted as \({ H }_{ 1 }\), is what the researcher concludes when there is sufficient evidence to reject the null hypothesis. Note that a test never proves the null hypothesis true; we either reject it or fail to reject it.

Hypothesis testing is about establishing whether there is enough evidence to reject the status quo.

One Tail or Two Tails?

A two-tailed test is written as follows:

$$ { H }_{ 0 }:\mu =0 $$

$$ { H }_{ 1 }:\mu \neq 0 $$

Under \({ H }_{ 1 }\), the null hypothesis is rejected if the test statistic is either sufficiently positive or sufficiently negative. A two-tailed test is used when the researcher is interested in deviations on both sides of the distribution.

Any of the following pairs of hypotheses represents a one-tailed test:

$$ { H }_{ 0 }:\mu > c $$

$$ { H }_{ 1 }:\mu \le c $$


$$ { H }_{ 0 }:\mu = c $$

$$ { H }_{ 1 }:\mu < c $$


$$ { H }_{ 0 }:\mu = c $$

$$ { H }_{ 1 }:\mu > c $$

A one-tailed test is appropriate when deviations in only one direction are of concern. If deviations in both directions (positive and negative) matter, we use a two-tailed test. In both risk management and the sciences, 95% is the most widely applied confidence level.


Statistical Errors

Type I error occurs when we reject a true null hypothesis.

Example: Rejecting H0:μ=0 when μ is, in fact, equal to zero.

Type II error is the failure to reject a false null hypothesis.

Example: Failure to reject H0:μ=0 when μ is in fact NOT equal to zero.

The level of significance, α, represents the probability of making a type I error, i.e., rejecting the null hypothesis when it’s in fact true. We use α to determine critical values that subdivide the distribution into the rejection and the non-rejection regions.
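The link between α and the type I error rate can be seen in a simulation. The sketch below assumes normally distributed data with a known standard deviation of 1, so a two-tailed z-test applies; the sample size, number of trials, seed, and function name are illustrative assumptions. Because every sample is generated with μ = 0, each rejection is a type I error.

```python
import math
import random
from statistics import NormalDist, fmean


def type_one_error_rate(alpha, n, trials, seed=11):
    """Fraction of two-tailed z-tests that reject H0: mu = 0 when H0 is true."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(trials):
        # Data are generated with mu = 0 and known sigma = 1, so H0 holds.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / math.sqrt(n))
        if abs(z) > critical:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate(alpha=0.05, n=30, trials=4000)
print(f"share of true nulls rejected: {rate:.3f}")
```

The observed rejection share should be close to the chosen significance level of 5%.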

Chebyshev’s Inequality

According to Chebyshev’s inequality, for a random variable \(X\) with mean \(\mu\) and standard deviation \(\sigma\), the following holds for any \(n>0\):

$$ P\left( \left| X-\mu \right| \ge n\sigma \right) \le \frac { 1 }{ { n }^{ 2 } } $$

The inequality places an upper bound on the probability that \(X\) lies \(n\) or more standard deviations away from \(\mu\), whatever the distribution of \(X\).
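To see how conservative the bound is, the sketch below compares it with the exact two-sided tail probability of a normal distribution. The normal comparison is an added illustration rather than part of the inequality itself, and the function names are arbitrary.

```python
from statistics import NormalDist


def chebyshev_bound(n_std):
    """Upper bound on P(|X - mu| >= n_std * sigma) for any finite-variance X."""
    return 1.0 / n_std**2


def normal_two_sided_tail(n_std):
    """Exact P(|X - mu| >= n_std * sigma) when X is normally distributed."""
    return 2.0 * (1.0 - NormalDist().cdf(n_std))


for k in (2, 3, 4):
    print(f"{k} std: bound = {chebyshev_bound(k):.4f}, "
          f"normal tail = {normal_two_sided_tail(k):.4f}")
```

For a normal variable the true tail probability is far below the bound, which reflects the fact that Chebyshev’s inequality must hold for every distribution.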

Application: VaR

Before formally defining the Value at Risk (VaR), we first define a random variable \(L\) that represents the loss to our portfolio; it is simply the negative of the return to our portfolio. The VaR is then defined as follows:

$$ P\left[ L\ge { VaR }_{ \gamma } \right] =1-\gamma \quad \quad \text{(equation I)} $$

where \(\gamma\) is a particular confidence level.

The VaR can also be defined in terms of returns. Multiplying both sides of the inequality inside the probability by -1 reverses its direction, and replacing \(-L\) with the return \(R\) gives:

$$ P\left[ R\le { -VaR }_{ \gamma } \right] =1-\gamma \quad \quad \text{(equation II)} $$

These two equations are equivalent.


Choosing an appropriate confidence level has always been a challenge when applying VaR. Newcomers tend to choose a very high confidence level because it appears conservative.

Responsible risk managers are supposed to backtest their models regularly. In backtesting, we evaluate the predicted outcome of a model against actual data. Backtesting of any model parameter is possible.

The VaR is one such model that can be easily backtested. When a VaR model is assessed, each period is treated as a Bernoulli trial: either an exceedance occurs or it does not. If exceedance events are independent, the number of exceedances \(K\) over a timeframe of \(n\) days follows a binomial distribution:

$$ P\left[ K=k \right] =\left( \begin{matrix} n \\ k \end{matrix} \right) { p }^{ k }{ \left( 1-p \right) }^{ n-k } $$

where \(n\) is the number of periods used in the backtest, \(k\) is the number of exceedances, \(p\) is the probability of an exceedance in any single period, and \(\left( 1-p \right) \) is the confidence level.



Assume that the value at risk (VaR) of a portfolio, at a 95% confidence level, is $10 million. Within a 200-day trading period, daily losses exceed $10 million 18 times. Is this VaR model effective?

With a 95% confidence level, we expect exceptions (i.e., losses exceeding $10 million) 5% of the time, which is about 10 exceedances in 200 days. Since losses exceeding $10 million occurred 18 times during the 200-day period, exceptions occurred 9% of the time, and the model is, therefore, underestimating risk.
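The binomial formula makes it possible to quantify how unusual 18 exceedances would be if the model were correct. The sketch below assumes independent exceedances with a daily probability of 5%; the function and variable names are illustrative.

```python
from math import comb


def exceedance_pmf(k, n, p):
    """P[K = k]: probability of exactly k exceedances in n independent periods."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def prob_at_least(k, n, p):
    """P[K >= k], obtained by summing the binomial probabilities from k to n."""
    return sum(exceedance_pmf(j, n, p) for j in range(k, n + 1))


n_days, p_exceed, observed = 200, 0.05, 18

expected = n_days * p_exceed
tail_probability = prob_at_least(observed, n_days, p_exceed)

print(f"expected exceedances: {expected:.0f}")
print(f"observed exceedance rate: {observed / n_days:.1%}")
print(f"P[K >= {observed}] if the model is correct: {tail_probability:.4f}")
```

A small tail probability indicates that 18 or more exceedances would be unlikely under a correctly calibrated 95% VaR, which supports the conclusion that the model underestimates risk.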


The simplicity of VaR is what makes it appealing. Because VaR can be computed for any portfolio, the risks of different portfolios are easily compared. VaR also makes it convenient to track the risk of a particular portfolio over time, since it summarizes that risk in a single number.

However, financial institutions have been accused of being overly reliant on VaR, and VaR has been criticized for not being a subadditive risk measure. Subadditivity is a property that any sensible risk measure should possess.

Supposing that we are given a function \(f\) to be our risk measure. The function takes a random variable representing an asset or a portfolio of assets as its input. Greater risks are associated with higher values of the risk measure.

The following is the condition for subadditivity of function \(f\), given that \(X\) and \(Y\) are two risky portfolios:

$$ f\left( X+Y \right) \le f\left( X \right)+f\left( Y \right) $$

Subadditivity implies that the risk of a combined portfolio can be no greater than the sum of the risks of its components. The quantile-based VaR measure does not always satisfy this condition.
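A standard way to see the failure is with two independent positions that each default rarely. The sketch below uses a hypothetical bond that loses 100 with probability 4% and nothing otherwise, and it assumes a discrete convention for VaR (the smallest loss level \(v\) such that \(P\left[ L>v \right] \le 1-\gamma\)); both the numbers and the helper functions are illustrative assumptions.

```python
from itertools import product


def discrete_var(distribution, gamma):
    """Smallest loss level v with P[L > v] <= 1 - gamma.

    `distribution` maps each possible loss to its probability.
    """
    for level in sorted(distribution):
        tail = sum(p for loss, p in distribution.items() if loss > level)
        if tail <= 1.0 - gamma:
            return level
    raise ValueError("distribution is empty")


def combine_independent(dist_x, dist_y):
    """Loss distribution of X + Y when X and Y are independent."""
    combined = {}
    for (loss_x, p_x), (loss_y, p_y) in product(dist_x.items(), dist_y.items()):
        total = loss_x + loss_y
        combined[total] = combined.get(total, 0.0) + p_x * p_y
    return combined


# Hypothetical bond: loses 100 with probability 4%, otherwise loses nothing.
bond = {0.0: 0.96, 100.0: 0.04}
portfolio = combine_independent(bond, bond)

var_single = discrete_var(bond, gamma=0.95)
var_portfolio = discrete_var(portfolio, gamma=0.95)
print("95% VaR of one bond:", var_single)
print("95% VaR of two bonds:", var_portfolio)
```

Each bond on its own has a 95% VaR of zero because default is less likely than 5%. In the two-bond portfolio the probability of at least one default is 7.84%, so the 95% VaR is 100, which exceeds the sum of the individual VaRs.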

Expected Shortfall

The VaR model has also been criticized for giving no information about the tail of the distribution beyond the VaR level. Here, we are interested in the size of the loss when an exceedance occurs. The expected loss, given an exceedance, is defined as follows:

$$ E\left( L|L\ge { VaR }_{ \gamma } \right) =S $$

This conditional expected loss is referred to as the expected shortfall.

Suppose that the portfolio’s profit has a PDF \(f\left( x \right) \) and that \({ VaR }_{ \gamma }\) is the VaR at the \(\gamma\) confidence level. An exceedance corresponds to a profit of \(-{ VaR }_{ \gamma }\) or less, and the loss is the negative of the profit, so the expected shortfall is computed as follows:

$$ S=-\frac { 1 }{ 1-\gamma } \int _{ -\infty }^{ -{ VaR }_{ \gamma } }{ xf\left( x \right)dx } $$

Like the VaR, the expected shortfall is defined in terms of losses and therefore tends to be positive. Unlike the VaR, however, the expected shortfall has the property of subadditivity.
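For a sample of losses, the conditional expectation can be estimated by averaging the losses in the tail. The sketch below is one simple empirical convention, assumed here for illustration: the worst \(\left( 1-\gamma \right)\) share of observations forms the tail, the VaR is the smallest loss in that tail, and the expected shortfall is the tail average. The loss history is hypothetical.

```python
from statistics import fmean


def var_and_expected_shortfall(losses, gamma):
    """Empirical VaR and expected shortfall from a list of losses.

    The worst (1 - gamma) share of losses forms the tail. VaR is the smallest
    loss in that tail and expected shortfall is the tail average.
    """
    ordered = sorted(losses)
    tail_size = max(1, round((1.0 - gamma) * len(ordered)))
    tail = ordered[-tail_size:]
    return tail[0], fmean(tail)


# Hypothetical loss history: 100 equally likely losses of 1, 2, ..., 100.
losses = list(range(1, 101))
var_95, es_95 = var_and_expected_shortfall(losses, gamma=0.95)
print("95% VaR:", var_95)
print("95% expected shortfall:", es_95)
```

Here the tail consists of the five largest losses (96 to 100), so the VaR is 96 and the expected shortfall is their average, 98. The expected shortfall is never smaller than the VaR because it averages losses at or beyond it.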


An experiment was done to find out the number of hours that candidates spend preparing for FRM part 1 exam. It was discovered that for a sample of 10 students, the following times were spent:

318, 304, 317, 305, 309, 307, 316, 309, 315, 327

If the sample mean and standard deviation are 312.7 and 7.2 respectively, calculate a symmetrical 95% confidence interval for the mean time a candidate spends preparing for the exam using the t-table.


  A. [307.5, 317.9]
  B. [307.6, 317.8]
  C. [307.9, 317.5]
  D. [307.3, 318.2]

The correct answer is A.

Population variance is unknown; we must use the t-score.

Our t𝛼/2 value is found from the t-table using (10 – 1 =) 9 degrees of freedom and a cumulative probability of (1 – 0.025 =) 0.975, which gives 2.262.

So the confidence interval is given by:

X̄ ± t𝛼/2 × s/√n

= 312.7 ± 2.262 × 7.2/√10 = [307.5, 317.9]
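The arithmetic can be checked with a few lines of code. The sketch below first confirms that the data reproduce the stated sample mean and standard deviation to one decimal place, then builds the interval from the rounded inputs given in the question. The reliability factor 2.262 is taken from the t-table as above, since Python's standard library has no t-distribution quantile function.

```python
import math
import statistics

hours = [318, 304, 317, 305, 309, 307, 316, 309, 315, 327]
n = len(hours)

# Check the summary statistics stated in the question (n - 1 denominator).
data_mean = statistics.fmean(hours)
data_std = statistics.stdev(hours)
print(f"mean = {data_mean:.1f}, std = {data_std:.1f}")

# Inputs as given in the question, plus the t-table value for 9 degrees
# of freedom at the 0.975 cumulative probability.
SAMPLE_MEAN, SAMPLE_STD, T_CRITICAL = 312.7, 7.2, 2.262

half_width = T_CRITICAL * SAMPLE_STD / math.sqrt(n)
lower, upper = SAMPLE_MEAN - half_width, SAMPLE_MEAN + half_width
print(f"95% confidence interval = [{lower:.1f}, {upper:.1f}]")
```

Note that carrying the unrounded standard deviation through the calculation would narrow the interval slightly, which is why the rounded value of 7.2 from the question is used here.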

