
Contingency Tables

01 Nov 2022

A contingency table is a tabular representation of category-based data. It shows the frequencies for particular combinations of values for two discrete random variables, say X and Y. Each cell in the table represents a mutually exclusive combination of X-Y values. A contingency table for two category-based variables is also known as a two-way table.

Example: Contingency Table

The following contingency table shows a hypothetical frequency distribution of women’s education-level preferences in three countries, based on a random sample of 120 women:

$$
\begin{array}{l|c|c|c|c}
\begin{array}{l}
\textbf { Education } \\
\textbf { level }
\end{array} & \textbf { Kenya } & \textbf { Uganda } & \textbf { Tanzania } & \textbf { Total } \\
\hline \begin{array}{l}
\text { Middle school } \\
\text { or lower }
\end{array} & 5 & 5 & 30 & 40 \\
\hline \text { High school } & 5 & 25 & 5 & 35 \\
\hline \text { Bachelor’s } & 15 & 5 & 5 & 25 \\
\hline \text { Master’s } & 15 & 5 & 0 & 20 \\
\hline \text { Total } & 40 & 40 & 40 & 120 \\
\end{array}
$$

The table indicates that ‘Middle school or lower’ is the dominant level of education in Tanzania, while ‘High school’ is dominant in Uganda. In Kenya, women in the sample most often reach a ‘Bachelor’s’ or ‘Master’s’ degree level. We can also see that no Tanzanian woman in the sample has a ‘Master’s’ degree.

Joint frequency is the number of times a combination of two conditions happens together. For example, ‘Kenya’ and ‘Middle school or lower’ have a joint frequency of 5.

The sums of the joint frequencies across rows or down columns are called marginal frequencies. For example, the marginal frequency of ‘Bachelor’s’ degree is the sum of the joint frequencies across all three countries, that is, 25 (= 15 + 5 + 5). Among the education levels, ‘Middle school or lower’ (40) and ‘High school’ (35) have the largest marginal frequencies.
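As a minimal sketch, the joint and marginal frequencies from the education example can be reproduced in Python. The variable names (`table`, `row_totals`, `col_totals`) are illustrative assumptions, not part of the original example.

```python
# Joint frequencies from the education-level example.
# Rows: education level; columns: country. Names are illustrative.
countries = ["Kenya", "Uganda", "Tanzania"]
table = {
    "Middle school or lower": [5, 5, 30],
    "High school": [5, 25, 5],
    "Bachelor's": [15, 5, 5],
    "Master's": [15, 5, 0],
}

# Marginal frequency of each education level: sum across the row.
row_totals = {level: sum(counts) for level, counts in table.items()}

# Marginal frequency of each country: sum down the column.
col_totals = {
    country: sum(counts[j] for counts in table.values())
    for j, country in enumerate(countries)
}

# Overall total: every observation counted once.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The row totals and the column totals each add up to the same overall total, which is a quick consistency check on any contingency table.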

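We can also create contingency tables by using relative frequencies. Relative frequencies can be expressed as a percentage of the overall total, as in the first table below; for example, the relative frequency of ‘High school’ in Uganda is \(\frac{25}{120} \approx 21\%\). They can also be expressed as a percentage of a marginal total, as in the second table below, where each joint frequency is divided by its country’s total; for example, ‘High school’ in Uganda is \(\frac{25}{40} \approx 63\%\).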

$$\begin{aligned} &\textbf{Relative Frequency as % of total}\\
&\begin{array}{l|c|c|c|c}
\begin{array}{l}
\textbf { Education } \\
\textbf { Level }
\end{array} & \textbf { Kenya } & \textbf { Uganda } & \textbf { Tanzania } & \textbf { Total } \\
\hline \begin{array}{l}
\text { Middle school or } \\
\text { lower }
\end{array} & 4 \% & 4 \% & 25 \% & \mathbf{3 3} \% \\
\hline \text { High school } & 4 \% & 21 \% & 4 \% & \mathbf{2 9} \% \\
\hline \text { Bachelor’s } & 13 \% & 4 \% & 4 \% & \mathbf{2 1} \% \\
\hline \text { Master’s } & 13 \% & 4 \% & 0 \% & \mathbf{1 7} \% \\
\hline \textbf { Total } & \mathbf{3 3 \%} & \mathbf{3 3} \% & \mathbf{3 3 \%} & \mathbf{1 0 0} \% \\
\end{array} \end{aligned}$$

$$\begin{aligned}&\textbf{Relative Frequency: Frequency Distribution in the Region as % of}\\&\textbf{Total Frequency (Preference)}\\
&\begin{array}{l|c|c|c|c}
\textbf { Education Level } & \textbf { Kenya } & \textbf { Uganda } & \textbf { Tanzania } & \textbf { Total } \\
\hline \text { Middle school or lower } & 13 \% & 13 \% & 75 \% & \mathbf{3 3} \% \\
\hline \text { High school } & 13 \% & 63 \% & 13 \% & \mathbf{2 9} \% \\
\hline \text { Bachelor’s } & 38 \% & 13 \% & 13 \% & \mathbf{2 1} \% \\
\hline \text { Master’s } & 38 \% & 13 \% & 0 \% & \mathbf{1 7} \% \\
\hline \textbf { Total } & \mathbf{1 0 0} \% & \mathbf{1 0 0} \% & \mathbf{1 0 0} \% & \mathbf{1 0 0} \% \\
\end{array}\end{aligned}
$$
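The two relative frequency tables can be sketched in Python as follows. This is an illustration only: the helper `round_half_up` and the variable names are assumptions. Half-up rounding is used so that values such as 12.5% display as 13%, as they do in the tables.

```python
import math

# Joint frequencies (rows: education level; columns: Kenya, Uganda, Tanzania).
counts = [
    [5, 5, 30],   # Middle school or lower
    [5, 25, 5],   # High school
    [15, 5, 5],   # Bachelor's
    [15, 5, 0],   # Master's
]


def round_half_up(x):
    """Round a non-negative number to the nearest integer, with .5 rounding up."""
    return math.floor(x + 0.5)


grand_total = sum(sum(row) for row in counts)
col_totals = [sum(col) for col in zip(*counts)]

# Each joint frequency as a percentage of the overall total.
pct_of_total = [
    [round_half_up(100 * c / grand_total) for c in row] for row in counts
]

# Each joint frequency as a percentage of its country (column) total.
pct_of_column = [
    [round_half_up(100 * c / col_totals[j]) for j, c in enumerate(row)]
    for row in counts
]

for row in pct_of_total:
    print(row)
for row in pct_of_column:
    print(row)
```

Because each cell is rounded separately, the rounded percentages in a column need not sum to exactly 100.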

Confusion Matrix

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions is summarized with count values and broken down by each class. A confusion matrix is one of the applications of a contingency table.

Elements of a Confusion Matrix

A confusion matrix represents different combinations of actual versus predicted values.

True Positive (TP): The values which were actually positive and were predicted as such.
False Positive (FP): The values which were actually negative but were falsely predicted as positive. These values are also known as Type I errors.
False Negative (FN): The values which were actually positive but were falsely predicted as negative. These values are also known as Type II errors.
True Negative (TN): The values which were actually negative and were predicted as such.
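To make the four elements concrete, the sketch below counts them from paired actual and predicted labels. The function name `confusion_counts` and the six toy labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for paired boolean labels (True = positive)."""
    counts = Counter({"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1      # actually positive, predicted positive
        elif not a and p:
            counts["FP"] += 1      # actually negative, predicted positive (Type I)
        elif a and not p:
            counts["FN"] += 1      # actually positive, predicted negative (Type II)
        else:
            counts["TN"] += 1      # actually negative, predicted negative
    return counts


# Hypothetical toy labels, not taken from the article.
actual = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]

result = confusion_counts(actual, predicted)
print(dict(result))
```

Every record falls into exactly one of the four cells, so the four counts always sum to the number of records.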

Example: Confusion Matrix

Let us consider a model that predicts whether the stock market will record a negative return of 20% or more (a market crash). Assume that our dataset contains 1,000 records, each with an actual outcome and a predicted outcome. Refer to the following confusion matrix:

$$
\begin{array}{l|c|c|c}
\textbf{Stock market negative return of } 20\% \textbf{ or more} & \textbf{Actual: YES} & \textbf{Actual: NO} & \textbf{Total} \\
\hline
\textbf{Predicted: YES} & 460 & 200 & 660 \\
\hline
\textbf{Predicted: NO} & 150 & 190 & 340 \\
\end{array}
$$

In the above matrix, we can make the following deductions:

True positive: 460 records with an actual negative return of 20% or more were correctly predicted as a market crash.
False positive: 200 records without a negative return of 20% or more were wrongly predicted as a market crash.
False negative: 150 records with an actual negative return of 20% or more were wrongly predicted as not being a market crash.
True negative: 190 records without a negative return of 20% or more were correctly predicted as not being a market crash.

A contingency table can also be used to investigate the potential association between two category-based variables. We can test the association between category-based variables by performing a chi-square test of independence. This involves the following steps:

Use the marginal frequencies in the contingency table to construct a table of expected values of the observations.
Use the actual values and expected values to derive the chi-square test statistic.
Compare the test statistic to a value from the chi-square distribution for a given level of significance.
If the test statistic is larger than the chi-square distribution value, we reject the hypothesis that the two categorical variables are independent and conclude that a significant association exists between them. If it is smaller, we cannot reject independence, so there is no evidence of a significant association.
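For reference, the standard form of the chi-square test statistic used in these steps can be written as follows. The symbols are notation introduced here for illustration: \(O_{i,j}\) is the observed (actual) frequency in row \(i\) and column \(j\), \(E_{i,j}\) is the corresponding expected value, and \(r\) and \(c\) are the numbers of rows and columns.

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left(O_{i,j} - E_{i,j}\right)^2}{E_{i,j}}$$

The statistic is compared with a chi-square distribution with \((r - 1)(c - 1)\) degrees of freedom.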

The following example describes how a contingency table is used to set up this test of independence.

Example: Contingency Tables and Association between Two Categorical Variables

We have 200 bonds, and we can classify them in two ways: by issuer, either a corporation or a financial institution, and by risk level, either low risk or high risk. The data are summarized in the 2 × 2 contingency table shown below.

$$
\begin{array}{l|c|c}
\textbf { Contingency Table } & & \\
\hline & \textbf { Low Risk } & \textbf { High Risk } \\
\hline \textbf { Bonds issued by a Corporation } & 27 & 57 \\
\hline \begin{array}{l}
\textbf { Bonds issued by a Financial } \\
\textbf { Institution }
\end{array} & 95 & 21 \\
\end{array}
$$

Questions

Calculate the number of corporate bonds and the number of financial institutions’ bonds out of the total bonds.
Calculate the number of low-risk and high-risk bonds out of the total bonds.
Describe how the contingency table is used to set up a test for independence between a bond’s issuer type and risk level.

The Solution to Question 1

The task is to calculate the marginal frequencies based on the type of issuer. To do this, we must add joint frequencies across the rows. Therefore, the marginal frequency for corporations is 27 + 57 = 84, and the marginal frequency for financial institutions is 95 + 21 = 116.

The Solution to Question 2

The task is to calculate the marginal frequencies based on the bond’s risk. We do this by adding joint frequencies down the columns. Therefore, the marginal frequency for low risk is 27 + 95 = 122, and the marginal frequency for high risk is 57 + 21 = 78.

The Solution to Question 3

Based on the procedure for conducting a chi-square test of independence, we would perform the following three steps:

Step 1: Add the marginal frequencies and overall total to the contingency table. We have also included the relative frequency table for observed values.

$$
\begin{array}{l|c|c|c|l|c|c|c}
& \textbf { Low Risk } & \textbf { High Risk } & & & \begin{array}{c}
\textbf { Low } \\
\textbf { Risk }
\end{array} & \begin{array}{c}
\textbf { High } \\
\textbf { Risk }
\end{array} & \\
\hline \begin{array}{l}
\text { Bonds issued by } \\
\text { Corporations }
\end{array} & 27 & 57 & 84 & \begin{array}{l}
\text { Bonds issued } \\
\text { by Corporations }
\end{array} & 32 \% & 68 \% & 100 \% \\
\hline \begin{array}{l}
\text { Bonds issued by } \\
\text { Financial } \\
\text { Institutions }
\end{array} & 95 & 21 & 116 & \begin{array}{l}
\text { Bonds issued } \\
\text { by Financial } \\
\text { Institutions }
\end{array} & 82 \% & 18 \% & 100 \% \\
\hline & 122 & 78 & 200 & & & & \\
\end{array}
$$

Step 2: Use the marginal frequencies in the contingency table to construct a table with the expected values of the observations. To determine the expected value for each cell, multiply the respective row total by the respective column total, then divide by the overall total. So, for the cell in the \(i^{\text{th}}\) row and \(j^{\text{th}}\) column:

$$\text{Expected Value}_{i,j} = \frac{\text{Total Row } i \times \text{Total Column } j}{\text{Overall Total}}$$

For example,

The expected value for corporate bonds that are low risk is:

$$\frac{84 \times 122}{200} = 51.24$$

And the expected value for financial institutions’ bonds that are high risk is:

$$\frac{116 \times 78}{200} = 45.24$$

The table of expected values (and accompanying relative frequency table) would look like this:

$$
\begin{array}{l|c|c|c|l|c|c|c}
& \begin{array}{l}
\textbf { Low } \\
\textbf { Risk }
\end{array} & \begin{array}{l}
\textbf { High } \\
\textbf { Risk }
\end{array} & & & \begin{array}{l}
\textbf { Low } \\
\textbf { Risk }
\end{array} & \begin{array}{c}
\textbf { High } \\
\textbf { Risk }
\end{array} & \\
\hline \text { Bonds issued by Corporations } & 51.2 & 32.8 & 84 & \begin{array}{l}
\text { Bonds } \\
\text { issued by } \\
\text { Corporations }
\end{array} & 61 \% & 39 \% & 100 \% \\
\hline \begin{array}{l}
\text { Bonds issued by Financial } \\
\text { Institutions }
\end{array} & 70.8 & 45.2 & 116 & \begin{array}{l}
\text { Bonds } \\
\text { issued by } \\
\text { Financial } \\
\text { Institutions }
\end{array} & 61 \% & 39 \% & 100 \% \\
\end{array}
$$
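The expected-value table can be reproduced with a short Python sketch. The variable names are illustrative assumptions; the observed counts are those of the bond example.

```python
# Observed joint frequencies from the bond example.
# Rows: corporations, financial institutions; columns: low risk, high risk.
observed = [
    [27, 57],
    [95, 21],
]

row_totals = [sum(row) for row in observed]        # marginal totals by issuer
col_totals = [sum(col) for col in zip(*observed)]  # marginal totals by risk level
overall_total = sum(row_totals)

# Expected value for cell (i, j) = row total i * column total j / overall total.
expected = [
    [r * c / overall_total for c in col_totals]
    for r in row_totals
]

for row in expected:
    print([round(value, 2) for value in row])
```

Each row of expected values keeps the same low-risk to high-risk proportions (about 61% and 39%), which is what independence between issuer type and risk level would imply.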

Step 3: Derive the chi-square test statistic and then compare it to a value from the chi-square distribution for a given level of significance.

If the test statistic is greater than the chi-square distribution value, we reject the hypothesis of independence and conclude that there is evidence of a significant association between the categorical variables.
If the test statistic is less than the chi-square distribution value, we cannot reject the hypothesis of independence, so there is insufficient evidence of a significant association between the categorical variables.
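Step 3 can be sketched in Python for the bond example. The text does not state a significance level, so the 5% level used here is an assumption; the corresponding chi-square critical value for 1 degree of freedom is 3.841. Variable names are illustrative.

```python
# Observed joint frequencies from the bond example.
observed = [
    [27, 57],   # corporations: low risk, high risk
    [95, 21],   # financial institutions: low risk, high risk
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
overall_total = sum(row_totals)

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / overall_total
        chi_square += (obs - exp) ** 2 / exp

# Degrees of freedom: (rows - 1) * (columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# ASSUMPTION: 5% significance level (not given in the text).
# Chi-square critical value for 1 degree of freedom at 5% is 3.841.
critical_value = 3.841

reject_independence = chi_square > critical_value
print(round(chi_square, 2), degrees_of_freedom, reject_independence)
```

Under these assumptions, the statistic comes out at roughly 50.7, far above 3.841, so the hypothesis that issuer type and risk level are independent would be rejected.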

Question

The following contingency table shows the frequency of consumption of three common brands of bread in three geographic regions among a random sample of 100 consumers:

$$
\begin{array}{l|c|c|c}
\textbf { Brands } & \textbf { Region 1 } & \textbf { Region 2 } & \textbf { Region 3 } \\
\hline \mathrm{A} & 5 & 5 & 30 \\
\hline \mathrm{B} & 5 & 25 & 5 \\
\hline \mathrm{C} & 15 & 5 & 5 \\
\end{array}
$$

Which brand most likely has the highest marginal frequency?

Brand A.

Brand B.

Brand C.

Solution

The correct answer is A.

Note that a marginal frequency is the sum of the joint frequencies across a row or down a column of a contingency table. In this case, we sum across the rows (bread brands): Brand A has 5 + 5 + 30 = 40, Brand B has 5 + 25 + 5 = 35, and Brand C has 15 + 5 + 5 = 25. Brand A therefore has the highest marginal frequency, as the table below shows.

$$
\begin{array}{l|c|c|c|c}
\textbf { Brands } & \textbf { Region 1 } & \textbf { Region 2 } & \textbf { Region 3 } & \textbf { Total } \\
\hline \mathrm{A} & 5 & 5 & 30 & 40 \\
\hline \mathrm{B} & 5 & 25 & 5 & 35 \\
\hline \mathrm{C} & 15 & 5 & 5 & 25 \\
\end{array}
$$
