Regarding the chi-square test of independence, which statement is accurate? A. It is a parametric hypothesis test. B. It is used to test whether two categorical variables are related to each other. C. It is used to test whether two continuous variables are related to each other.

The correct answer is B. The chi-square test of independence is a non-parametric hypothesis test used to determine whether two categorical variables are related. Option A is incorrect because the test is non-parametric. Option C is incorrect because the chi-square test of independence is not used for continuous variables.

An analyst is studying the relationship between returns for two sectors, steel and cement, over the past 5 years using Spearman’s rank correlation coefficient. The returns are: Year 1: Steel 2.5%, Cement 3.2% Year 2: Steel 5.0%, Cement 4.5% Year 3: Steel 5.6%, Cement 4.2% Year 4: Steel −3.0%, Cement −1.7% Year 5: Steel 0.5%, Cement 1.1% The Spearman’s rank correlation coefficient is closest to: A. 0.5 B. 0.6 C. 0.8

The correct answer is C. Ranking both series from highest to lowest gives ∑d² = 2, so rS = 1 − [6∑d²]/[n(n²−1)] with n = 5 gives rS = 1 − (12/120) = 0.9. Of the choices offered, this is closest to 0.8.

Which of the following statements about the log-lin model is most likely correct? A. The dependent variable is linear, while the independent variable is logarithmic. B. Both the dependent and independent variables are logarithmic. C. The dependent variable is logarithmic, while the independent variable is linear.

The correct answer is C. In a log-lin model, the dependent variable is expressed in logarithmic form while the independent variable remains in its linear form. The model is written as ln(Y) = b0 + b1X. By contrast, a lin-log model has a linear dependent variable and a logarithmic independent variable, while a log-log model expresses both the dependent and independent variables in logarithmic form.

How are R² and the F-statistic calculated from an ANOVA table?

R² is calculated as the sum of squares for regression divided by the total sum of squares. Using the ANOVA table, R² equals 1,701,563 divided by 1,808,363, which is approximately 0.94 or 94%. The F-statistic is calculated as the mean regression sum of squares divided by the mean squared error. This equals 1,701,563 divided by 13,350, resulting in an F-statistic of approximately 127.

What does an R² value of 94% indicate in a regression model?

An R² value of 94% indicates that 94% of the total variation in the dependent variable is explained by the regression model. This suggests a strong explanatory power and a good overall fit of the model to the observed data.

What does a large F-statistic imply about the regression model?

A large F-statistic indicates that the regression model explains a significant portion of the variation in the dependent variable relative to unexplained variation. This provides strong evidence that the regression relationship is statistically significant.

Consider an ANOVA table where the regression sum of squares is 1,701,563, the total sum of squares is 1,808,363, the regression mean sum of squares is 1,701,563, and the mean squared error is 13,350. The value of R-squared and the F-statistic for the test of fit of the regression model are closest to: A. 6% and 16. B. 94% and 127. C. 99% and 127.

The correct answer is B. R-squared is calculated as SSR divided by SST, or 1,701,563 / 1,808,363 = 0.94, which equals 94%. The F-statistic is calculated as MSR divided by MSE, or 1,701,563 / 13,350 = 127.46, which is approximately 127.
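The arithmetic can be reproduced in a few lines. The sketch below uses only the figures quoted in the question; the variable names are illustrative. It also backs out the residual degrees of freedom implied by the table, which is a derived check rather than something stated in the question.

```python
# Figures taken from the ANOVA table quoted in the question above.
ssr = 1_701_563   # regression (explained) sum of squares
sst = 1_808_363   # total sum of squares
msr = 1_701_563   # mean regression sum of squares (one independent variable)
mse = 13_350      # mean squared error

r_squared = ssr / sst       # share of total variation explained
f_statistic = msr / mse     # explained variance relative to unexplained variance

# Derived check: SSE = SST - SSR, and SSE / MSE gives the residual degrees of freedom.
sse = sst - ssr
residual_df = sse / mse

print(f"R-squared   = {r_squared:.4f}")
print(f"F-statistic = {f_statistic:.2f}")
print(f"Residual df = {residual_df:.0f}")
```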

A regression model with one independent variable requires several assumptions for valid conclusions. Which of the following statements most likely violates those assumptions? A. The independent variable is random. B. The error term is distributed normally. C. There exists a linear relationship between the dependent variable and the independent variable.

The correct answer is A. In classical linear regression, the independent variable is assumed to be non-random, so a random independent variable violates the assumptions. This assumption ensures unbiased and consistent estimation of the regression coefficients. Normality of the error term and a linear relationship between the dependent and independent variables are standard assumptions that support valid hypothesis testing and model specification.

CFA Quantitative Methods | AnalystPrep

Data Presentation as a Histogram or a Frequency Polygon

Ngari Joseph — Thu, 21 Nov 2024 02:43:58 +0000

Histogram

A histogram shows the distribution of numerical data in the form of a graph. Although it looks very similar to a bar chart, a histogram groups data into intervals. To construct a histogram, you need to establish all the intervals of data, commonly known as bins. The intervals should capture all the data points and also be non-overlapping.

The intervals appear on the horizontal axis, while the absolute frequencies appear on the vertical axis. For a histogram with equal intervals in size, a rectangle should be erected over the interval, with its height being proportional to the absolute frequency. If intervals are unequal in size, the erected rectangle has an area proportional to the absolute frequency of that particular interval. We would have the vertical axis labeled as ‘density’ instead of frequency in such a case. There should be no space between bars to indicate that the intervals are continuous.
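The difference between equal and unequal intervals can be made concrete with a small calculation. The sketch below uses hypothetical interval edges and frequencies (they are not taken from the article) and computes the bar height as a density, so that each bar's area matches the absolute frequency of its interval.

```python
# Hypothetical interval edges and absolute frequencies (not from the article).
edges = [0, 10, 20, 40, 80]        # unequal widths: 10, 10, 20, 40
frequencies = [4, 6, 8, 2]

widths = [upper - lower for lower, upper in zip(edges[:-1], edges[1:])]

# With unequal widths, bar height is the density (frequency per unit of width),
# so that each bar's area equals the absolute frequency of its interval.
densities = [f / w for f, w in zip(frequencies, widths)]

for lower, upper, f, d in zip(edges[:-1], edges[1:], frequencies, densities):
    area = d * (upper - lower)
    print(f"[{lower}, {upper}): frequency={f}, height={d:.2f}, area={area:.0f}")
```

When all intervals share the same width, dividing by the width rescales every bar by the same factor, which is why plotting the absolute frequency directly works in that case.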


Example 1: Histogram

Consider the previous example of the returns offered by a stock. To bring you up to speed, these were the intervals and the corresponding frequencies:

$$ \begin{array}{c|c|c} \textbf{Interval} & \textbf{Tally} & \textbf{Frequency} \\ \hline -30\% \leq R_t \leq -20\% & \text{II} & \text{2} \\ -20\% \leq R_t \leq -10\% & \text{I} & \text{1} \\ -10\% \leq R_t \leq 0\% & \text{III} & \text{3} \\ 0\% \leq R_t \leq 10\% & \text{IIIIII} & \text{6} \\ 10\% \leq R_t \leq 20\% & \text{IIIIIII} & \text{7} \\ 20\% \leq R_t \leq 30\% & \text{IIIII} & \text{5} \\ 30\% \leq R_t \leq 40\% & \text{I} & \text{1} \\ \textbf{Total} & \text{} & \textbf{25} \\ \end{array} $$

Frequency Polygon

A frequency polygon is also used to represent the distribution of data graphically. However, it differs from the histogram in one major way. Instead of showing the class intervals on the horizontal axis with their upper and lower limits, a frequency polygon uses the midpoints of the class intervals.

$$ \text{Midpoint of a class interval} =\text {Lower limit} + \cfrac { (\text{Upper limit} - \text{Lower limit}) }{ 2 } $$

The vertical axis features the absolute frequencies. Each frequency is plotted as a marker above the midpoint of its interval, and the markers are then joined using straight lines.

Example 2: Frequency Polygon

Going back to the stock return data, we could come up with a frequency polygon.

To come up with the midpoints, we use the formula above. As an example, the midpoint of the interval -30% ≤ R_t ≤ -20% is:

$$ \text{Midpoint} = -30 + \cfrac {(-20 - (-30))}{2} = -25 $$

We can calculate the midpoints for the other intervals in a similar manner. The final frequency polygon should look like this:

The frequency polygon is important because it shows the shape of a distribution of data. It can also be very useful when comparing two sets of data side-by-side.

Note: The endpoints touch the X-axis. The vertical scale can also be positioned at the left margin.
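The midpoint calculation for every interval can be sketched as follows, using the interval limits and frequencies from the stock return table. Closing the polygon with a zero-frequency point one interval width beyond each end is an assumption made here to reflect the note that the endpoints touch the X-axis.

```python
# Interval limits (in %) and frequencies from the stock-return table above.
lower_limits = [-30, -20, -10, 0, 10, 20, 30]
upper_limits = [-20, -10, 0, 10, 20, 30, 40]
frequencies = [2, 1, 3, 6, 7, 5, 1]

# Midpoint = lower limit + (upper limit - lower limit) / 2
midpoints = [lo + (hi - lo) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Assumption: the polygon is closed by adding zero-frequency points one
# interval width beyond each end, so the endpoints touch the horizontal axis.
width = upper_limits[0] - lower_limits[0]
polygon = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, frequencies))
    + [(midpoints[-1] + width, 0)]
)

for x, y in polygon:
    print(f"midpoint {x:6.1f}%  frequency {y}")
```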


The post Data Presentation as a Histogram or a Frequency Polygon appeared first on AnalystPrep | CFA® Exam Study Notes.

Tests of Independence Using Contingency Table Data

Kajal — Sat, 26 Aug 2023 02:23:41 +0000

With categorical or discrete data, correlation is not suitable for assessing relationships between variables. Instead, we use a non-parametric test called the chi-square test of independence, which employs a chi-square distributed test statistic.

We employ a contingency table to structure the data when examining the connection between two categorical variables. Subsequently, we apply a test of independence utilizing a chi-square distribution to assess whether a noteworthy relationship exists between these variables. The test statistic is calculated as follows:

$$ \chi^2=\sum_{i=1}^{m}\frac{\left(O_{ij}-E_{ij}\right)^2}{E_{ij}} $$

Where:

$E_{ij}=\frac{\left(\text{Total row i}\right)\times\left(\text{Total column j}\right)}{\text{Overall Total}} $

$m$= Number of cells in the table, that is, the number of groups in the first class multiplied by the number of groups in the second class.

$O_{ij}$= Number of observations in each cell of row $i$ and column $j$ (i.e., observed frequency).

$E_{ij}$= Expected number of observations in each cell of row $i$ and column $j$, assuming independence (i.e., expected frequency).

The degrees of freedom are given by:

$$ \text{Degrees of freedom}=(r-1)(c-1) $$

Where:

$r$= Number of rows.

$c$= Number of columns.


Example: Testing Independence Based on Contingency Table Data

The following contingency table shows the responses of two categories of investors (employed vs. retired) with regard to their primary investment objectives (growth, income, or both). The total sample size is 173.

$$ \begin{array}{c|c|c|c|c}
& \textbf{Growth} & \textbf{Income} & \textbf{Both} & \textbf{Total} \\ \hline
\text{Employed} & 52 & 25 & 10 & \bf{87} \\ \hline
\text{Retired} & 32 & 47 & 7 & \bf{86} \\ \hline
\textbf{Total} & \bf{84} & \bf{72} & \bf{17} & \bf{173}
\end{array} $$

Use a 5% significance level to test whether there is any significant difference between employed and retired investors concerning primary investment objectives.

Solution

$H_0$: There is no significant difference between employed and retired investors with regard to primary investment objectives.

$H_a$: There is a significant difference between employed and retired investors with regard to primary investment objectives.

Step 1: We calculate the expected frequency of investors by their category (employed vs. retired) and investment objective using the following formula:

$$ E_{ij}=\frac{\left(\text{Total row i}\right)\times\left(\text{Total column j}\right)}{\text{Overall Total} } $$

$$ \begin{array}{c|c|c|c|c}
& \textbf{Growth} & \textbf{Income} & \textbf{Both} & \textbf{Total} \\ \hline
\text{Employed} & {\frac{\left(87\times84\right)}{173} =42.24} & {\frac{\left(87\times72\right)}{173} =36.21} & {\frac{\left(87\times17\right)}{173} =8.55} &\bf{87} \\ \hline
\text{Retired} & {\frac{\left(86\times84\right)}{173} =41.76} & {\frac{\left(86\times72\right)}{173} =35.79} & {\frac{\left(86\times17\right)}{173} =8.45} & \bf{86} \\ \hline
\textbf{Total} & \bf{84} & \bf{72} & \bf{17} & \bf{173}
\end{array} $$

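Step 2: We calculate the scaled squared deviation, $\left(O_{ij}-E_{ij}\right)^2/E_{ij}$, for each combination of investor category and investment objective. For simplicity, the expected frequencies are rounded to the nearest whole number: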

$$
\begin{array}{c|c|c|c}
& \textbf{Growth} & \textbf{Income} & \textbf{Both} \\ \hline
\text{Employed}
& \frac{(52-42)^2}{42}=2.381
& \frac{(25-36)^2}{36}=3.361
& \frac{(10-9)^2}{9}=0.111 \\ \hline
\text{Retired}
& \frac{(32-42)^2}{42}=2.381
& \frac{(47-36)^2}{36}=3.361
& \frac{(7-8)^2}{8}=0.125 \\ \hline
\textbf{Total}
& \bf{4.762}
& \bf{6.722}
& \bf{0.236}
\end{array}
$$

Step 3: We calculate the value of $\chi^2$:

$$ \chi^2=4.762+6.722+0.236=11.72 $$

Step 4: The critical value of $\chi^2$ is 5.99. It is determined as follows:

There are $(r-1)(c-1)=(2-1)\times(3-1)=2$ degrees of freedom.
It is a one-sided test with a 5% level of significance.

Decision rule: The calculated value of $\chi^2 =11.72$ is greater than the critical value of 5.99. As such, sufficient evidence supports the conclusion that retired and employed investors have different primary investment objectives.
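The four steps can be sketched in plain Python using the observed counts from the contingency table. The hand calculation above rounds the expected frequencies to whole numbers, whereas this sketch keeps them unrounded, so its statistic comes out slightly higher (roughly 12.0 rather than 11.72). The decision is the same. The critical value of 5.99 is the one quoted in Step 4, not computed here.

```python
# Observed counts from the contingency table above
# (rows: employed, retired; columns: growth, income, both).
observed = [
    [52, 25, 10],
    [32, 47, 7],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total x column total) / overall total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over every cell
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 5.99  # chi-square critical value for 2 df at the 5% level, as quoted above
print(f"chi-square = {chi_square:.2f}, df = {df}")
print("Reject H0" if chi_square > critical_value else "Fail to reject H0")
```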

Question

Regarding the chi-square test of independence, which statement is accurate? The chi-square test of independence is:

A. A parametric hypothesis test.

B. Used to test whether two categorical variables are related to each other.

C. Used to test whether two continuous variables are related to each other.

Solution

The correct answer is B. The chi-square test of independence is a non-parametric hypothesis test that can be used to test whether two categorical variables are related.

A is incorrect because the chi-square test of independence is non-parametric, not parametric.

C is incorrect because the chi-square test of independence is used for categorical variables, not continuous variables.



Tests of Independence

Kajal — Fri, 25 Aug 2023 06:46:48 +0000

Parametric versus Non-parametric Tests of Independence

A parametric test is a hypothesis test concerning a population parameter, used when the data satisfy specific distributional assumptions. If these assumptions are not met, non-parametric tests are used.

In summary, researchers use non-parametric testing when:

Data do not meet distributional assumptions.
There are outliers.
Data is given in the form of ranks.
The hypothesis test objective does not concern a parameter.


Hypotheses Concerning Population Correlation Coefficient

We frequently compare the population correlation coefficient to zero when testing for correlation. This helps us determine whether there’s a relationship between the variables. The population correlation coefficient, represented by $\rho$, is used to test the relationship. There are three possible hypotheses:

Two-sided; $H_0: \rho=0 \text{ versus } H_a: \rho\neq 0$.
One-sided right side; $H_0: \rho \le 0 \text{ versus } H_a: \rho \gt 0$.
One-sided left side; $H_0: \rho\geq0 \text{ versus } H_a: \rho \lt 0$.

Let’s assume that we have variables X and Y. The sample correlation, $r_{XY}$, is used to test the above hypotheses.

Parametric Test of a Correlation

The parametric pairwise correlation coefficient, also known as Pearson correlation, is used to test the correlation in a parametric test. The formula for the sample correlation involves the sample covariance between the $X$ and $Y$ variables and their respective standard deviations, which is expressed as:

$$ r=\frac{S_{XY}}{S_XS_Y} $$

Where:

$S_{XY}$= Sample covariance between the $X$ and $Y$ variables.

$S_X$= Standard deviation of the $X$ variable.

$S_Y$ = Standard deviation of the $Y$ variable.

If the two variables are normally distributed, a t-test based on the sample correlation, $r$, can determine whether the null hypothesis should be rejected. The formula for the t-test is:

$$ t=\frac{r\sqrt{n-2}}{\sqrt{\left(1-r^2\right)} } $$

Where:

$r$= Sample correlation.

$n$= Sample size.

$\left(n-2\right)$= Degrees of freedom.

The test statistic follows a t-distribution with $n-2$ degrees of freedom. From the equation above, it is easy to see that as the sample size, $n$, increases, both the degrees of freedom and the absolute value of the test statistic (for a given $r$) increase. In other words, as the sample size $n$ increases, the power of the test increases. This implies that a false null hypothesis is more likely to be rejected as the sample size increases.
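The effect of the sample size on the test statistic can be illustrated with a short sketch. The function name `correlation_t_stat` and the sample correlation of 0.30 are hypothetical choices made for illustration; only the formula comes from the text.

```python
import math


def correlation_t_stat(r: float, n: int) -> float:
    """t-statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical sample correlation, held fixed while the sample size grows.
r = 0.30
for n in (12, 30, 100):
    print(f"n = {n:3d}: df = {n - 2:3d}, t = {correlation_t_stat(r, n):.3f}")
```

Holding $r$ fixed, the printed t-statistic rises with $n$, so the same sample correlation is more likely to be judged significant in a larger sample.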

Example: Parametric Test of a Correlation

The table below shows the sample correlations between the monthly returns of five different sector-specific exchange-traded funds (ETFs) and the overall market index (Market 1). There are 48 monthly observations, and the following ETFs are included in the analysis:

$$ \begin{array}{c|c|c|c|c|c|c}
& \text{ETF } 1 & \text{ETF } 2 & \text{ETF } 3 & \text{ETF } 4 & \text{ETF } 5 & \text{Market } 1 \\ \hline
\text{ETF } 1 & 1 \\ \hline
\text{ETF } 2 & 0.8214 & 1 \\ \hline
\text{ETF } 3 & 0.5672 & 0.6438 & 1 \\ \hline
\text{ETF } 4 & 0.4276 & 0.5789 & 0.4123 & 1 \\ \hline
\text{ETF } 5 & 0.7121 & 0.7942 & 0.6896 & 0.5614 & 1 \\ \hline
\text{Market } 1 & 0.8375 & 0.9096 & 0.7223 & 0.6954 & 0.7919 & 1
\end{array} $$

Using a 1% significance level and the following hypotheses: $H_0:\rho=0 \text{ versus } H_a:\rho\neq 0$, calculate the t-statistic for the correlation between ETF 2 and ETF 4. Based on the calculated t-statistic, draw a conclusion about the significance of the correlation using the following sample t-table:

$$\begin{array}{|lccccc}
\hline \text { df } & \boldsymbol{p}=\mathbf{0 . 1 0} & \boldsymbol{p}=\mathbf{0 . 0 5} & \boldsymbol{p}=\mathbf{0 . 0 2 5} & \boldsymbol{p}=\mathbf{0 . 0 1} & \boldsymbol{p}=\mathbf{0 . 0 0 5} \\
\hline \mathbf{3 1} & 1.309 & 1.696 & 2.040 & 2.453 & 2.744 \\
\mathbf{3 2} & 1.309 & 1.694 & 2.037 & 2.449 & 2.738 \\
\mathbf{3 3} & 1.308 & 1.692 & 2.035 & 2.445 & 2.733 \\
\mathbf{3 4} & 1.307 & 1.691 & 2.032 & 2.441 & 2.728 \\
\mathbf{3 5} & 1.306 & 1.690 & 2.030 & 2.438 & 2.724 \\
\mathbf{3 6} & 1.306 & 1.688 & 2.028 & 2.434 & 2.719 \\
\mathbf{3 7} & 1.305 & 1.687 & 2.026 & 2.431 & 2.715 \\
\mathbf{3 8} & 1.304 & 1.686 & 2.024 & 2.429 & 2.712 \\
\mathbf{3 9} & 1.304 & 1.685 & 2.023 & 2.426 & 2.708 \\
\mathbf{4 0} & 1.303 & 1.684 & 2.021 & 2.423 & 2.704 \\
\mathbf{4 1} & 1.303 & 1.683 & 2.020 & 2.421 & 2.701 \\
\mathbf{4 2} & 1.302 & 1.682 & 2.018 & 2.418 & 2.698 \\
\mathbf{4 3} & 1.302 & 1.681 & 2.017 & 2.416 & 2.695 \\
\mathbf{4 4} & 1.301 & 1.680 & 2.015 & 2.414 & 2.692 \\
\mathbf{4 5} & 1.301 & 1.679 & 2.014 & 2.412 & 2.690 \\
\mathbf{4 6} & 1.300 & 1.679 & 2.013 & 2.410 & 2.687 \\
\mathbf{4 7} & 1.300 & 1.678 & 2.012 & 2.408 & 2.685 \\
\mathbf{4 8} & 1.299 & 1.677 & 2.011 & 2.407 & 2.682
\end{array}$$

Solution

To test the significance of the correlation between ETF 2 and ETF 4, we will use the t-test formula:

$$ t=\frac{r\sqrt{n-2}}{\sqrt{\left(1-r^2\right)} } $$

Where:

$r$ = Sample correlation coefficient (in this case, $r_{ETF2,ETF4}=0.5789$).

$n$ = Number of observations (48 in this case).

Now, let’s calculate the t-statistic:

$$ t=\frac{r\sqrt{n-2}}{\sqrt{\left(1-r^2\right)} }=\frac{0.5789\sqrt{48-2}}{\sqrt{1-{0.5789}^2}}=\frac{3.9263}{0.8154}\approx 4.8152 $$

The calculated t-statistic for the correlation between ETF 2 and ETF 4 is approximately 4.8152.

At the 1% significance level, with a two-tailed test and $df=n-2=46$ degrees of freedom, the critical t-values are approximately $\pm 2.687$.

Conclusion: We reject the null hypothesis since the calculated t-statistic (4.8152) is greater than the upper critical value (+2.687). This indicates sufficient evidence to suggest that the correlation between ETF 2 and ETF 4 differs significantly from zero.

Non-Parametric Test of Correlation: The Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient, $r_S$, is a non-parametric test used to examine the relationship between two data sets when the population deviates from normality.

The Spearman rank correlation coefficient is like the Pearson correlation coefficient. The difference is that the Spearman coefficient is calculated based on the ranks of variables in the samples.

Consider two variables, $X$ and $Y$. We need to calculate Spearman’s Rank Correlation $r_S$.

Steps of Calculating Spearman’s Rank Correlation Coefficient, $\bf{{r}_{S}}$

Rank the observations of each variable $X$ and $Y$ in descending order. Note that when there are tied values in the data, their ranks are calculated by taking the average of the ranks that would have been assigned to those values if they were not tied.
Find the difference between the ranks for each pair of observations.
Square each difference and calculate the sum of the squared differences, that is, $\sum d_i^2$.
Use the following formula to find $r_S$:

$$ r_s=1-\frac{6\sum_{i=1}^{n}d_i^2}{n\left(n^2-1\right)} $$

Where:

$d_i$ = The difference between the ranks for each pair of observations.

$n$ = Sample size.

Example: Calculating Spearman’s Rank Correlation Coefficient

An analyst is studying the relationship between returns for two sectors, steel and cement, over the past 5 years by using Spearman’s rank correlation coefficient. The hypotheses are $H_0: r_S=0$ and $H_a:r_S\neq0$. The returns of both sectors are provided below.

$$ \begin{array}{c|c|c}
\text{Year} & \text{Steel sector returns} & \text{Cement sector returns} \\ \hline
1 & 10\% & 8\% \\ \hline
2 & 6\% & 7\% \\ \hline
3 & 9\% & 5\% \\ \hline
4 & 12\% & 6\% \\ \hline
5 & 8\% & 9\%
\end{array} $$

Calculate the Spearman’s rank correlation coefficient.

Solution

$$ \begin{array}{c|c|c|c|c|c|c}
\textbf{Year} & {\text{Steel} \\ \text{sector} \\ \text{returns} \\ \text{(X)} } & { \text{Cement} \\ \text{sector} \\ \text{returns} \\ \text{(Y)} } & { \text{Rank} \\ \text{order} \\ \text{for X} } & { \text{Rank} \\ \text{order} \\ \text{for Y} } & d & {{d}^{2}} \\ \hline
1 & 10\% & 8\% & 2 & 2 & 0 & 0 \\ \hline
2 & 6\% & 7\% & 5 & 3 & 2 & 4 \\ \hline
3 & 9\% & 5\% & 3 & 5 & -2 & 4 \\ \hline
4 & 12\% & 6\% & 1 & 4 & -3 & 9 \\ \hline
5 & 8\% & 9\% & 4 & 1 & 3 & 9 \\ \hline
& & & & & \text{Sum}= & 26
\end{array} $$

We can now use the formula:

$$ \begin{align*} r_s & =1-\frac{6\sum_{i=1}^{n}d_i^2}{n\left(n^2-1\right)}=1-\left[\frac{\left(6\times26\right)}{5\times\left(5^2-1\right)}\right]=1-1.3 \\
r_s & =-0.3 \end{align*} $$

This indicates a very weak negative correlation between the returns of the steel and cement sectors.
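The ranking steps and the formula can be sketched in a few lines of Python using the returns from this example. The helper names `descending_ranks` and `spearman_rank_correlation` are illustrative. Tied values receive the average of their ranks, as described in the steps, although the simplified formula used here is exact only when there are no ties.

```python
def descending_ranks(values):
    """Rank 1 = largest value; tied values share the average of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1           # first position of v in sorted order
        count = ordered.count(v)               # number of tied observations
        ranks.append(first + (count - 1) / 2)  # average rank across the tie
    return ranks


def spearman_rank_correlation(x, y):
    """r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank differences."""
    n = len(x)
    d_squared = [
        (rx - ry) ** 2
        for rx, ry in zip(descending_ranks(x), descending_ranks(y))
    ]
    return 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))


steel = [10, 6, 9, 12, 8]   # steel sector returns (%), years 1-5
cement = [8, 7, 5, 6, 9]    # cement sector returns (%), years 1-5

print(descending_ranks(steel))
print(descending_ranks(cement))
print(round(spearman_rank_correlation(steel, cement), 4))
```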

Hypothesis Test for the Spearman Rank Correlation

The hypothesis test on the Spearman rank correlation depends on the sample size. If the sample size is small $(n\le30)$, we would need a specialized table of critical values. On the other hand, if the sample size is large $(n \gt 30)$, we can perform a t-test using a test statistic similar to that of the Pearson correlation:

$$ t=\frac{r_s\sqrt{n-2}}{\sqrt{\left(1-r_s^2\right)} } $$

Consider the above example. Assume we want to conduct a hypothesis test at a 5% significance level. The hypotheses statement is $H_0: r_S=0$ and $H_a: r_S\neq 0$

Question 1

Assume an investment analyst, John Smith, is studying the relationship between two stocks, $X$ and $Y$. Based on 100 observations, he has found that $S_{XY} = 10, S_X= 2,$ and $S_Y=8$. Smith needs to find the sample correlation $r_{XY}$ and use it to perform a t-test to determine if there is a significant correlation between the returns of stocks $X$ and $Y$. The critical value for the test statistic at the 0.05 level of significance is approximately 1.96. He should conclude that the statistical relationship between $X$ and $Y$ is:

A. Significant because the test statistic falls outside the range of the critical values.

B. Significant, because the absolute value of the test statistic is less than the critical value.

C. Insignificant because the test statistic falls outside the range of the critical values.

Solution

The correct answer is A.

Note that the sample correlation coefficient, $r_{XY}$ is calculated using the following formula:

$$ r_{XY}=\frac{S_{XY}}{S_XS_Y} $$

Substituting the given values in this formula, we get:

$$ r_{XY}=\frac{10}{2\times8}=0.625 $$

To test the significance of the sample correlation, we can use a t-test with the following null and alternative hypotheses: $H_0: \rho=0$ and $H_a: \rho\neq0$.

The test statistic for this test is calculated using the following formula:

$$ t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

Where:

$r$ = Sample correlation coefficient.

$n$ = The number of observations.

Substituting the given values into this formula, we get:

$$ t=\frac{0.625\sqrt{100-2}}{\sqrt{1-{0.625}^2}}\approx\frac{6.1872}{0.7806}=7.9262 $$

The critical value for the test statistic at the 0.05 level of significance is approximately 1.96.

Since our calculated test statistic (7.9262) is greater than the upper bound of the critical values for the test statistic (1.96), we reject the null hypothesis. This indicates sufficient evidence to suggest that the correlation between X and Y is significantly different from zero.

Therefore, John Smith should conclude that the statistical relationship between $X$ and $Y$ is significant because the test statistic falls outside the range of the critical values (Option A).

Question 2

An analyst is studying the relationship between returns for two sectors, steel and cement, over the past 5 years by using Spearman’s rank correlation coefficient. The returns of both sectors are provided below.

$$ \begin{array}{c|c|c}
\textbf{Year} & \textbf{Steel Sector Returns} & \textbf{Cement Sector Returns} \\ \hline
1 & 2.5\% & 3.2\% \\ \hline
2 & 5\% & 4.5\% \\ \hline
3 & 5.6\% & 4.2\% \\ \hline
4 & -3\% & -1.7\% \\ \hline
5 & 0.5\% & 1.1\%
\end{array} $$

The Spearman’s rank correlation coefficient is closest to:

A. 0.5

B. 0.6

C. 0.8

Solution

$$ \begin{array}{c|c|c|c|c|c|c}
\textbf{Year} & \bf{\text{Steel} \\ \text{Sector} \\ \text{Returns (X)} } & \bf{ \text{Cement} \\ \text{Sector} \\ \text{Returns (Y)} } & \bf{\text{Rank} \\ \text{of X} } & \bf{ \text{Rank} \\ \text{of Y} } & \bf d & \bf{d^2} \\ \hline
1 & 2.5\% & 3.2\% & 3 & 3 & 0 & 0 \\ \hline
2 & 5\% & 4.5\% & 2 & 1 & 1 & 1 \\ \hline
3 & 5.6\% & 4.2\% & 1 & 2 & -1 & 1 \\ \hline
4 & -3\% & -1.7\% & 5 & 5 & 0 & 0 \\ \hline
5 & 0.5\% & 1.1\% & 4 & 4 & 0 & 0 \\ \hline
& & & & & \textbf{Sum} & \bf 2
\end{array} $$

We now use the formula:

$$ \begin{align*} r_s & =1-\frac{6\sum_{i=1}^{n}d_i^2}{n\left(n^2-1\right)} \\ & =1-\frac{6\times2}{5\left(5^2-1\right)} \\ & =0.9 \end{align*} $$

Of the choices offered, 0.9 is closest to 0.8, so the correct answer is C.



Functional Forms for Simple Linear Regression

Kajal — Sat, 19 Aug 2023 10:47:04 +0000

To address non-linear relationships, we employ various functional forms to potentially convert the data for linear regression. Here are three commonly used log transformation functional forms:

Log-lin model: In this log transformation, the dependent variable is logarithmic, while the independent variable is linear. It is represented as shown below.$$ \ln Y_i=b_0+b_1X_i $$
The slope coefficient in the log-lin model is the relative change in the dependent variable for an absolute change in the independent variable.

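When utilizing a log-lin model, caution must be exercised when making forecasts, because the regression predicts $\ln Y$ rather than $Y$. For example, with a fitted regression equation such as $\ln \hat{Y}=-3+5X$, if $X$ is equal to 1, then $\ln \hat{Y}=2$, and the forecast of $Y$ is obtained by taking the antilog:

$$ \hat{Y}=e^{2}\approx 7.389 $$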

Moreover, the lin-lin model cannot be compared with the log-lin model without the transformation. As such, we need to transform $R^2$ and F-statistic.
Lin-log model: In this case, the dependent variable is linear, while the independent variable is logarithmic. It is represented as follows:
$Y_i=b_0+b_1\ln X_i$.

The slope coefficient in the lin-log model is the absolute change in the dependent variable for a relative change in the independent variable.
Log-log model: In this log transformation, both the dependent and independent variables are logarithmic. It is represented as $\ln Y_i=b_0+b_1\ln X_i$. The slope coefficient in the log-log model is the relative change in the dependent variable for a relative change in the independent variable. In other words, if $X$ increases by 1%, $Y$ changes by approximately $b_1$%.
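The three functional forms can be compared side by side in a short sketch. The coefficients and the function names below are hypothetical and chosen only for illustration. The point is how a forecast of $Y$ is recovered in each case, and that the log-log slope behaves approximately like a percentage-to-percentage effect.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 0.5, 0.2


def predict_log_lin(x: float) -> float:
    """ln(Y) = b0 + b1 * X, so the forecast of Y is exp(b0 + b1 * X)."""
    return math.exp(b0 + b1 * x)


def predict_lin_log(x: float) -> float:
    """Y = b0 + b1 * ln(X); no back-transformation is needed."""
    return b0 + b1 * math.log(x)


def predict_log_log(x: float) -> float:
    """ln(Y) = b0 + b1 * ln(X), so the forecast of Y is exp(b0 + b1 * ln(X))."""
    return math.exp(b0 + b1 * math.log(x))


x = 10.0
# Log-log: a 1% increase in X changes Y by roughly b1 percent.
pct_change = (predict_log_log(x * 1.01) / predict_log_log(x) - 1) * 100

print(f"log-lin forecast: {predict_log_lin(x):.4f}")
print(f"lin-log forecast: {predict_lin_log(x):.4f}")
print(f"log-log forecast: {predict_log_log(x):.4f}")
print(f"log-log % change in Y for +1% in X: {pct_change:.4f} (b1 = {b1})")
```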

Selecting the Correct Functional Form

To settle on the correct functional form, consider the following goodness of fit measures:

Coefficient of determination $(R^2)$. A high value is better.
F-statistic. A high value is better.
Standard error of the estimate $(S_e)$. A low value of $S_e$ is better.

In addition to the factors cited above, the patterns in residuals can also be analyzed when evaluating a model. In a good model, residuals are random and uncorrelated.

Question 1

Which of the following statements about the log-lin model is most likely correct?

A. The dependent variable is linear, while the independent variable is logarithmic.

B. Both the dependent and independent variables are logarithmic.

C. The dependent variable is logarithmic, while the independent variable is linear.

The correct answer is C.

In the log-lin model, the dependent variable ($Y$) is logarithmic, while the independent variable ($X$) is linear, as represented by $$\ln Y_i = b_{0} + b_{1}X_{i}$$

A is incorrect. It describes the lin-log model, where the dependent variable is linear and the independent variable is logarithmic.

B is incorrect. It describes the log-log model, where both the dependent and independent variables are logarithmic.



Predicted Value and Prediction Interval of a Dependent Variable

Kajal — Sat, 19 Aug 2023 09:52:46 +0000

We calculate the predicted value of the dependent variable, $Y$, by inserting the estimated value of the independent variable, $X$, into the regression equation. The predicted value of the dependent variable, $Y$, is determined using the following formula:

$$\hat{Y}=\hat{b}_0+\hat{b}_1X$$

Where:

$\hat{Y}$ = Predicted value of the dependent variable.

$X$ = Estimated value of the independent variable.

Example: Calculating the Predicted Value of a Dependent Variable

Refer to the example of regressed inflation rates against unemployment rates from 2011 to 2020.

$$ \begin{array}{c|c|c|c|c|c|c|c}
\text{Year} & \text{Unemployment} & \text{Inflation} & \text{Predicted} & \text{Variation} & \text{Variation} & \text{Variation} & \\
& {\text{Rate } \% (X_i)} & {\text{Rate } \% (Y_i)} & \text{Inflation} & \text{to be} & \text{Unexplained} & \text{Explained} & \\
& & & {\text{Rate } (\hat Y_i)} & \text{Explained} & & & \\
& & & & \left(Y_i-\bar{Y}\right)^2 & \left(Y_i- \hat{Y}_i\right)^2 & \left({\hat{Y}}_i-\bar{Y}\right)^2 & \left(X_i-\bar{X}\right)^2 \\ \hline
2011 & 6.1 & 1.7 & 1.610 & 0.410 & 0.008 & 0.533 & 0.656 \\ \hline
2012 & 7.4 & 1.2 & 0.437 & 1.300 & 0.582 & 3.621 & 4.452 \\ \hline
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \hline
2019 & 4.0 & 4.7 & 3.504 & 5.570 & 1.430 & 1.355 & 1.664 \\ \hline
2020 & 3.9 & 3.6 & 3.594 & 1.588 & 0.000 & 1.573 & 1.932 \\ \hline
\textbf{Sum} & \bf{52.90} & \bf{23.4} & & \bf{13.704} & \bf{3.136} & \bf{10.568} & \bf{12.989} \\ \hline
\textbf{Arithmetic} & \bf{5.29} & \bf{2.34} & & & & & \\
\textbf{Mean} & & & & & & & \\
\end{array} $$

The estimated regression model is illustrated below.

$$ \hat{Y}_i=7.112-0.9020X_i $$

Calculate the predicted inflation rate value if the forecasted value of the unemployment rate is 4.5%.

Solution

The predicted value of the inflation rate is determined as follows:

$$ \hat{Y}=7.112-0.9020\times4.5=3.053\% $$


Confidence Interval for Predicted Values

The confidence interval calculation for the predicted value of a dependent variable is the same as that of the confidence interval for regression coefficients. The confidence interval for a predicted value of the dependent variable is given by:

$$\text{Prediction Interval}=\ \hat{Y}\pm t_cs_f$$

Where:

$t_c$= Two-tailed critical t-value at the given significance level with $n - 2$ df.

$\hat{Y}$ = Predicted value of a dependent variable.

$s_f$= The standard error of the forecast, where $s_f^2$ is the estimated variance of the prediction error:

$$ s_f^2=s_e^2\left[1+\frac{1}{n}+\frac{\left(X_f-\bar{X}\right)^2}{\left(n-1\right)s_X^2}\right]=s_e^2\left[1+\frac{1}{n}+\frac{\left(X_f-\bar{X}\right)^2}{\sum_{i\ =\ 1}^{n}\left(X_i-\bar{X}\right)^2}\right] $$

Where:

$s_e^2$ = The squared standard error of the estimate.

$n$ = Number of observations.

$s_X^2$= Variance of the independent variable.

$X_f$ = Forecasted value of the independent variable.

We can, therefore, calculate the standard error of forecast as shown below:

$$s_f=s_e\sqrt{1+\frac{1}{n}+\frac{\left(X_f-\bar{X}\right)^2}{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}}$$

From the formula above, we can observe that:

A better fit of the regression analysis leads to a smaller standard error of the estimate $(s_e)$, subsequently resulting in a lower standard error of the forecast.
When the sample size $(n)$ in the regression calculation increases, it directly corresponds to a reduction in the standard error of the forecast.
If the forecasted independent variable $(X_f)$ approaches the mean of the independent variable $(\bar{X})$ utilized in the regression analysis, it decreases the standard error of the forecast.
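
The three observations above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `forecast_standard_error` is illustrative, the baseline inputs come from the inflation and unemployment example in this section, and the "better fit" and "larger sample" inputs are hypothetical values that change one quantity at a time (in practice, a larger sample would also change the sum of squared deviations of $X$).

```python
import math

def forecast_standard_error(s_e, n, x_f, x_bar, sxx):
    """Standard error of the forecast: s_e * sqrt(1 + 1/n + (x_f - x_bar)^2 / sxx)."""
    return s_e * math.sqrt(1 + 1 / n + (x_f - x_bar) ** 2 / sxx)

# Baseline: values from the inflation/unemployment regression.
baseline = forecast_standard_error(s_e=0.6261, n=10, x_f=4.5, x_bar=5.29, sxx=12.989)

# Hypothetical variations, changing one input at a time (assumed values).
better_fit = forecast_standard_error(s_e=0.5000, n=10, x_f=4.5, x_bar=5.29, sxx=12.989)
larger_sample = forecast_standard_error(s_e=0.6261, n=40, x_f=4.5, x_bar=5.29, sxx=12.989)
at_the_mean = forecast_standard_error(s_e=0.6261, n=10, x_f=5.29, x_bar=5.29, sxx=12.989)

for label, value in [
    ("baseline", baseline),
    ("better fit", better_fit),
    ("larger sample", larger_sample),
    ("X_f at mean", at_the_mean),
]:
    print(f"{label:<14}{value:.4f}")
```

Each variation should print a value below the baseline, in line with the three observations.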

Example: Calculating the Confidence Interval of the Predicted Value

Refer to the example of regressed inflation rates against unemployment rates from 2011 to 2020.

Consider the results of the regression analysis of inflation rates on unemployment rates:

$$ \begin{array}{lcccc}
\bf{\textit{Regression Statistics} } & & & & \\ \hline
\text{R Square} & 0.7711 & & & \\
\text{Standard Error} & 0.6261 & & & \\
\text{Observations} & 10 & & & \\ \hline \\ \hline
\text{ANOVA} & & & & \\ \hline
& \textbf{df} & \textbf{Sum of} & \textbf{Mean} & \textbf{F} \\
& & \textbf{Squares} & \textbf{Square} & \\ \hline
\text{Regression} & 1 & 10.568 & 10.568 & 26.9565 \\
\text{Residual} & 8 & 3.136 & 0.392 & \\
\text{Total} & 9 & 13.704 & & \\ \hline
\\ \hline
& \textbf{Coefficients} & \textbf{Standard} & \textbf{t Stat} & \textbf{p-value} \\
& & \textbf{Error} & & \\ \hline
\text{Intercept} & 7.112 & 0.940 & 7.565 & 0.000 \\
\text{Unemployment} & -0.902 & 0.174 & -5.192 & 0.001 \\
\text{rate (\%)} & & & & \\ \hline
\end{array} $$

Given that the forecasted unemployment rate is 4.5%, calculate the 95% confidence interval for the predicted inflation rate value.

Solution

$$\text{Prediction Interval} = \hat{Y} \pm t_{c}s_{f}$$

The estimated variance of the prediction error is:

$$ \begin{align*}
s_f^2 & =s_e^2\left[1+\frac{1}{n}+\frac{\left(X_f-\bar{X}\right)^2}{\left(n-1\right)s_X^2}\right] \\ & =s_e^2\left[1+\frac{1}{n}+\frac{\left(X_f-\bar{X}\right)^2}{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}\right] \\ & ={0.6261}^2\left[1+\frac{1}{10}+\frac{\left(4.5-5.29\right)^2}{12.989}\right]=0.450 \end{align*} $$

As such, the standard error of forecast is:

$$ s_f=\sqrt{0.450}=0.6708 $$

The predicted value of the inflation rate given an unemployment rate of 4.5% is:

$$ \hat{Y}=7.112-0.9020\times4.5=3.05\% $$

The two-tailed critical t-value with 8 $(n-2)$ degrees of freedom at the 5% significance level is 2.306.

The prediction interval at the 95% confidence level is:

$$\text{Prediction Interval (PI)} = \hat{Y} \pm t_{c}s_{f}$$

$$\text{PI} = 3.05 \pm 2.306\times 0.6708= 1.50\% \text{ to } 4.60\%$$

Interpretation

Given an unemployment rate of 4.5%, we are 95% confident that the inflation rate will lie between 1.50% and 4.60%.
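
The calculation above can be reproduced with a short script. This is a sketch under stated assumptions: the function name `prediction_interval` is illustrative, and the critical t-value (2.306) is passed in as read from a t-table rather than computed. Because the script does not round the intermediate values, its bounds can differ from the hand calculation in the last decimal place.

```python
import math

def prediction_interval(b0, b1, x_f, s_e, n, x_bar, sxx, t_crit):
    """Return (predicted value, lower bound, upper bound) for a forecast at x_f."""
    y_hat = b0 + b1 * x_f
    # Standard error of the forecast.
    s_f = s_e * math.sqrt(1 + 1 / n + (x_f - x_bar) ** 2 / sxx)
    margin = t_crit * s_f
    return y_hat, y_hat - margin, y_hat + margin

# Inputs taken from the regression output and the data table in this section.
y_hat, lower, upper = prediction_interval(
    b0=7.112, b1=-0.9020, x_f=4.5,
    s_e=0.6261, n=10, x_bar=5.29, sxx=12.989,
    t_crit=2.306,  # two-tailed, 5% significance, 8 degrees of freedom
)
print(f"Predicted inflation: {y_hat:.3f}%")
print(f"95% prediction interval: {lower:.2f}% to {upper:.2f}%")
```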

Question 1

The regression equation of the quantity of goods against the price is given by:

$$Y =-159+0.26X$$

Where:

$Y$ = Quantity supplied.

$X$ = Price per unit of the product.

The predicted value of the quantity supplied when the price equals 1,200 is closest to:

A. 153.

B. 155.

C. 471.

The correct answer is A.

$$Y = -159 + 0.26\times1,200=153$$


The post Predicted Value and Prediction Interval of a Dependent Variable appeared first on AnalystPrep | CFA® Exam Study Notes.

Analysis of Variance (ANOVA)

Kajal — Fri, 18 Aug 2023 11:06:19 +0000

The sums of squares of a regression model are usually presented in the Analysis of Variance (ANOVA) table. The ANOVA table contains the sums of squares (SST, SSE, and SSR), the degrees of freedom, the mean squares (MSR and MSE), and the F-statistic.

The typical format of ANOVA is as shown below:

$$ \begin{array}{c|c|c|c|c}
\textbf{Source} & \textbf{Sum of Squares} & \textbf{Degrees} & \textbf{Mean} & \textbf{F-statistic} \\
& & \textbf{of} & \textbf{square} & \\
& & \textbf{Freedom} & & \\ \hline
\text{Regression (Explained)} & SSR=\sum_{i=1}^{n}\left(\widehat{Y_i}-\bar{Y}\right)^2 & 1 & MSR=\frac{SSR}{1} & F=\frac{MSR}{MSE} \\ \hline
\text{Residual (Unexplained)} & SSE=\sum_{i=1}^{n}\left(Y_i-\widehat{Y_i}\right)^2 & n-2 & MSE=\frac{SSE}{n-2} & \\ \hline
\text{Total} & SST=\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2 & n-1 & &
\end{array} $$

Standard Error of Estimate

The Standard Error of Estimate, $S_e$ or $SEE$, is also referred to as the root mean square error or standard error of the regression. It measures the distance between the observed values of the dependent variable and the values predicted by the regression model. The Standard Error of Estimate is easily calculated from the ANOVA table using the following formula:

$${\text{Standard Error of Estimate }}(S_e)=\sqrt{MSE}=\sqrt{\frac{\sum_{i = 1}^{n}\left(Y_i-{\hat{Y}}_i\right)^2}{n-2}}$$

The standard error of estimate, coefficient of determination, and F-statistic are the measures that can be used to gauge the goodness of fit of a regression model. In other words, these measures tell us how well a regression model fits the data.

The smaller the Standard Error of Estimate is, the better the fit of the regression line. However, the Standard Error of Estimate does not tell us how well the independent variable explains the variation in the dependent variable.


Example: Calculating and Interpreting F-Statistic

The completed ANOVA table for the regression model of the inflation rate against the unemployment rate over 10 years is given below:

$$ \begin{array}{c|c|c|c|c}
\textbf{Source} & \textbf{Sum of} & \textbf{Degrees of} & \textbf{Mean Sum} & \textbf{F-Statistic} \\
& \textbf{Squares} & \textbf{Freedom} & \textbf{of Squares} & \\ \hline
\text{Regression} & 10.568 & 1 & 10.568 & ? \\ \hline
\text{Error} & 3.136 & 8 & 0.392 \\ \hline
\text{Total} & 13.704 & 9
\end{array} $$

Use the above ANOVA table to calculate the F-statistic.
Test the hypothesis that the slope coefficient equals zero at a 5% significance level.

Solution

$F=\frac{{\text{Mean Regression Sum of Squares (MSR)}}}{{\text{Mean Squared Error (MSE)}}}=\frac{10.568}{0.392}\approx 26.96$

We are testing the null hypothesis $H_0:b_1=0$ against the alternative hypothesis $H_1: b_1 \neq 0$. The critical F-value for $k = 1$ and $n-2 = 8$ degrees of freedom at a 5% significance level is roughly 5.32. Note that this is a one-tailed test, so we use the 5% F-table. The null hypothesis is rejected if the calculated value of the F-statistic is greater than the critical value of F. Since $26.96 \gt 5.32$, we reject the null hypothesis and conclude that the slope coefficient is significantly different from zero. We also rejected the null hypothesis in the previous examples because the 95% confidence interval did not include zero.

An F-test duplicates the t-test of the slope coefficient's significance for a linear regression model with one independent variable. The critical values are linked in the same way: the squared two-tailed critical t-value, $t_c^2={2.306}^2\approx 5.32$, equals the critical F-value. Since the F-statistic is the square of the t-statistic for the slope coefficient, its inferences are the same as the t-test. However, this is not the case for multiple regressions.
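
As a quick numerical check of the link between the two tests, the sketch below computes the F-statistic from the ANOVA table and the t-statistic for the slope, then compares $F$ with $t^2$. The slope estimate (−0.9020) and the sum of squared deviations of the unemployment rate (12.989) are taken from the same inflation example; the variable names are illustrative.

```python
import math

# ANOVA inputs for the inflation/unemployment regression.
ssr, sse = 10.568, 3.136
n, k = 10, 1

msr = ssr / k            # mean square regression
mse = sse / (n - k - 1)  # mean squared error
f_stat = msr / mse

# t-statistic for H0: b1 = 0, using the slope estimate and the
# sum of squared deviations of X from the same example.
b1_hat = -0.9020
sxx = 12.989
s_b1 = math.sqrt(mse / sxx)  # standard error of the slope
t_stat = b1_hat / s_b1

print(f"F = {f_stat:.2f}, t = {t_stat:.3f}, t^2 = {t_stat ** 2:.2f}")
```

The two values agree up to rounding in the inputs, which is why the F-test and the two-sided t-test lead to the same conclusion in simple linear regression.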

Question

Consider the following analysis of variance (ANOVA) table:

$$
\begin{array}{c|c|c|c}
\textbf {Source} & \textbf {Degrees of} & \textbf { Sum of } & \textbf {Mean Sum} \\
& \textbf{Freedom} & \textbf {Squares} & \textbf{of Squares} \\ \hline
\text {Regression} & 1 & 1,701,563 & 1,701,563 \\ \hline
\text {Error} & 8 & 106,800 & 13,350 \\
\text {(Unexplained)} & & & \\ \hline
\text {Total} & 9 & 1,808,363 & \\
\end{array}
$$

The value of $R^2$ and the F-statistic for the test of fit of the regression model are closest to:

A. 6% and 16.

B. 94% and 127.

C. 99% and 127.

Solution

The correct answer is B.

$$R^2=\frac{\text{Sum of Squares Regression (SSR)}}{\text{Sum of Squares Total (SST)}}=\frac{1,701,563}{1,808,363}=0.94=94\%$$

$$ \begin{align*} F & =\frac{\text{Mean Regression Sum of Squares (MSR)}}{\text{Mean Squared Error (MSE)}} \\ & =\frac{1,701,563}{13,350} =127.46\approx 127 \end{align*}$$


The post Analysis of Variance (ANOVA) appeared first on AnalystPrep | CFA® Exam Study Notes.

Measures of Fit and Hypothesis Tests of Regression Coefficients

Kajal — Thu, 17 Aug 2023 05:33:27 +0000

The Sum of Squares Total (SST) and Its Components

The Sum of Squares Total (total variation) is a measure of the total variation of the dependent variable. It is the sum of the squared differences between the actual y-values and the mean of the y-observations.

$$ SST=\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2 $$

The Sum of Squares Total contains two parts:

The Sum of Squares Regression (SSR).
The Sum of Squares Error (SSE).

The Sum of Squares Regression (SSR): The sum of squares regression measures the explained variation in the dependent variable. It is given by the sum of the squared differences between the predicted y-value, ${\hat{Y}}_i$, and the mean of the y-observations, $\bar{Y}$:

$$ SSR=\sum_{i=1}^{n}\left({\hat{Y}}_i-\bar{Y}\right)^2 $$

The Sum of Squared Errors (SSE): The sum of squared errors is also called the residual sum of squares. It is defined as the variation of the dependent variable unexplained by the independent variable. SSE is given by the sum of the squared differences between the actual y-value, $Y_i$, and the predicted y-value, ${\hat{Y}}_i$:

$$ {SSE}=\sum_{i=1}^{n}\left(Y_i-{\hat{Y}}_i\right)^2 $$

Therefore, the sum of squares total is given by:

$$ \begin{align*} \text{Sum of Squares Total} & ={\text{Explained Variation} + \text{Unexplained Variation}} \\ & ={SSR+ SSE} \end{align*} $$

The components of the total variation are shown in the following figure.

For example, consider the following table. We wish to use linear regression analysis to forecast inflation, given unemployment data from 2011 to 2020.

$$ \begin{array}{c|c|c}
\text{Year} & {\text{Unemployment Rate } (\%)} & {\text{Inflation Rate } (\%)} \\ \hline
2011 & 6.1 & 1.7 \\ \hline
2012 & 7.4 & 1.2 \\ \hline
2013 & 6.2 & 1.3 \\ \hline
2014 & 6.2 & 1.3 \\ \hline
2015 & 5.7 & 1.4 \\ \hline
2016 & 5.0 & 1.8 \\ \hline
2017 & 4.2 & 3.3 \\ \hline
2018 & 4.2 & 3.1 \\ \hline
2019 & 4.0 & 4.7 \\ \hline
2020 & 3.9 & 3.6
\end{array} $$

Remember that we had estimated the regression line to be $\hat{Y}_i=7.112-0.9020X_i$. As such, we can create the following table:

$$ \begin{array}{c|c|c|c|c|c|c|c}
\text{Year} & \text{Unemployment} & \text{Inflation} & \text{Predicted} & \text{Variation} & \text{Variation} & \text{Variation} & (X_i \\
& {\text{Rate } \% (X_i)} & {\text{Rate }\%} & \text{Inflation} & \text{to be} & \text{Unexplained} & \text{Explained} & -\bar{X})^2 \\
& & ({{Y}}_i) & {\text{rate } (\hat Y_i)} & \text{Explained.} & & & \\
& & & & \left(Y_i-\bar{Y}\right)^2 & \left(Y_i- \hat{Y}_i\right)^2 & \left({\hat{Y}}_i-\bar{Y}\right)^2 & \\ \hline
2011 & 6.1 & 1.7 & 1.610 & 0.410 & 0.008 & 0.533 & 0.656 \\ \hline
2012 & 7.4 & 1.2 & 0.437 & 1.300 & 0.582 & 3.621 & 4.452 \\ \hline
2013 & 6.2 & 1.3 & 1.520 & 1.082 & 0.048 & 0.673 & 0.828 \\ \hline
2014 & 6.2 & 1.3 & 1.520 & 1.082 & 0.048 & 0.673 & 0.828 \\ \hline
2015 & 5.7 & 1.4 & 1.971 & 0.884 & 0.326 & 0.136 & 0.168 \\ \hline
2016 & 5.0 & 1.8 & 2.602 & 0.292 & 0.643 & 0.069 & 0.084 \\ \hline
2017 & 4.2 & 3.3 & 3.324 & 0.922 & 0.001 & 0.967 & 1.188 \\ \hline
2018 & 4.2 & 3.1 & 3.324 & 0.578 & 0.050 & 0.967 & 1.188 \\ \hline
2019 & 4.0 & 4.7 & 3.504 & 5.570 & 1.430 & 1.355 & 1.664 \\ \hline
2020 & 3.9 & 3.6 & 3.594 & 1.588 & 0.000 & 1.573 & 1.932 \\ \hline
\textbf{Sum} & \bf{52.90} & \bf{23.4} & & \bf{13.704} & \bf{3.136} & \bf{10.568} & \bf{12.989} \\ \hline
\textbf{Arithmetic} & \bf{5.29} & \bf{2.34} & & & & & \\
\textbf{Mean} & & & & & & & \\
\end{array} $$

From the table above, we can calculate the following:

$$ \begin{align*}
SST & =\sum_{i=1}^{n}{\left(Y_i-\bar{Y}\right)^2=13.704} \\
SSR & =\sum_{i=1}^{n}\left({\hat{Y}}_i-\bar{Y}\right)^2 =10.568 \\
{SSE} & =\sum_{i=1}^{n}\left(Y_i-{\hat{Y}}_i\right)^2=3.136
\end{align*} $$
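
The sums above can also be reproduced from the raw data. The following is a minimal sketch in plain Python that fits the regression line by ordinary least squares and computes SST, SSR, and SSE; the variable names are illustrative and are not part of the original notes.

```python
# Data from the table: unemployment rate (X) and inflation rate (Y), 2011-2020.
unemployment = [6.1, 7.4, 6.2, 6.2, 5.7, 5.0, 4.2, 4.2, 4.0, 3.9]
inflation = [1.7, 1.2, 1.3, 1.3, 1.4, 1.8, 3.3, 3.1, 4.7, 3.6]

n = len(unemployment)
x_bar = sum(unemployment) / n
y_bar = sum(inflation) / n

# Ordinary least squares estimates of the slope and intercept.
sxx = sum((x - x_bar) ** 2 for x in unemployment)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(unemployment, inflation))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Predicted values and the three sums of squares.
y_hat = [b0 + b1 * x for x in unemployment]
sst = sum((y - y_bar) ** 2 for y in inflation)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)                   # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(inflation, y_hat))    # unexplained variation

print(f"b0 = {b0:.3f}, b1 = {b1:.4f}")
print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
```

Because the fitted line includes an intercept, SSR and SSE add up to SST, up to floating-point error.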


Measures of Goodness of Fit

We use the following measures to analyze the goodness of fit of simple linear regression:

Coefficient of determination.
F-statistic for the test of fit.
Standard error of the regression.

Coefficient of Determination

The coefficient of determination $(R^2)$ measures the proportion of the total variability of the dependent variable explained by the independent variable. It is calculated using the formula below:

$$ \begin{align*} R^2 =\frac{\text{Explained Variation} }{\text{Total Variation}}& =\frac{\text{Sum of Squares Regression (SSR)} }{\text{Sum of Squares Total (SST)}} \\ & =\frac{\sum_{i=1}^{n}\left({\hat{Y}}_i-\bar{Y}\right)^2}{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2} \end{align*} $$

Intuitively, we can think of the above formula as:

$$ \begin{align*}
R^2 & =\frac{\text{Total Variation}-\text{Unexplained Variation} }{\text{Total Variation}}\\
& =\frac{\text{Sum of Squares Total (SST)}-\text{Sum of Squared Errors (SSE)} }{\text{Sum of Squares Total}} \end{align*} $$

Simplifying the above formula gives:

$$ R^2=1-\frac{\text{Sum of Squared Errors (SSE)} }{\text{Sum of Squares Total (SST)}} $$

In the above example, the coefficient of determination is:

$$ \begin{align*}
R^2 & =\frac{\text{Explained Variation} }{\text{Total Variation}} \\ & =\frac{\text{Sum of Squares Regression (SSR)} }{\text{Sum of Squares Total (SST)}} \\ & =\frac{10.568}{13.704}=77.12\% \end{align*} $$

Features of Coefficient of Determination ($R^2$)

$R^2$ lies between 0% and 100%. A high $R^2$ means the model explains more of the variability than a low $R^2$. If $R^2$ = 1%, only 1% of the total variability can be explained. On the other hand, if $R^2$ = 90%, 90% of the total variability can be explained. In a nutshell, the higher the $R^2$, the higher the model’s explanatory power.

For simple linear regression, $R^2$ can also be calculated by squaring the correlation coefficient between the dependent and the independent variables:

$$ r^2=R^2=\left(\frac{Cov\left(X,Y\right)}{\sigma_X\sigma_Y}\right)^2=\frac{\sum_{i=1}^{n}\left({\hat{Y}}_i-\bar{Y}\right)^2}{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2} $$

Where:

$Cov\left(X,Y\right)$ = Covariance between two variables, $X$ and $Y$.

$\sigma_X$ = Standard deviation of $X$.

$\sigma_Y$ = Standard deviation of $Y$.

Example: Calculating Coefficient of Determination $({R}^{2})$

An analyst determines that $\sum_{i= 1}^{10}\left(Y_i-\bar{Y}\right)^2= 13.704$ and $\sum_{i = 1}^{10}\left(Y_i-{\hat{Y}}_i\right)^2=3.136$ from the regression analysis of inflation rates on unemployment rates. Calculate the coefficient of determination $(R^2)$.

Solution

$$ \begin{align*}
R^2 & =\frac{{\text{Sum of Squares Total (SST)}-\text{Sum of Squared Errors (SSE)} } }{\text{Sum of Squares Total (SST)}} \\ & =\frac{\left(\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2-\sum_{i=1}^{n}\left(Y_i-{\hat{Y}}_i\right)^2\right)}{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}=\frac{13.704-3.136}{13.704} \\ & =0.7712=77.12\% \end{align*} $$

F-statistic in Simple Regression Model

Note that the coefficient of determination discussed above is just a descriptive value. To check the statistical significance of a regression model, we use the F-test, which requires us to calculate the F-statistic.

In simple linear regression, the F-test checks whether the slope (denoted by $b_1$) in a regression model is equal to zero. The null hypothesis is typically formulated as $H_0:b_1=0$ against the alternative hypothesis $H_1:b_1\neq0$. Equivalently, under the confidence interval approach, the null hypothesis is rejected if the confidence interval for the slope at the desired significance level excludes zero.

The Sum of Squares Regression (SSR) and Sum of Squares Error (SSE) are employed to calculate the F-statistic. In the calculation, the Sum of Squares Regression (SSR) and Sum of Squares Error (SSE) are adjusted for the degrees of freedom.

The Sum of Squares Regression (SSR) is divided by the number of independent variables $(k)$ to get the Mean Square Regression (MSR). That is:

$$ MSR=\frac{SSR}{k} = \frac{\sum_{i = 1}^{n}\left(\widehat{Y_i}-\bar{Y}\right)^2}{k} $$

Since $k=1$ in a simple linear regression model, the above formula changes to:

$$ MSR=\frac{SSR}{1}=\frac{\sum_{i = 1}^{n}\left(\widehat{Y_i}-\bar{Y}\right)^2}{1}=\sum_{i = 1}^{n}\left({\hat{Y}}_i-\bar{Y}\right)^2 $$

Therefore, in the Simple Linear Regression Model, MSR = SSR.

Also, the Sum of Squares Error (SSE) is divided by degrees of freedom given by $n-k-1$ (this translates to $n-2$ for simple linear regression) to arrive at Mean Square Error (MSE). That is,

$$ MSE=\frac{\text{Sum of Squares Error (SSE)}}{n-k-1}=\frac{\sum_{i=1}^{n}\left(Y_i-{\hat{Y}}_i\right)^2}{n-k-1} $$

For a simple linear regression model,

$$ MSE =\frac{\text{Sum of Squares Error (SSE)}}{n-2} =\frac{\sum_{i =1 }^{n}\left(Y_i-{\hat{Y}}_i\right)^2}{n-2} $$

Finally, to calculate the F-statistic for the linear regression, we find the ratio of MSR to MSE. That is,

$$ \begin{align*} \text{F-statistic} = \frac{MSR}{MSE} = \frac{\frac{SSR}{k}}{\frac{SSE}{n-k-1}} = \frac{\frac{\sum_{i=1}^{n}\left(\widehat{Y_i}-\bar{Y}\right)^2}{k}}{\frac{\sum_{i = 1 }^{n}\left(Y_i-\widehat{Y_i}\right)^2}{n-k-1}} \end{align*} $$

For simple linear regression, this translates to:

$$ \begin{align*} \text{F-statistic}=\frac{MSR}{MSE} =\frac{\frac{SSR}{k}}{\frac{SSE}{n-k-1}} = \frac{\sum_{i = 1}^{n}\left(\widehat{Y_i}-\bar{Y}\right)^2}{\frac{\sum_{i = 1}^{n}\left(Y_i-\widehat{Y_i}\right)^2}{n-2}} \end{align*} $$

The F-statistic in simple linear regression is F-distributed with $(1)$ and $(n-2)$ degrees of freedom. That is,

$$ \frac{MSR}{MSE}\sim F_{1,n-2} $$

Note that the F-test in regression analysis is a one-sided test, with the rejection region on the right side. This is because the objective is to test whether the explained variation in Y (the numerator) is large relative to the unexplained variation in Y (the denominator).

Interpretation of F-test Statistic

A large F-statistic indicates that the regression model explains much of the variation in the dependent variable, while a small F-statistic indicates the opposite. At the extreme, an F-statistic of 0 indicates that the independent variable explains none of the variation in the dependent variable.

We reject the null hypothesis if the calculated value of the F-statistic is greater than the critical F-value.

It is worth mentioning that F-statistics are not commonly used in regressions with one independent variable. This is because the F-statistic is equal to the square of the t-statistic for the slope coefficient, which implies the same thing as the t-test.

Standard Error of Estimate

The Standard Error of Estimate, $S_e$ or SEE, is alternatively referred to as the root mean square error or standard error of the regression. It measures the distance between the observed values of the dependent variable and the values the regression model predicts. It is calculated as follows:

$$ {\text{Standard Error of Estimate}}\left(S_e\right)=\sqrt{MSE}=\sqrt{\frac{\sum_{i = 1}^{n}\left(Y_i-{\hat{Y}}_i\right)^2}{n-2}} $$

The standard error of estimate, coefficient of determination, and F-statistic are the measures that can be used to gauge the goodness of fit of a regression model. In other words, these measures tell us how well a regression model fits the data.

The smaller the Standard Error of Estimate is, the better the fit of the regression line. However, the Standard Error of Estimate does not tell us how well the independent variable explains the variation in the dependent variable.
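
The three goodness-of-fit measures can be computed together from SST and SSE. The sketch below is illustrative only: the function name `goodness_of_fit` and its dictionary keys are assumptions, and the inputs are the sums of squares from the inflation example.

```python
import math

def goodness_of_fit(sst, sse, n, k=1):
    """Return R-squared, the F-statistic, and the standard error of estimate."""
    ssr = sst - sse              # explained variation
    msr = ssr / k                # mean square regression
    mse = sse / (n - k - 1)      # mean squared error
    return {
        "r_squared": ssr / sst,
        "f_stat": msr / mse,
        "see": math.sqrt(mse),
    }

# Sums of squares from the inflation/unemployment example (n = 10 observations).
fit = goodness_of_fit(sst=13.704, sse=3.136, n=10)
print(f"R^2 = {fit['r_squared']:.4f}")
print(f"F   = {fit['f_stat']:.2f}")
print(f"SEE = {fit['see']:.4f}")
```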

Hypothesis Tests of Regression Coefficients

Hypothesis Test on the Slope Coefficient

Note that the F-statistic discussed above is used to test whether the slope coefficient is significantly different from 0. However, we may also wish to test whether the population slope differs from a specific value or is positive. To accomplish this, we use the t-test.

The process of performing the t-distributed test is as follows:

State the hypothesis: Typical hypothesis statements include:
- $H_0: b_1 =0 \text{ versus } H_a: b_1 \neq 0$
- $H_0: b_1\le 0 \text{ versus } H_a: b_1> 0$
Identify the appropriate test statistic: The test statistic for the t-test on the slope coefficient is given by:

$$ t=\frac{{\hat{b}}_1-B_1}{s_{{\hat{b}}_1}} $$

Where:

$B_1$ = Hypothesized slope coefficient.

${\hat{b}}_1$ = Point estimate for $b_1$.

$s_{{\hat{b}}_1}$ = Standard error of the slope coefficient.

The test statistic is t-distributed with $n-k-1$ degrees of freedom. In simple linear regression, this is $n-2$ degrees of freedom. The standard error of the slope coefficient $(s_{{\hat{b}}_1})$ is calculated as the ratio of the standard error of estimate $(s_e)$ to the square root of the variation of the independent variable:

$$ s_{{\hat{b}}_1}=\frac{s_e}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}} $$

Where:

$$ s_e=\sqrt{MSE} $$

Specify the level of significance: Note the level of significance, usually denoted by alpha, $\alpha$. A typical significance level is $\alpha=5\%$.
State the decision rule: Using the significance level, find the critical values. You can use the t-table, spreadsheets such as Excel, statistical software such as R, or programming languages such as Python. In an exam situation, such critical values will be provided. Compare the t-statistic value to the critical t-value $(t_c)$. For a two-sided test, reject the null hypothesis if the t-statistic is greater than the upper critical t-value or less than the lower critical t-value, i.e., $t \gt +t_{\text{critical}}$ or $t \lt -t_{\text{critical}}$.
Calculate the test statistic: Using the formula above, calculate the test statistic. You will usually need to calculate the standard error of the slope coefficient $(s_{{\hat{b}}_1})$ first.
Make a decision: Decide whether to reject or fail to reject the null hypothesis.

Example: Hypothesis Test Concerning Slope Coefficient

Recall the example where we regressed inflation rates against unemployment rates from 2011 to 2020.

The estimated regression model is

$$ \hat{Y}_i=7.112-0.9020X_i $$

Assume that we need to test whether the slope coefficient of the unemployment rates is positive at a 5% significance level.

The hypotheses are as follows:

$H_0: b_1 \le 0 \text{ versus } H_a: b_1 \gt 0 $

Next, we need to calculate the test statistic given by:

$t=\frac{{\hat{b}}_1-B_1}{s_{{\hat{b}}_1}} $

Where:

$$ s_{{\hat{b}}_1\ }=\frac{s_e}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}} $$

Recall that,

$$ s_e=\sqrt{MSE}=\sqrt{\frac{SSE}{n-k-1}}=\sqrt{\frac{\sum_{i = 1 }^{n}\left(Y_i-{\hat{Y}}_i\right)^2}{n-2}}=\sqrt{\frac{3.136}{8}}=0.6261 $$

So that,

$$ s_{{\hat{b}}_1\ }=\frac{s_e}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}}=\frac{0.6261}{\sqrt{12.989}}=0.1737 $$

Therefore,

$$ t=\frac{{\hat{b}}_1-B_1}{s_{{\hat{b}}_1}}=\frac{-0.9020-0}{0.1737}=-5.193 $$

Next, we need to find the critical t-value. Note that this is a one-sided test. As such, we need to find $t_{8,0.05}$ from the t-table.

From the table, $t_{8,0.05}=1.860$. We fail to reject the null hypothesis since the calculated test statistic is less than the critical t-value $(-5.193 \lt 1.860)$. There is insufficient evidence to indicate that the slope coefficient is positive.
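
The steps of this test can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `slope_t_statistic` is hypothetical, and the critical value (1.860) is supplied from the t-table rather than computed.

```python
import math

def slope_t_statistic(b1_hat, b1_null, s_e, sxx):
    """Return the t-statistic for the slope and its standard error."""
    s_b1 = s_e / math.sqrt(sxx)  # standard error of the slope coefficient
    return (b1_hat - b1_null) / s_b1, s_b1

# Inputs from the inflation/unemployment example.
t_stat, s_b1 = slope_t_statistic(b1_hat=-0.9020, b1_null=0.0, s_e=0.6261, sxx=12.989)

# Right-tailed test of whether the slope is positive:
# reject H0 only if t exceeds the one-sided critical value.
t_crit = 1.860  # t-table value for 8 degrees of freedom, 5% one-sided
reject_h0 = t_stat > t_crit

print(f"s_b1 = {s_b1:.4f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```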

Relationship between the Hypothesis Test of Correlation and Slope Coefficient

In simple linear regression, the t-statistic used to test whether the slope coefficient equals zero is the same as the t-statistic used to test whether the pairwise correlation is zero.

This holds for two-sided tests ($H_0: \rho = 0$ versus $H_a: \rho \neq 0$, and $H_0: b_1 = 0$ versus $H_a: b_1 \neq 0$) and for one-sided tests ($H_0: \rho\le 0$ versus $H_a: \rho \gt 0$, and $H_0: b_1\le 0$ versus $H_a: b_1 \gt 0$; or $H_0: \rho \ge 0$ versus $H_a: \rho \lt 0$, and $H_0: b_1 \ge 0$ versus $H_a: b_1 \lt 0$).

Note that the test statistic to test whether the correlation is equal to zero is given by:

$$ t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

The above test statistic is t-distributed with $(n-2)$ degrees of freedom.

Consider our previous example, where we regressed inflation rates against unemployment rates from 2011 to 2020. Assume we want to test whether the pairwise correlation between the unemployment and inflation rates equals zero.

In the example, the correlation between unemployment and inflation rates is -0.8782. As such, the test statistic to test whether the correlation is equal to zero is

$$ t=\frac{-0.8782\sqrt{10-2}}{\sqrt{1-{(-0.8782)}^2}}\approx-5.19 $$

Note that this is equal to the t-statistic used to test whether the slope coefficient is zero:

$$
t=\frac{{\hat{b}}_1-B_1}{s_{{\hat{b}}_1}}=\frac{-0.9020-0}{0.1737}=-5.193 $$
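
The equivalence can be checked numerically. The sketch below computes both statistics from the figures quoted in this section; the function name `correlation_t_statistic` is illustrative, and the small gap between the two results comes from rounding in the quoted inputs.

```python
import math

def correlation_t_statistic(r, n):
    """t-statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Correlation between unemployment and inflation rates, 10 observations.
t_corr = correlation_t_statistic(r=-0.8782, n=10)

# Slope-based t-statistic from the same example, for comparison.
t_slope = (-0.9020 - 0) / 0.1737

print(f"t from correlation: {t_corr:.3f}")
print(f"t from slope:       {t_slope:.3f}")
```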

Hypothesis Test of the Intercept Coefficient

Similar to the slope coefficient, we may also want to test whether the population intercept equals a certain value. The process is similar to that of the slope coefficient. However, the test statistic for the t-test on the intercept is given by:

$$ t=\frac{{\hat{b}}_0-B_0}{s_{{\hat{b}}_0}} $$

Where:

$B_0$ = Hypothesized intercept coefficient.

${\hat{b}}_0$ = Point estimate for $b_0$.

$s_{{\hat{b}}_0}$ = Standard error of the intercept.

The formula for the standard error of the intercept $s_{{\hat{b}}_0}$ is given by:

$$ s_{{\hat{b}}_0}=s_e\sqrt{\frac{1}{n}+\frac{{\bar{X}}^2}{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}} $$

Recall the example where inflation rates were regressed against unemployment rates from 2011 to 2020.

The estimated regression model is

$$ \hat{Y}_i=7.112-0.9020X_i $$

Assume that we need to test whether the intercept is greater than 1 at a 5% significance level.

The hypotheses are as follows:

$$ H_0: b_0\le 1 \text{ versus } H_a: b_0 \gt 1 $$

Next, we need to calculate the test statistic given by:

$$ t=\frac{{\hat{b}}_0-B_0}{s_{{\hat{b}}_0}} $$

Where:

$$ s_{{\hat{b}}_0}=s_e\sqrt{\frac{1}{n}+\frac{{\bar{X}}^2}{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}}=0.6261\sqrt{\frac{1}{10}+\frac{{5.29}^2}{{12.989}}}=0.6261\times1.501=0.940 $$

Therefore,

$$ t=\frac{7.112-1}{0.940}=6.502 $$

Note that this is a one-sided test. From the table, $t_{8,0.05}=1.860$. Since the calculated test statistic is greater than the critical t-value $(6.502 \gt 1.860)$, we reject the null hypothesis. There is sufficient evidence to indicate that the intercept is greater than 1.

Hypothesis Tests Concerning Slope Coefficient When Independent Variable is an Indicator Variable

Dummy variables, also known as indicator variables or binary variables, are used in regression analysis to represent categorical data with two or more categories. They are particularly useful for including qualitative information in a model that requires numerical input variables.

Example: Regression Analysis With Indicator Variables

Assume we aim to investigate if a stock’s inclusion in an Environmental, Social, and Governance (ESG) focused fund affects its monthly stock returns. In this case, we’ll analyze the monthly returns of a stock over a 48-month period.

We can use a simple linear regression model to explore this. In the model, we regress monthly returns, denoted as R, on an indicator variable, ESG. This indicator takes the value of 0 if the stock isn’t part of an ESG-focused fund and 1 if it is.

$$ R=b_0+b_1ESG+\varepsilon_i $$

Note that we estimate the simple linear regression in the same way as we would if the independent variable were continuous.

The intercept $b_0$ is the predicted value when the indicator variable is 0. The slope $b_1$, on the other hand, is the difference in means between the two groups formed by the indicator variable (the mean when the indicator is 1 minus the mean when it is 0).

Assume that the following table is the results of the above regression analysis:

$$ \begin{array}{c|c|c|c}
& \textbf{Estimated} & \textbf{Standard Error} & \textbf{Calculated Test} \\
& \textbf{Coefficients} & \textbf{of Coefficients} & \textbf{Statistic} \\ \hline
\text{Intercept} & 0.5468 & 0.0456 & 9.5623 \\ \hline
\text{ESG} & 1.1052 & 0.1356 & 9.9532
\end{array} $$

Additionally, we have the following information regarding the means and variances of the variables.

$$ \begin{array}{c|c|c|c}
& \textbf{Monthly returns} & \textbf{Monthly Returns} & \textbf{Difference in} \\
& \textbf{of ESG Focused} & \textbf{of Non-ESG} & \textbf{Means} \\
& \textbf{Stocks} & \textbf{Stocks} & \\ \hline
\text{Mean} & 1.6520 & 0.5468 & 1.1052 \\ \hline
\text{Variance} & 1.1052 & 0.1356 & \\ \hline
\text{Observations} & 10 & 38 &
\end{array} $$

From the above tables, we can see that:

The intercept (0.5468) equals the mean of the returns for the non-ESG stocks.
The slope coefficient (1.1052) is the difference in means of returns between ESG-focused stocks and non-ESG stocks.

Now, assume we want to test whether the slope coefficient equals 0 at a 5% significance level. Therefore, the hypothesis is $H_0:b_1=0 \text{ vs. } H_a:b_1\neq0$. Note that the degrees of freedom are $48-2=46$. As such, the critical t-values (usually read from a t-table) are $t_{46,0.025}=\pm2.013$.

From the first table above, the calculated test statistic for the slope is greater than the critical t-value $(9.9532 \gt 2.013)$. As a result, we reject the null hypothesis that the slope coefficient is equal to zero.
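
The link between the regression coefficients and the group means can be illustrated with a small sketch. The returns below are hypothetical and are not the data behind the tables above; the helper `ols` is an illustrative least-squares fit written in plain Python.

```python
def ols(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares and return (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical monthly returns (%): ESG = 1 if the stock is in the fund, else 0.
esg = [0, 0, 0, 0, 1, 1, 1]
returns = [0.4, 0.6, 0.5, 0.7, 1.5, 1.8, 1.6]

b0, b1 = ols(esg, returns)

# Group means computed directly from the data.
non_esg = [r for r, d in zip(returns, esg) if d == 0]
esg_only = [r for r, d in zip(returns, esg) if d == 1]
mean_non_esg = sum(non_esg) / len(non_esg)
mean_esg = sum(esg_only) / len(esg_only)

print(f"intercept = {b0:.4f}, mean of non-ESG returns = {mean_non_esg:.4f}")
print(f"slope     = {b1:.4f}, difference in means     = {mean_esg - mean_non_esg:.4f}")
```

With an indicator variable as the only regressor, the fitted intercept and slope coincide with the group mean and the difference in means, whatever the data.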

p-Values and Level of Significance

The p-value is the smallest level of significance at which the null hypothesis can be rejected. Therefore, the smaller the p-value, the stronger the evidence against the null hypothesis and the lower the risk that rejecting it is a type I error (rejecting a true null hypothesis). For a regression coefficient, a small p-value supports the conclusion that the coefficient is statistically significant.

Software packages commonly offer p-values for regression coefficients. These p-values help test a null hypothesis that the true parameter equals 0 versus the alternative that it’s not equal to zero.

We reject the null hypothesis if the p-value corresponding to the calculated test statistic is less than the significance level.
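
For intuition, a two-sided p-value can be approximated from a t-statistic without any statistical library. The sketch below integrates the Student's t density numerically with Simpson's rule; this is an illustrative approach chosen to keep the example self-contained, not how any particular software package computes p-values, and the function names are assumptions.

```python
import math

def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate P(|T| > |t_stat|) using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t_stat|)
    return max(0.0, 1 - 2 * central_area)

# Slope t-statistic from the inflation/unemployment regression (8 df).
p_slope = two_sided_p_value(-5.192, df=8)
# Sanity check: the two-tailed 5% critical value should give p close to 0.05.
p_check = two_sided_p_value(2.306, df=8)

alpha = 0.05
print(f"p-value for the slope: {p_slope:.4f}, reject H0: {p_slope < alpha}")
print(f"p-value at t = 2.306:  {p_check:.4f}")
```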

Example: Hypothesis Testing of Slope Coefficients

An analyst generates the following output from the regression analysis of inflation on unemployment:

$$\small{\begin{array}{llll}\hline{}& \textbf{Regression Statistics} &{}&{}\\ \hline{}& \text{R Square} & 0.7684 &{} \\ {}& \text{Standard Error} & 0.0063 &{}\\ {}& \text{Observations} & 10 &{}\\ \hline {}& & & \\ \hline{} & \textbf{Coefficients} & \textbf{Standard Error} & \textbf{t-Stat}\\ \hline \text{Intercept} & 0.0710 & 0.0094 & 7.5160 \\\text{Forecast (Slope)} & -0.9041 & 0.1755 & -5.1516\\ \hline\end{array}}$$

At the 5% significance level, test whether the slope coefficient is significantly different from one, that is,

$$ H_{0}: b_{1} = 1 \text{ vs. } H_{a}: b_{1} \neq 1 $$

Solution

The calculated t-statistic, $t=\frac{\hat{b}_{1}-B_1}{s_{\hat{b}_{1}}}$, is equal to:

$$\begin{align*} {t}= \frac{-0.9041-1}{0.1755} = -10.85\end{align*}$$

The critical two-tail t-values from the table with $n-2=8$ degrees of freedom are:

$$ {t}_{c}=\pm 2.306$$

Notice that $|t| \gt t_{c}$ (i.e., $10.85 \gt 2.306$).

Therefore, we reject the null hypothesis and conclude that the estimated slope coefficient is statistically different from one.
Note that the confidence interval approach leads to the same conclusion: the 95% confidence interval for the slope, $-0.9041 \pm 2.306 \times 0.1755 = [-1.3088, -0.4994]$, does not contain one.
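The arithmetic of this example can be reproduced with a short Python sketch. The inputs are the figures quoted above (slope estimate, standard error, and the critical t-value of 2.306 read from the t-table); the variable names are illustrative only.

```python
b1_hat = -0.9041    # estimated slope from the regression output
se_b1 = 0.1755      # standard error of the slope
b1_null = 1.0       # hypothesized slope under H0
t_critical = 2.306  # two-tailed 5% critical value, 8 degrees of freedom (from the t-table)

# Test-statistic approach: reject H0 when |t| exceeds the critical value.
t_stat = (b1_hat - b1_null) / se_b1
reject_by_t = abs(t_stat) > t_critical

# Confidence-interval approach: reject H0 when the hypothesized value lies outside the interval.
ci_low = b1_hat - t_critical * se_b1
ci_high = b1_hat + t_critical * se_b1
reject_by_ci = not (ci_low <= b1_null <= ci_high)

print(f"t = {t_stat:.2f}, reject by t-test: {reject_by_t}")
print(f"95% CI = [{ci_low:.4f}, {ci_high:.4f}], reject by CI: {reject_by_ci}")
```

Both approaches use the same estimate, standard error, and critical value, so they always agree on whether to reject the null hypothesis.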

Question 1

Samantha Lee, an investment analyst, is studying monthly stock returns. She focuses on companies listed in a Renewable Energy Index across various economic conditions. In her analysis, she performed a simple regression. This regression explains how stock returns vary concerning the indicator variable RENEW. RENEW equals 1 when there’s a positive policy change towards renewable energy during that month, and 0 if not. The total variation in the dependent variable amounted to 220.34. Of this, 94.75 is the part explained by the model. Samantha’s dataset includes 36 monthly observations.

Calculate the coefficient of determination, F-statistic, and standard deviation of monthly stock returns of companies listed in a Renewable Energy Index.

A. $R^2$ = 43.00%; F = 26.07; Standard deviation = 2.51.

B. $R^2$ = 53.00%; F = 26.41; Standard deviation = 2.55.

C. $R^2$ = 33.00%; F = 36.07; Standard deviation = 3.55.

Solution

The correct answer is A.

Coefficient of determination:

$$ R^2=\frac{\text{Explained variation}}{\text{Total variation}}=\frac{94.75}{220.34}\approx43\% $$

F-statistic:

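$$ F=\frac{\frac{\text{Explained variation}}{k} }{\frac{\text{Unexplained variation}}{n-2}}=\frac{\frac{SSR}{k}}{\frac{SSE}{n-2}} =\frac{\frac{94.75}{1}}{\frac{220.34-94.75}{34}}=\frac{94.75}{3.6938}\approx25.65 $$

With $k=1$ independent variable and $n=36$ observations, the denominator has $n-2=34$ degrees of freedom. The computed value is closest to the F-statistic given in option A.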

Standard deviation:

Note that,

$$ \text{Total Variation}= \sum_{i=1}^{n}{\left(Y_i-\bar{Y}\right)^2=220.34} $$

And the standard deviation is given by:

$$ \text{Standard deviation}=\sqrt{\frac{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}{n-1}} $$

As such,

$$ \text{Standard deviation}=\sqrt{\frac{\text{Total variation}}{n-1}}=\sqrt{\frac{220.34}{36-1}}=2.509 $$
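The three quantities in this question can be recomputed directly from the inputs given in the question stem. The sketch below uses illustrative variable names and assumes a single independent variable ($k=1$), as in the question; the printed values can be compared with the figures in the solution.

```python
import math

sst = 220.34  # total variation in the dependent variable
ssr = 94.75   # variation explained by the model
n = 36        # number of monthly observations
k = 1         # number of independent variables (the RENEW indicator)

sse = sst - ssr                           # unexplained variation
r_squared = ssr / sst                     # coefficient of determination
f_stat = (ssr / k) / (sse / (n - k - 1))  # mean square regression / mean square error
std_dev = math.sqrt(sst / (n - 1))        # sample standard deviation of the returns

print(f"R^2 = {r_squared:.4f}")
print(f"F = {f_stat:.2f}")
print(f"Standard deviation = {std_dev:.3f}")
```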

Question 2

Neeth Shinu, CFA, is forecasting the price elasticity of supply for a specific product. Shinu uses the quantity of the product supplied for the past 5 months as the dependent variable and the price per unit of the product as the independent variable. The regression results are shown below.

$$\small{\begin{array}{lccccc}\hline \textbf{Regression Statistics} & & & & & \\ \hline \text{R Square} & 0.9941 & & & \\ \text{Standard Error} & 3.6515 & & & \\ \text{Observations} & 5 & & & \\ \hline {}& \textbf{Coefficients} & \textbf{Standard Error} & \textbf{t Stat} & \textbf{P-value}\\ \hline\text{Intercept} & -159 & 10.520 & (15.114) & 0.001\\ \text{Slope} & 0.26 & 0.012 & 22.517 & 0.000\\ \hline\end{array}}$$

Which of the following most likely reports the correct value of the t-statistic for the slope and most accurately evaluates its statistical significance with 95% confidence?

A. $t=21.67$; the slope is significantly different from zero.

B. $t=3.18$; the slope is significantly different from zero.

C. $t=22.57$; the slope is not significantly different from zero.

Solution

The correct answer is A.

The t-statistic is calculated using the formula:

$$\text{t}=\frac{\hat{b}_{1}-b_1}{\hat{S}_{b_{1}}}$$

Where:

$b_{1}$ = True slope coefficient.

$\hat{b}_{1}$ = Point estimator for $b_{1}$.

$\hat{S}_{b_{1}}$ = Standard error of the regression coefficient.

$$\begin{align*} {t}=\frac{0.26-0}{0.012}=21.67\end{align*}$$

The critical two-tail t-values from the t-table with $n-2 = 3$ degrees of freedom are:

$$t_{c}= \pm 3.18 $$

Notice that $|t| \gt t_{c}$ (i.e., $21.67 \gt 3.18$).

Therefore, the null hypothesis can be rejected. Further, we can conclude that the estimated slope coefficient is statistically different from zero.

The post Measures of Fit and Hypothesis Tests of Regression Coefficients appeared first on AnalystPrep | CFA® Exam Study Notes.

Assumptions Underlying Linear Regression

Kajal — Wed, 16 Aug 2023 11:18:55 +0000

Assume that we have samples of size $n$ for dependent variable $Y$ and independent variable $X$. We wish to estimate the simple regression of $Y$ on $X$. The classic normal linear regression model assumptions are as follows:

Linearity: A linear relationship implies that the change in $Y$ due to a one-unit change in $X$ is constant, regardless of the value $X$ takes. If the relationship between the two is not linear, the regression model will not accurately capture the trend, resulting in inaccurate predictions. The model will be biased and underestimate or overestimate $Y$ at various points. For example, the model $Y=b_0+b_1e^{b_1x}$ is nonlinear in $b_1$. For this reason, we should not attempt to fit a linear model between $X$ and $Y$. It also follows that the independent variable, $X$, must be non-stochastic (must not be random). A random independent variable rules out a linear relationship between the dependent and independent variables. In addition, linearity means the residuals should not exhibit an observable pattern when plotted against the independent variable. Instead, they should be completely random. In the example below, we’re looking at a scenario where the residuals appear to show a pattern when plotted against the independent variable, $X$. This effectively indicates a nonlinear relation.
Normality Assumption: This assumption implies that the error terms (residuals) must follow a normal distribution. It’s important to note that this doesn’t mean the dependent and independent variables must be normally distributed. However, checking the distribution of the dependent and independent variables is crucial to identify any outliers. A histogram of the residuals can be used to detect if the error term is normally distributed. A symmetric bell-shaped histogram indicates that the normality assumption is likely to be true.
Homoskedasticity: Homoskedasticity implies that the variance of the error terms is constant across all observations. Mathematically, this is expressed as:$$ E\left(\epsilon_i^2\right)=\sigma_\epsilon^2,\ \ i=1,2,\ldots,n $$If the variance of the residuals varies across observations, we refer to this as heteroskedasticity (also spelled heteroscedasticity). To test for heteroskedasticity, we plot the least squares residuals against the independent variable. An evident pattern in the plot, such as residuals whose spread increases with the predicted values, is a manifestation of heteroskedasticity.

Independence Assumption: The independence assumption implies that the pairs of observations $(X_i, Y_i)$ are independent of one another. If this assumption fails, the residuals will be correlated across observations. To check this assumption, we analyze the residuals visually and statistically to see whether they exhibit a pattern.
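The residual checks described in these assumptions start from the residuals themselves. The sketch below is an informal illustration only: it assumes the unemployment and inflation data used in this reading's regression example, fits the least squares line, lists the residuals in order of $X$ so that any pattern can be eyeballed, and compares the average squared residual in the lower and upper halves of $X$. The half-split comparison is a rough device chosen for this sketch, not a formal test of heteroskedasticity.

```python
unemployment = [6.1, 7.4, 6.2, 6.2, 5.7, 5.0, 4.2, 4.2, 4.0, 3.9]  # X, in %
inflation = [1.7, 1.2, 1.3, 1.3, 1.4, 1.8, 3.3, 3.1, 4.7, 3.6]     # Y, in %

n = len(unemployment)
x_bar = sum(unemployment) / n
y_bar = sum(inflation) / n

# Least squares estimates of the slope and intercept.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(unemployment, inflation))
         / sum((x - x_bar) ** 2 for x in unemployment))
intercept = y_bar - slope * x_bar

# Residual = observed Y minus fitted Y.
residuals = [y - (intercept + slope * x) for x, y in zip(unemployment, inflation)]

# Order residuals by X to look for a pattern (what a residual plot would show).
by_x = sorted(zip(unemployment, residuals))
for x, e in by_x:
    print(f"X = {x:.1f}  residual = {e:+.3f}")

# Informal spread comparison: mean squared residual for low-X vs high-X observations.
half = n // 2
low_x_msr = sum(e * e for _, e in by_x[:half]) / half
high_x_msr = sum(e * e for _, e in by_x[half:]) / (n - half)
print(f"Mean squared residual, low X: {low_x_msr:.3f}; high X: {high_x_msr:.3f}")
```

With only ten observations, any apparent pattern should be read with caution; the point is simply to show where the quantities in a residual plot come from.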

Question

A regression model with one independent variable requires several assumptions for valid conclusions. Which of the following statements most likely violates those assumptions?

A. The independent variable is random.

B. The error term is distributed normally.

C. There exists a linear relationship between the dependent variable and the independent variable.

Solution

The correct answer is A.

Linear regression assumes that the independent variable, X, is NOT random. This ensures that the model produces the correct estimates of the regression coefficients.

B is incorrect. The assumption that the error term is distributed normally allows us to easily test a particular hypothesis about a linear regression model.

C is incorrect. The assumption that the dependent and independent variables have a linear relationship is key to a valid linear regression. If the relationship between the dependent and independent variables is not linear, then estimating it with a linear model can yield invalid results.

Introduction to Linear Regression

Kajal — Wed, 16 Aug 2023 06:19:52 +0000

Linear regression is a mathematical method used for analyzing how the variation in one variable can explain the variation in another variable.

Let $Y$ be the variable we wish to explain. Denote the observations of this variable by $Y_i$ and the mean of a sample of size $n$ by $\bar{Y}$. The variation of $Y$ is given by:

$$ \text{Variation of } Y= \sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2 $$

Our main objective is to explain what causes this variation, usually called the sum of squares total (SST).

By definition of the regression, we need to explain the variation of $Y$ with another variable. Let $X$ be the explanatory variable. The observations of $X$ are denoted by $X_i$, and $\bar{X}$ is the mean of a sample of size $n$. The variation of $X$ is given by:

$$ \text{Variation of } X= \sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2 $$

To visualize the relationship between variables X and Y, you can use a scatter plot, also known as a scattergram. In this type of plot, the variable you want to explain (Y) is usually plotted on the vertical axis. In contrast, the explanatory variable (X) is placed on the horizontal axis to show the relationship between their variations.

For example, consider the following table. We wish to use linear regression analysis to forecast inflation, given unemployment data from 2011 to 2020.

$$ \begin{array}{c|c|c}
\text{Year} & \text{Unemployment Rate} & \text{Inflation Rate} \\ \hline
2011 & 6.1\% & 1.7\% \\ \hline
2012 & 7.4\% & 1.2\% \\ \hline
2013 & 6.2\% & 1.3\% \\ \hline
2014 & 6.2\% & 1.3\% \\ \hline
2015 & 5.7\% & 1.4\% \\ \hline
2016 & 5.0\% & 1.8\% \\ \hline
2017 & 4.2\% & 3.3\% \\ \hline
2018 & 4.2\% & 3.1\% \\ \hline
2019 & 4.0\% & 4.7\% \\ \hline
2020 & 3.9\% & 3.6\%
\end{array} $$

In this scenario, the $Y$ variable is the inflation rate, and the $X$ variable is the unemployment rate. A scatter plot of the inflation rates against unemployment rates from 2011 to 2020 is shown in the following figure.

Dependent and Independent Variables

A dependent variable, often denoted as $Y$ , is the variable we want to explain. In contrast, an independent variable, typically denoted as $X$ , explains variations in the dependent variable. The independent variable is also referred to as the exogenous, explanatory, or predicting variable.

In our example above, the inflation rate is the dependent variable, and the unemployment rate is the independent variable.
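As a quick numerical check of the variation formulas, the following Python sketch computes the variation of $Y$ (the sum of squares total) and the variation of $X$ for the inflation and unemployment data in the table above. The helper name `variation` is an illustrative choice.

```python
unemployment = [6.1, 7.4, 6.2, 6.2, 5.7, 5.0, 4.2, 4.2, 4.0, 3.9]  # X, in %
inflation = [1.7, 1.2, 1.3, 1.3, 1.4, 1.8, 3.3, 3.1, 4.7, 3.6]     # Y, in %


def variation(values):
    """Sum of squared deviations of the values from their sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


variation_y = variation(inflation)     # sum of squares total (SST)
variation_x = variation(unemployment)

print(f"Variation of Y (SST) = {variation_y:.3f}")
print(f"Variation of X       = {variation_x:.3f}")
```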

To understand the relationship between dependent and independent variables, we estimate a linear relationship, usually a straight line. When there’s one independent variable, we use simple linear regression. If there are multiple independent variables, we use multiple regression.

This reading focuses on linear regression.

Least Squares Criterion

In simple linear regression, we assume a linear relationship exists between the dependent and independent variables. The aim is to fit a line to the observations of $X$ (the $X_i$) and $Y$ (the $Y_i$) that minimizes the sum of squared deviations from the line. To accomplish this, we use the least squares criterion.

The following is a simple linear regression equation:

$$ Y_i=b_0+b_1X_i+\varepsilon_i,\ \ i=1,2,\ldots,n $$

Where:

$Y$ = Dependent variable.

$b_0$ = Intercept.

$b_1$ = Slope coefficient.

$X$ = Independent variable.

$\varepsilon$ = Error term (Noise).

$b_0$ and $b_1$ are known as regression coefficients. The equation above implies that the dependent variable is equal to the intercept $(b_0)$ plus the product of the slope coefficient $(b_1)$ and the independent variable, plus the error term.

The error term is equal to the difference between the observed value of $Y$ and the value expected from the underlying population relation between $X$ and $Y$.

Below is an illustration of a simple linear regression model.

As stated earlier, linear regression calculates a line that best fits the observations. In the following image, the line that best fits the regression is clearly the blue one:

Note that we cannot directly observe the population parameters $b_0$ and $b_1$. As such, we observe their estimates, ${\hat{b}}_0$ and ${\hat{b}}_1$. They are the estimated parameters of the population using a sample. In simple linear regression, ${\hat{b}}_0$ and ${\hat{b}}_1$ are such that the sum of squared vertical distances is minimized.

Specifically, we concentrate on the sum of the squared differences between observations $Y_i$ and the respective estimated value ${\hat{Y}}_i$ on the regression line, also called the sum of squares error (SSE).

Note that,

$$ {\hat{Y}}_i={\hat{b}}_0+{\hat{b}}_1X_i $$

As such,

$$ SSE=\sum_{i=1}^{n}\left(Y_i-{\hat{b}}_0-{\hat{b}}_1X_i\right)^2=\sum_{i=1}^n{(Y_i-\hat Y_i)}^2=\sum_{i=1}^n e_i^2 $$

Note that the residual for the ith observation $(e_i=Y_i-\hat{Y}_i)$ is different from the error term $(\varepsilon_i)$. The error term is based on the underlying population, while the residual term results from regression analysis on a sample.

By construction, the least squares residuals sum to zero when the model includes an intercept, so their simple sum says nothing about how well the line fits. As such, the aim in simple linear regression is to fit the line that minimizes the sum of squared residuals.
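The least squares criterion can be illustrated numerically. The sketch below assumes the unemployment and inflation data from the example table earlier in this reading, computes the SSE at the least squares estimates (using the closed-form slope and intercept formulas presented in the next section), and then nudges the intercept and slope by small, arbitrarily chosen amounts to show that each nudge increases the SSE.

```python
unemployment = [6.1, 7.4, 6.2, 6.2, 5.7, 5.0, 4.2, 4.2, 4.0, 3.9]  # X, in %
inflation = [1.7, 1.2, 1.3, 1.3, 1.4, 1.8, 3.3, 3.1, 4.7, 3.6]     # Y, in %


def sse(b0, b1, xs, ys):
    """Sum of squared vertical distances between the observations and the line b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


n = len(unemployment)
x_bar = sum(unemployment) / n
y_bar = sum(inflation) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(unemployment, inflation))
         / sum((x - x_bar) ** 2 for x in unemployment))
intercept = y_bar - slope * x_bar

sse_fitted = sse(intercept, slope, unemployment, inflation)
print(f"SSE at least squares estimates: {sse_fitted:.4f}")

# Hypothetical nudges (d_intercept, d_slope); every alternative line should have a larger SSE.
nudges = [(0.2, 0.0), (-0.2, 0.0), (0.0, 0.05), (0.0, -0.05)]
sse_nudged = [sse(intercept + d0, slope + d1, unemployment, inflation) for d0, d1 in nudges]
for (d0, d1), value in zip(nudges, sse_nudged):
    print(f"intercept {d0:+.2f}, slope {d1:+.2f}: SSE = {value:.4f}")
```

Checking a handful of nearby lines is not a proof of optimality, but it makes concrete what "minimizing the sum of squared residuals" means.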

Estimation and Interpretation of Regression Coefficients

The Slope Coefficient ${\hat{{\beta}}}_{1}$

For a simple linear regression, the slope coefficient is estimated as the ratio of the $Cov(X, Y)$ and $Var (X)$:

$$
{\hat{b}}_1=\frac{Cov\left(X,Y\right)}{Var\left(X\right)}=\frac{\frac{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)\left(X_i-\bar{X}\right)}{n-1}}{\frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}{n-1}}=\frac{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)\left(X_i-\bar{X}\right)}{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2} $$

The slope coefficient is defined as the change in the dependent variable caused by a one-unit change in the value of the independent variable.

The Intercept ${\hat{{\beta}}}_{0}$

The intercept is estimated using the mean of $X$ and $Y$ as follows:

$$ {\hat{b}}_0=\bar{Y}-{\hat{b}}_1\bar{X} $$

Where:

$\bar{Y}$ = Mean of $Y$.

$\bar{X}$ = Mean of $X$.

The intercept is the estimated value of the dependent variable when the independent variable is zero. The fitted regression line passes through the point equivalent to the means of the dependent and the independent variables in a linear regression model.

Example: Estimating Regression Line

Let us consider the following table. We wish to estimate a regression line to forecast inflation, given unemployment data from 2011 to 2020.

$$ \begin{array}{c|c|c}
\text{Year} & {\text{Unemployment Rate}\% \ (X_is)} & {\text{Inflation Rate}\% \ (Y_is)} \\ \hline
2011 & 6.1 & 1.7 \\ \hline
2012 & 7.4 & 1.2 \\ \hline
2013 & 6.2 & 1.3 \\ \hline
2014 & 6.2 & 1.3 \\ \hline
2015 & 5.7 & 1.4 \\ \hline
2016 & 5.0 & 1.8 \\ \hline
2017 & 4.2 & 3.3 \\ \hline
2018 & 4.2 & 3.1 \\ \hline
2019 & 4.0 & 4.7 \\ \hline
2020 & 3.9 & 3.6
\end{array} $$

We can create the following table:

$$ \begin{array}{c|c|c|c|c|c}
\text{Year} & \text{Unemployment} & \text{Inflation} & \left(Y_i-\bar{Y}\right)^2 & \left(X_i-\bar{X}\right)^2 & (Y_i-\bar{Y}) \\
& {\text{Rate}\% \ (X_is)} & { \text{Rate}\% \ (Y_is)} & & & (X_i-\bar{X}) \\ \hline
2011 & 6.1 & 1.7 & 0.410 & 0.656 & -0.518 \\ \hline
2012 & 7.4 & 1.2 & 1.300 & 4.452 & -2.405 \\ \hline
2013 & 6.2 & 1.3 & 1.082 & 0.828 & -0.946 \\ \hline
2014 & 6.2 & 1.3 & 1.082 & 0.828 & -0.946 \\ \hline
2015 & 5.7 & 1.4 & 0.884 & 0.168 & -0.385 \\ \hline
2016 & 5.0 & 1.8 & 0.292 & 0.084 & 0.157 \\ \hline
2017 & 4.2 & 3.3 & 0.922 & 1.188 & -1.046 \\ \hline
2018 & 4.2 & 3.1 & 0.578 & 1.188 & -0.828 \\ \hline
2019 & 4.0 & 4.7 & 5.570 & 1.664 & -3.044 \\ \hline
2020 & 3.9 & 3.6 & 1.588 & 1.932 & -1.751 \\ \hline
\textbf{Sum} & \bf{52.90} & \bf{ 23.4} & \bf{13.704} & \bf{12.989} & \bf{-11.716} \\ \hline
\textbf{Arithmetic} & \bf{5.29} & \bf{2.34} & & & \\
\textbf{Mean}
\end{array} $$

From the table above, we estimate the regression coefficients:

$$ \begin{align*} {\hat{b}}_1 & =\frac{Cov\left(X,Y\right)}{Var\left(X\right)}=\frac{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)\left(X_i-\bar{X}\right)}{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}=\frac{-11.716}{12.989}=-0.9020 \\ {\hat{b}}_0 & =\bar{Y}-{\hat{b}}_1\bar{X}=2.34-(-0.9020)\times5.29=7.112 \end{align*} $$

As such, the regression model is given by:

$$ \hat{Y}_i=7.112-0.9020X_i $$

From the above regression model, we can note the following:

The inflation rate is 7.112% if the unemployment rate is 0% (theoretically speaking).
If the unemployment rate increases (decreases) by one unit, say, from 2% to 3%, the inflation rate decreases (increases) by 0.9020 percentage points.

In general,

If the slope is positive, a one-unit increase (decrease) in the independent variable results in an increase (decrease) in the dependent variable.
If the slope is negative, a one-unit increase (decrease) in the independent variable results in a decrease (increase) in the dependent variable.

Furthermore, with the estimated regression model, we can predict the values of the dependent variable based on the value of the independent variable. For instance, if the unemployment rate is 4.5%, then the predicted value of the dependent variable is:

$$ \hat{Y}=7.112-0.9020\times4.5=3.05\% $$
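The estimation and prediction steps in this example can be reproduced with a short Python sketch. The function name `fit_simple_regression` is an illustrative choice; it applies the slope and intercept formulas given above to the unemployment and inflation data and then predicts inflation at an unemployment rate of 4.5%.

```python
unemployment = [6.1, 7.4, 6.2, 6.2, 5.7, 5.0, 4.2, 4.2, 4.0, 3.9]  # X, in %
inflation = [1.7, 1.2, 1.3, 1.3, 1.4, 1.8, 3.3, 3.1, 4.7, 3.6]     # Y, in %


def fit_simple_regression(xs, ys):
    """Return (intercept, slope) of the least squares line for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # cross-product sum
    s_xx = sum((x - x_bar) ** 2 for x in xs)                       # variation of X
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_simple_regression(unemployment, inflation)
predicted = intercept + slope * 4.5  # predicted inflation at 4.5% unemployment

print(f"Intercept = {intercept:.3f}, slope = {slope:.4f}")
print(f"Predicted inflation at 4.5% unemployment = {predicted:.2f}%")
```

Note that the $(n-1)$ terms in the covariance and variance cancel, so the function works directly with the sums of cross-products and squared deviations.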

In practice, analysts perform regression analysis using statistical functions in software like Excel, statistical tools like R, or programming languages such as Python.

Cross-sectional and Time Series Regressions

Regression analysis is commonly used with cross-sectional and time series data. In cross-sectional analysis, you compare X and Y observations from different entities, like various companies in the same time period. For instance, you might analyze the link between a company’s R&D spending and stock returns across multiple firms in a year.

Time-series regression analysis involves using data from various time periods for the same entity, like a company or an asset class. For instance, an analyst might examine how a company’s quarterly dividend payouts relate to its stock price over multiple years.

Question

The independent variable in a regression model is most likely the:

A. Predicted variable.

B. Predicting variable.

C. Endogenous variable.

Solution

The correct answer is B.

An independent variable explains the variation of the dependent variable. It is also called the explanatory variable, exogenous variable, or the predicting variable.

A and C are incorrect. A dependent variable is a variable predicted by the independent variable. It is also known as the predicted variable, explained variable, or endogenous variable.

Applications of Big Data and Data Science

Kajal — Wed, 16 Aug 2023 05:45:02 +0000

Data science is an interdisciplinary field that uses developments in computer science, statistics, and other fields to extract information from Big Data or data in general.

Data Processing Methods

Data analysts and scientists in big data analysis use different data management approaches. They consist of capture, curation, storage, search, and transfer.

Capture: Describes the method by which data is gathered and put into a form that the analytical process may use.
Curation: Data curation ensures the quality and accuracy of the data through a data cleaning exercise. This procedure uncovers data errors and adjusts for missing data.
Storage: The process of recording, archiving, and accessing data, as well as the design of the underlying database.
Search: Involves querying data to locate specific information. With big data, sophisticated techniques are necessary to efficiently retrieve the requested data content.
Transfer: Describes the process of transferring data from the underlying data source or storage place to the underlying analytical instrument.
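The five steps above can be sketched end to end with Python's standard library. Everything in this example is hypothetical: the CSV content, the tickers, the table name `returns`, and the choice to handle the missing value by dropping the record (one simple option among several for adjusting for missing data). It is meant only to show how the steps fit together, using an in-memory SQLite database for storage.

```python
import csv
import io
import sqlite3

# Capture: read raw records from a (hypothetical) CSV feed with one missing return.
raw_csv = """ticker,month,return_pct
AAA,2023-01,1.2
AAA,2023-02,
BBB,2023-01,-0.4
BBB,2023-02,0.9
"""
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Curation: convert types and handle missing data (here, records with no return are dropped).
clean = [
    (r["ticker"], r["month"], float(r["return_pct"]))
    for r in rows
    if r["return_pct"].strip()
]

# Storage: record the curated data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (ticker TEXT, month TEXT, return_pct REAL)")
conn.executemany("INSERT INTO returns VALUES (?, ?, ?)", clean)
conn.commit()

# Search: query the stored data for one ticker.
result = conn.execute(
    "SELECT month, return_pct FROM returns WHERE ticker = ? ORDER BY month",
    ("BBB",),
).fetchall()

# Transfer: hand the query result to the analysis step as a plain Python list.
bbb_returns = [ret for _, ret in result]
conn.close()

print(f"Curated records: {len(clean)} of {len(rows)}")
print(f"BBB returns passed to analysis: {bbb_returns}")
```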

Data Visualization

Visualization encompasses data formatting, display, and summarization through graphical representations. Tables, charts, and trends are commonly used for traditional structured data, while non-traditional unstructured data demands innovative techniques like interactive three-dimensional (3D) graphics, tag clouds, and mind maps.

Fintech is applied in investment management, including text analytics, natural language processing, risk assessment, and algorithmic trading.

Text Analytics and Natural Language

Text analytics employs computer programs to analyze and extract insights, primarily from unstructured text- or voice-based datasets like company filings, written reports, quarterly earnings calls, and social media content. Text analytics can be utilized in predictive analysis to identify potential indicators of future performance, such as consumer sentiment.

Natural language processing (NLP) is an area of study that involves creating computer programs to decipher and analyze human language. Essentially, NLP combines computer science, AI, and linguistics.

Translation, speech recognition, text mining, sentiment analysis, and topic analysis are examples of automated tasks that use NLP. Annual reports, call transcripts, news articles, social media posts, and other text- and audio-based data may all be analyzed using natural language processing (NLP), allowing NLP to discover trends more quickly and accurately than is humanly possible.

Using natural language processing data, earnings projections for a company’s near-term prospects can be created. X (formerly Twitter) sentiments have also been used to gauge an initial public offering (IPO) success.
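To make the idea of sentiment analysis concrete, here is a deliberately simple, lexicon-based sketch in Python. The word lists, the headlines, and the scoring rule are all invented for illustration; production NLP systems rely on far richer models, so this should be read as a toy example of turning text into a numeric signal rather than as a description of how any real system works.

```python
import re

# Hypothetical word lists; a real lexicon would be much larger and domain-specific.
POSITIVE = {"beat", "growth", "strong", "record", "upgrade"}
NEGATIVE = {"miss", "decline", "weak", "loss", "downgrade"}


def sentiment_score(text: str) -> float:
    """Score in [-1, 1]: (positive hits - negative hits) / total hits, or 0 if no hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Made-up headlines used only to exercise the function.
headlines = [
    "Strong quarter: earnings beat estimates on record growth",
    "Analysts downgrade stock after weak guidance and revenue miss",
    "Company schedules annual meeting",
]
scores = [sentiment_score(h) for h in headlines]

for headline, score in zip(headlines, scores):
    print(f"{score:+.2f}  {headline}")
```

A score near +1 flags mostly positive wording, a score near -1 mostly negative wording, and 0 means the text contains no words from either list or an equal number from each.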

Python, R, and Excel VBA are frequently used programming languages, whereas SQL, SQLite, and NoSQL are prominent database technologies.

Question

Which of the five data processing methods refers to the process of ensuring data quality and accuracy through a data cleaning exercise?

A. Data search.

B. Data storage.

C. Data curation.

The correct answer is C.

Data curation refers to the process of ensuring data quality and accuracy through a data cleaning exercise. It involves uncovering data errors and adjusting for missing data.

A is incorrect. Data search refers to how to query data. Big data requires advanced techniques to locate requested data content.

B is incorrect. Data storage refers to how the data will be recorded, archived, and accessed and the underlying database design.
