R-squared \(\bf{(R^2)}\) measures how well an estimated regression fits the data. It is also known as the coefficient of determination and can be formulated as:
$$ R^2=\frac{\text{Sum of squares regression}}{\text{Sum of squares total}}=\frac{\sum_{i=1}^{n}{(\widehat{Y}_i-\bar{Y})}^2}{\sum_{i=1}^{n}{(Y_i-\bar{Y})}^2} $$
Where:
\(n\) = Number of observations.
\(Y_i\) = Observed values of the dependent variable.
\(\widehat{Y}_i\) = Predicted values of the dependent variable given the independent variables.
\(\bar{Y}\) = Mean of the dependent variable.
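To make the formula concrete, here is a minimal sketch in Python (the function name `r_squared` and the use of NumPy are illustrative assumptions, not from the original text). It computes \(R^2\) as the ratio of explained to total variation, which equals the conventional \(R^2\) when the fitted values come from an OLS regression that includes an intercept:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: explained variation / total variation.

    Assumes y_hat are fitted values from an OLS regression with an
    intercept, so that the ratio below equals the conventional R^2.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y_bar = y.mean()
    rss = np.sum((y_hat - y_bar) ** 2)  # sum of squares regression
    sst = np.sum((y - y_bar) ** 2)      # sum of squares total
    return rss / sst
```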
As independent variables are added to a regression, \(R^2\) either increases or remains constant; it never decreases. Consequently, \(R^2\) cannot reliably measure a model's goodness of fit, since it cannot signal whether a newly added independent variable actually improves the model.

An overfitted regression model is one with too many independent variables relative to the number of observations in the sample. Overfitting may produce coefficient estimates that do not reflect the true relationship between the independent and dependent variables.

Multiple regression software packages usually report an adjusted \(R^2\) \(\left(\bar{R}^2\right)\) as an alternative measure of goodness of fit. Adjusted \(R^2\) is preferable because it adjusts for degrees of freedom and, as a result, does not automatically increase when more independent variables are included.
$$ \bar{R}^2=1-\left[\cfrac{\frac{\text{Sum of squares error}}{n-k-1}}{\frac{\text{Sum of squares total}}{n-1}}\right] $$

Where \(k\) is the number of independent variables.

Therefore, the relationship between \(\bar{R}^2\) and \(R^2\) can be derived mathematically as follows:
$$ \bar{R}^2=1-\left[\left(\frac{n-1}{n-k-1}\right)\left(1-R^2\right)\right] $$
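As a quick illustration of this relationship (the helper name `adjusted_r_squared` is an assumption for this sketch), the adjustment is a one-line transformation of \(R^2\) using \(n\) and \(k\):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 given R^2, n observations, and k independent
    variables; penalizes the loss of degrees of freedom."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)
```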
Note that:

- Whenever \(k \geq 1\), adjusted \(R^2\) is less than or equal to \(R^2\).
- Adjusted \(R^2\) can be negative, whereas \(R^2\) is always non-negative.

When including a new variable in the regression, the following should be taken into consideration:

- Adjusted \(R^2\) increases only if the new variable improves the model fit sufficiently; otherwise, it decreases.
- \(R^2\), by contrast, never decreases when a variable is added, so it cannot indicate whether the addition is worthwhile.
One of the outputs of multiple regression is the ANOVA (analysis of variance) table. The following shows the general structure of an ANOVA table.
$$ \begin{array}{c|c|c|c} \textbf{ANOVA} & \textbf{Df (degrees} & \textbf{SS (Sum of squares)} & \textbf{MSS (Mean sum} \\ & \textbf{of freedom)} & & \textbf{of squares)}\\ \hline \text{Regression} & k & \text{RSS} & MSR \\ & & \text{(Explained variation)} & \\ \hline \text{Residual} & n-(k+1) & \text{SSE} & MSE \\ & & \text{(Unexplained variation)} & \\ \hline \text{Total} & n-1 & \text{SST} & \\ & & \text{(Total variation) } & \end{array} $$
We can use the information in an ANOVA table to determine \(R^2\), the F-statistic, and the standard error of estimate (SEE), as expressed below:
$$ R^2=\frac{RSS}{SST} $$
$$ F=\frac{MSR}{MSE} $$
$$ SEE=\sqrt{MSE} $$
Where:
$$ \begin{align*} MSR & =\frac{RSS}{k} \\ MSE & =\frac{SSE}{n-k-1} \end{align*} $$
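The sketch below (the function name `anova_stats` is illustrative) derives all three statistics from the ANOVA quantities defined above:

```python
import math

def anova_stats(rss, sse, n, k):
    """Derive R^2, the F-statistic, and SEE from an ANOVA table.

    rss: regression (explained) sum of squares
    sse: residual (unexplained) sum of squares
    n:   number of observations
    k:   number of independent variables
    """
    sst = rss + sse          # total variation
    msr = rss / k            # mean square regression
    mse = sse / (n - k - 1)  # mean square error
    return {"R2": rss / sst, "F": msr / mse, "SEE": math.sqrt(mse)}
```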
Consider the following regression results generated from a multiple regression analysis of the price of the US Dollar Index (USDX) on the inflation rate and the real interest rate.
$$ \begin{array}{cccc} \text{ANOVA} & & & \\ \hline & \text{df} & \text{SS} & \text{Significance F} \\ \hline \text{Regression} & 2 & 432.2520 & 0.0179 \\ \text{Residual} & 7 & 200.6349 & \\ \text{Total} & 9 & 632.8869 & \\ \hline \\ & \text{Coefficients} & \text{Standard Error} & \\ \hline \text{Intercept} & 81 & 7.9659 & \\ \text{Inflation rate} & -276 & 233.0748 & \\ \text{Real interest rate} & 902 & 279.6949 & \\ \hline \end{array} $$
Given the above information, the regression equation can be expressed as:
$$ P=81-276\,INF+902\,IR $$
Where:
\(P\) = Price of USDX.
\(INF\) = Inflation rate.
\(IR\) = Real interest rate.
\(R^2\) and adjusted \(R^2\) can also be calculated as follows:
$$ \begin{align*} R^2 & =\frac{RSS}{SST}=\frac{432.2520}{632.8869}=0.6830=68.30\% \\ \\ \text{Adjusted } R^2 & =1-\left(\frac{n-1}{n-k-1}\right)\left(1-R^2\right)=1-\frac{10-1}{10-2-1}\left(1-0.6830\right) \\ & =0.5924 = 59.24\% \end{align*} $$
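As a check, these figures can be reproduced in a few lines of Python using only the numbers from the ANOVA table above:

```python
rss, sse, n, k = 432.2520, 200.6349, 10, 2
sst = rss + sse                                # 632.8869, matches the table

r2 = rss / sst
adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)

print(f"R^2          = {r2:.4f}")      # 0.6830
print(f"Adjusted R^2 = {adj_r2:.4f}")  # 0.5924
```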
It’s important to note the following:

- Adjusted \(R^2\) (59.24%) is lower than \(R^2\) (68.30%), as expected whenever \(k \geq 1\).
- \(R^2\) indicates that the inflation rate and the real interest rate jointly explain about 68.30% of the variation in the price of the USDX.
Question
Which of the following statements about adjusted \(R^2\) is most appropriate?

- A. It is always positive.
- B. It may or may not increase when one adds an independent variable.
- C. It is non-decreasing in the number of independent variables.
Solution
The correct answer is B.
The value of the adjusted \(R^2\) increases only when the added independent variables improve the fit of the regression model. Moreover, it decreases when the added variables do not improve the model fit sufficiently.
A is incorrect: The adjusted \(R^2\) can be negative if \(R^2\) is low enough. Multiple \(R^2\), however, is always non-negative.
C is incorrect: The adjusted \(R^2\) can decrease when the added variables do not improve the model fit sufficiently. Multiple \(R^2\), however, is non-decreasing in the number of independent variables. For this reason, \(R^2\) is a less reliable measure of goodness of fit in a regression with more than one independent variable than in a regression with a single independent variable.