After completing this reading you should be able to:

- Explain how regression analysis in econometrics measures the relationship between dependent and independent variables.
- Interpret a population regression function, regression coefficients, parameters, slope, intercept, and the error term.
- Interpret a sample regression function, regression coefficients, parameters, slope, intercept, and the error term.
- Describe the key properties of a linear regression.
- Define an ordinary least squares (OLS) regression and calculate the intercept and slope of the regression.
- Describe the method and three key assumptions of OLS for estimation of parameters.
- Summarize the benefits of using OLS estimators.
- Describe the properties of OLS estimators and their sampling distributions, and explain the properties of consistent estimators in general.
- Interpret the explained sum of squares, the total sum of squares, the residual sum of squares, the standard error of the regression, and the regression \({ R }^{ 2 }\).
- Interpret the results of an OLS regression.

## The Linear Regression Model

Consider the situation of an elementary school director who must decide whether to hire more teachers in order to cut class sizes, so that students can receive more individualized attention from their teachers. How will cutting the size of classes affect the performance of students?

The question we should be asking ourselves is: how is reducing the average class size by a certain amount, say, two students, going to affect the students’ standardized test scores? This can be formulated as a mathematical equation, as shown:

$$ { \beta }_{ class\quad size }=\frac { Change\quad in\quad test\quad score }{ Change\quad in\quad class\quad size } =\frac { \Delta Test\quad Score }{ \Delta Class\quad size } $$

The Greek letter beta with subscript \(class\quad size\) denotes the effect of changing the size of the class on test scores. This equation defines the slope of a straight line relating \(test\quad scores\) to \(class\quad size\).

This equation could be rearranged such that:

$$ \Delta Test\quad Score = { \beta }_{ class\quad size } \times \Delta Class\quad size $$

The following is the straight line equation:

$$ Test\quad Score={ \beta }_{ 0 }+{ \beta }_{ class\quad size }\times Class\quad size $$

The intercept of this straight line is \({ \beta }_{ 0 }\). For the equation to accommodate everything else that can influence the performance of the students, a term representing all factors other than the size of classes, called the error term, must be added. The equation is then written as follows:

$$ Test\quad Score={ \beta }_{ 0 }+{ \beta }_{ class\quad size }\times Class\quad size+Other\quad Factors $$

Now let us change the scope of this discussion. Assume that we are dealing with a sample of \(n\) schools. In the \(i\)th school, the average test score is \({ Y }_{ i }\). Further, assume that all the other factors are denoted by \({ U }_{ i }\), and the average size of the classes denoted as \({ X }_{ i }\). Then the above equation can be generally rewritten in the following manner:

$$ { Y }_{ i }={ \beta }_{ 0 }+{ \beta }_{ 1 }{ X }_{ i }+{ U }_{ i } $$

This is the equation of a linear regression model having one regressor. \(Y\) and \(X\) are the dependent variable and the independent variable or the regressor, respectively.

The population regression line is given as \({ \beta }_{ 0 }+{ \beta }_{ 1 }{ X }_{ i }\). It is also referred to as the population regression function. The coefficients of the population regression line/function are the intercept \({ \beta }_{ 0 }\) and slope \({ \beta }_{ 1 }\). Finally, the error term is denoted by \({ U }_{ i }\).
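The model above can be sketched numerically. In this minimal simulation, the coefficient values (an intercept of 90 and a slope of -0.3) and the spread of the error term are illustrative assumptions, not estimates from real data:

```python
import random

# Simulate the linear regression model Y_i = beta_0 + beta_1 * X_i + U_i.
# Coefficient values and distributions are assumptions for illustration.
random.seed(42)

beta_0, beta_1 = 90.0, -0.3                                   # intercept and slope
class_sizes = [random.uniform(15, 30) for _ in range(500)]    # X_i, the regressor
other_factors = [random.gauss(0, 2) for _ in range(500)]      # U_i, the error term
test_scores = [beta_0 + beta_1 * x + u
               for x, u in zip(class_sizes, other_factors)]   # Y_i, the dependent variable
```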

## Estimating the Coefficients of the Linear Regression Model

The population value of \({ \beta }_{ class\quad size }\) is unknown, but it can be estimated from a sample of data. If a straight line were drawn through the data, its slope would estimate \({ \beta }_{ class\quad size }\). The challenge is choosing among the many possible lines through the plotted points. Applying the OLS estimator, which chooses the line producing the least squares fit to the data, is the most popular approach.

## The Ordinary Least Squares Estimator

The regression coefficients chosen by the OLS estimator are such that the regression line is as close as possible to the observed data, where closeness is measured by the sum of squared mistakes made in predicting \(Y\) given \(X\). The least squares estimator of the population mean, \(E\left( Y \right) \), is the sample average, \(\bar { Y } \). The total squared estimation mistakes:

$$ \sum _{ i=1 }^{ n }{ { \left( { Y }_{ i }-m \right) }^{ 2 } } $$

are minimized by \(\bar { Y } \), among all possible estimators \(m\).

Assume that \({ \beta }_{ 0 }\) and \({ \beta }_{ 1 }\) have estimators given as \({ b }_{ 0 }\) and \({ b }_{ 1 }\) respectively. The following is the equation of the regression line:

$$ { b }_{ 0 } + { b }_{ 1 }X $$

The value of \({ Y }_{ i }\) predicted by this line is:

$$ { b }_{ 0 }+{ b }_{ 1 }{ X }_{ i } $$

Therefore, the mistake made in predicting the \(i\)th observation is:

$$ { Y }_{ i }-\left( { b }_{ 0 }+{ b }_{ 1 }{ X }_{ i } \right) ={ Y }_{ i }-{ b }_{ 0 }-{ b }_{ 1 }{ X }_{ i } $$

These squared prediction mistakes are summed over the \(n\) observations:

$$ \sum _{ i=1 }^{ n }{ \left( { Y }_{ i }-{ b }_{ 0 }-{ b }_{ 1 }{ X }_{ i } \right) }^{ 2 }\quad \quad \quad \quad \quad I $$

The ordinary least squares (OLS) estimators of \({ \beta }_{ 0 }\) and \({ \beta }_{ 1 }\) are the estimators of the intercept and the slope minimizing the sum of squared mistakes. The straight line constructed by applying the OLS estimators \({ \hat { \beta } }_{ 0 }+{ \hat { \beta } }_{ 1 }X\), is called the OLS regression line.

The following is the equation of the predicted value \({ \hat { Y } }_{ i }\) of \({ Y }_{ i }\) given \({ X }_{ i }\):

$$ { \hat { Y } }_{ i }={ \hat { \beta } }_{ 0 }+{ \hat { \beta } }_{ 1 }{ X }_{ i } $$

The difference between \({ Y }_{ i }\) and its predicted value is the residual of the \(i\)th observation, as shown below:

$$ { \hat { u } }_{ i }={ Y }_{ i }-{ \hat { Y } }_{ i } $$

The OLS estimators \({ \hat { \beta } }_{ 0 }\) and \({ \hat { \beta } }_{ 1 }\) are the values of \({ b }_{ 0 }\) and \({ b }_{ 1 }\) that minimize the total squared mistakes in \(I\) above. Conceptually, different values of \({ b }_{ 0 }\) and \({ b }_{ 1 }\) could be tried out repeatedly, but in practice explicit formulas give the minimizers directly.
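Minimizing the sum of squared mistakes has the standard closed-form solution: \({ \hat { \beta } }_{ 1 }\) is the sample covariance of \(X\) and \(Y\) divided by the sample variance of \(X\), and \({ \hat { \beta } }_{ 0 }=\bar { Y } -{ \hat { \beta } }_{ 1 }\bar { X } \). A minimal Python sketch on illustrative data:

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared mistakes."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sample covariance of X and Y over sample variance of X.
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    # Intercept: the fitted line passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Sanity check: data lying exactly on the line Y = 2 + 3X is recovered.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * xi for xi in x]
b0, b1 = ols_fit(x, y)
print(b0, b1)  # → 2.0 3.0
```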

OLS regression requires three key assumptions:

- The expected value of the error term, conditional on the independent variable, is zero: \(E\left( { u }_{ i }|{ X }_{ i } \right) =0\).
- All (X, Y) observations are independent and identically distributed (i.i.d.).
- There are no large outliers in the data, which would have the potential to create misleading regression results.

Other assumptions include:

- The error term is normally distributed;
- A linear relationship exists between the independent and dependent variables;
- The independent variable is uncorrelated with the error terms; and
- There are no omitted variables.

## OLS Estimates of the Relationship between Test Scores and the Student-Teacher Ratio

Assume we want to estimate the line relating test scores to the student-teacher ratio by applying OLS to a given number of observations, say 500. Further, assume that the estimated slope and the estimated intercept for the 500 observations are -0.3 and 90, respectively. Then we will have the OLS regression line as:

$$ \widehat { Test\quad Score } =90-0.3\times STR $$

Where \(\widehat { Test\quad Score }\) is the predicted average test score in the school based on the OLS regression line, and \(STR\) is the student-teacher ratio. The slope of -0.3 means that increasing the student-teacher ratio by one student per class is predicted to lower schoolwide test scores by 0.3 points.

The implication of the negative slope is that larger numbers of students per teacher are associated with poorer performance.
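The fitted line above can be used directly for prediction; a minimal sketch (the function name is illustrative):

```python
def predicted_test_score(str_ratio):
    """Predicted average test score from the OLS line: 90 - 0.3 * STR."""
    return 90.0 - 0.3 * str_ratio

# Raising the student-teacher ratio by one student lowers the
# predicted score by 0.3 points.
print(predicted_test_score(20))            # → 84.0
print(round(predicted_test_score(21), 1))  # → 83.7
```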

## Reasons for Applying the OLS Estimator

OLS can be said to be the common language of regression analysis. Virtually all spreadsheets and statistical software packages have the OLS formulas built in, making OLS easy to use.

The OLS estimator is also unbiased and consistent, based on some assumptions that will be studied later; under additional assumptions, it is also efficient.

## Measures of Fit

To determine how accurately the OLS regression line fits the data, we apply the \({ R }^{ 2 }\) and the regression’s standard error. The \({ R }^{ 2 }\) is a measure of the fraction of the variance of \({ Y }_{ i }\) explained by \({ X }_{ i }\), and its values lie between 0 and 1. To measure how far \({ Y }_{ i }\) typically is from its predicted value, the standard error of the regression comes into play.

### \({ R }^{ 2 }\)

The fraction of the sample variance of \({ Y }_{ i }\) explained by \({ X }_{ i }\) is referred to as the regression \({ R }^{ 2 }\). The dependent variable \({ Y }_{ i }\) can be written as follows:

$$ { Y }_{ i }=\hat { { Y }_{ i } } +{ \hat { u } }_{ i } $$

Where \({ \hat { u } }_{ i }\) is the residual, and \(\hat { { Y }_{ i } }\) is the predicted value. The ratio of the sample variance of \(\hat { { Y }_{ i } }\) to the sample variance of \({ Y }_{ i }\) is \({ R }^{ 2 }\).

\({ R }^{ 2 }\) can be expressed mathematically as the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS). For ESS, the squared deviations of the predicted values of \({ Y }_{ i }\), \(\hat { { Y }_{ i } }\), from their average are summed up. The sum of squared deviations of \({ Y }_{ i }\) from its average is referred to as the total sum of squares.

$$ ESS=\sum _{ i=1 }^{ n }{ { \left( \hat { { Y }_{ i } } -\bar { Y } \right) }^{ 2 } } $$

$$ TSS=\sum _{ i=1 }^{ n }{ { \left( { Y }_{ i }-\bar { Y } \right) }^{ 2 } } $$

Therefore:

$$ { R }^{ 2 }=\frac { ESS }{ TSS } $$

Furthermore, the fraction of the variance of \({ Y }_{ i }\) not explained by \({ X }_{ i }\) is an alternative way of expressing \({ R }^{ 2 }\). In this regard, we apply the sum of squared residuals, SSR, defined as:

$$ SSR=\sum _{ i=1 }^{ n }{ { \hat { u } }_{ i }^{ 2 } } $$

It is important to note that:

$$ TSS=ESS+SSR $$

Therefore:

$$ { R }^{ 2 }=1-\frac { SSR }{ TSS } $$

The square of the correlation coefficient between \(Y\) and \(X\) gives the \({ R }^{ 2 }\) of the regression of \(Y\) on the single regressor \(X\).
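The identities above can be checked numerically. The data here are hypothetical, and the line is fitted by OLS with an intercept, which is what makes \(TSS=ESS+SSR\) hold exactly:

```python
# Hypothetical data, fitted by OLS so that TSS = ESS + SSR holds.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
     sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]              # predicted values

tss = sum((yi - y_bar) ** 2 for yi in y)        # total sum of squares
ess = sum((fi - y_bar) ** 2 for fi in y_hat)    # explained sum of squares
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # sum of squared residuals

print(abs(tss - (ess + ssr)) < 1e-9)            # → True  (TSS = ESS + SSR)
print(abs(ess / tss - (1 - ssr / tss)) < 1e-9)  # → True  (both R^2 formulas agree)
```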

### The Standard Error of the Regression

The standard deviation of the regression error \({ u }_{ i }\) is estimated by the standard error of the regression (SER).

Let \({ u }_{ 1 },\dots { u }_{ n }\) be the unobserved regression errors. Then SER is given as:

$$ SER={ s }_{ \hat { u } } $$

$$ { s }_{ \hat { u } }^{ 2 }=\frac { 1 }{ n-2 } \sum _{ i=1 }^{ n }{ { \hat { u } }_{ i }^{ 2 } } =\frac { SSR }{ n-2 } $$

No mean is subtracted in the formula for \({ s }_{ \hat { u } }^{ 2 }\) because the sample average of the OLS residuals is zero (when an intercept is included). We use the divisor \(n - 2\) since it corrects for a slight bias attributed to the estimation of two regression coefficients. This correction is often termed “the degrees of freedom correction.”
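A short sketch of the SER computation on hypothetical data; note that the residuals average to zero because the fitted line includes an intercept:

```python
import math

# SER = sqrt(SSR / (n - 2)) for an OLS-fitted line (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
     sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ssr = sum(u ** 2 for u in residuals)
ser = math.sqrt(ssr / (n - 2))   # divisor n - 2: degrees-of-freedom correction

# The sample average of the OLS residuals is (numerically) zero:
print(abs(sum(residuals)) < 1e-9)   # → True
```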

## The Least Squares Assumptions

#### Assumption number 1: The conditional distribution of \({ u }_{ i }\) given \({ X }_{ i }\) has a mean of zero

According to this assumption, the other factors contained in \({ u }_{ i }\) are unrelated to \({ X }_{ i }\), in the sense that, for any given value of \({ X }_{ i }\), these other factors average out to zero.

#### Assumption number 2: \(\left( { X }_{ i },{ Y }_{ i } \right) ,i=1,\dots ,n\) Are independently and identically distributed

This assumption concerns the drawing of the sample. According to this assumption, \(\left( { X }_{ i },{ Y }_{ i } \right) ,i=1,\dots ,n\) are \(i.i.d\) if simple random sampling is applied when drawing observations from a single large population. Although the \(i.i.d\) assumption is reasonable for many data collection schemes, not all sampling schemes produce \(i.i.d\) observations on \(\left( { X }_{ i },{ Y }_{ i } \right)\).

#### Assumption number 3: Large outliers are unlikely

In this assumption, observations whose values of \({ X }_{ i }\) and/or \({ Y }_{ i }\) fall far outside the usual range of the data are unlikely. These observations are known as large outliers. Results of OLS regression can be misleading due to large outliers. We assume that \(X\) and \(Y\) possess nonzero, finite fourth moments, that is, finite kurtosis:

$$ 0<E\left( { X }_{ i }^{ 4 } \right) <\infty $$

$$ 0<E\left( { Y }_{ i }^{ 4 } \right) <\infty $$

The finite-kurtosis assumption justifies the large-sample approximations to the distributions of the OLS test statistics. Errors during data entry are a common source of large outliers.
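The sensitivity of OLS to a single large outlier can be illustrated with hypothetical data, where one value suffers a data-entry error:

```python
def ols_slope(x, y):
    """OLS slope: sample covariance of X and Y over sample variance of X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
           sum((a - x_bar) ** 2 for a in x)

# Clean data lying exactly on the line Y = 2X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(ols_slope(x, y))        # → 2.0

# One data-entry error (100 typed instead of 10) drags the slope far away.
y_bad = [2.0, 4.0, 6.0, 8.0, 100.0]
print(ols_slope(x, y_bad))    # → 20.0
```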

## The Application of the Least Squares Assumptions

Firstly, the least squares assumptions play a mathematical role: if they hold, the sampling distributions of the OLS estimators are normal in large samples. This large-sample normal distribution underpins the development of hypothesis testing methods and the construction of confidence intervals using the OLS estimators.

The least squares assumptions also help organize the circumstances that pose difficulties for OLS regression.

However, the second assumption, while it plausibly holds in most cross-sectional data sets, is often inappropriate for time series data. Regression models created under this assumption might therefore need modification for some applications with time series data.

According to the third assumption, OLS, like the sample mean, can be sensitive to large outliers. Therefore, one should verify that the observations are correctly recorded and belong in the data set.

## The Sampling Distribution of the OLS Estimators

#### Review of the Sampling Distribution of \(\bar { Y } \)

The random variable \(\bar { Y } \) takes on different values from one sample to the next, since a randomly drawn sample is used to compute it. Its sampling distribution summarizes the probability of these different values.

\(\bar { Y } \) is an unbiased estimator of the population mean of \(Y\): \(E\left( \bar { Y } \right) ={ \mu }_{ Y }\). According to the central limit theorem, its distribution in large samples is approximately normal.

#### The Sampling Distribution of \({ \hat { \beta } }_{ 0 }\) and \({ \hat { \beta } }_{ 1 }\)

\({ \hat { \beta } }_{ 0 }\) and \({ \hat { \beta } }_{ 1 }\) are random variables taking on different values from one sample to the next. This is due to the fact that the computation of OLS estimators is done using a random sample. Their sampling distribution summarizes the probability of these different values.

It is important to note that:

$$ E\left( { \hat { \beta } }_{ 0 } \right) ={ \beta }_{ 0 },\quad and $$

$$ E\left( { \hat { \beta } }_{ 1 } \right) ={ \beta }_{ 1 } $$

This means that \({ \hat { \beta } }_{ 0 }\) and \({ \hat { \beta } }_{ 1 }\) are unbiased estimators of \({ \beta }_{ 0 }\) and \({ \beta }_{ 1 }\). Furthermore, according to the central limit theorem, the bivariate normal distribution approximates the sampling distribution of \({ \hat { \beta } }_{ 0 }\) and \({ \hat { \beta } }_{ 1 }\) in large samples, where \({ \hat { \beta } }_{ 0 }\) and \({ \hat { \beta } }_{ 1 }\) have normal marginal distributions.

If the least squares assumptions in the previous section hold, then in large samples the sampling distribution of \({ \hat { \beta } }_{ 0 }\) and \({ \hat { \beta } }_{ 1 }\) is jointly normal. \({ \hat { \beta } }_{ 1 }\) will have a large-sample normal distribution of:

$$ N\left( { \beta }_{ 1 },{ \sigma }_{ \hat { { \beta }_{ 1 } } }^{ 2 } \right) $$

The variances of these large-sample distributions are:

$$ { \sigma }_{ \hat { { \beta }_{ 1 } } }^{ 2 }=\frac { 1 }{ n } \frac { var\left[ \left( { X }_{ i }-{ \mu }_{ X } \right) { u }_{ i } \right] }{ { \left[ var\left( { X }_{ i } \right) \right] }^{ 2 } } $$

$$ { \sigma }_{ \hat { { \beta }_{ 0 } } }^{ 2 }=\frac { 1 }{ n } \frac { var\left( { H }_{ i }{ u }_{ i } \right) }{ { \left[ E\left( { H }_{ i }^{ 2 } \right) \right] }^{ 2 } } $$

Where:

$$ { H }_{ i }=1-\left( \frac { { \mu }_{ X } }{ E\left( { X }_{ i }^{ 2 } \right) } \right) { X }_{ i } $$

Based on the above equations, we can say that the OLS estimators are consistent: provided the sample size is large, \({ \hat { \beta } }_{ 0 }\) and \({ \hat { \beta } }_{ 1 }\) will be close to the true population coefficients, \({ \beta }_{ 0 }\) and \({ \beta }_{ 1 }\), with high probability. This is because the variances of the estimators, \({ \sigma }_{ \hat { { \beta }_{ 0 } } }^{ 2 }\) and \({ \sigma }_{ \hat { { \beta }_{ 1 } } }^{ 2 }\), decrease as \(n\) increases. Moreover, a larger variance of \({ X }_{ i }\) makes \({ \sigma }_{ \hat { { \beta }_{ 1 } } }^{ 2 }\) smaller and \({ \hat { \beta } }_{ 1 }\) more precise.
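Consistency can be illustrated with a small Monte Carlo sketch. The true coefficients, sample sizes, and error distribution below are all illustrative assumptions; the point is that the spread of the slope estimates shrinks as \(n\) grows:

```python
import random
import statistics

def ols_slope(x, y):
    """OLS slope: sample covariance of X and Y over sample variance of X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
           sum((a - x_bar) ** 2 for a in x)

def slope_spread(n, reps=200):
    """Standard deviation of the slope estimate across repeated samples."""
    estimates = []
    for _ in range(reps):
        # Draw a fresh sample from the assumed model Y = 2 + 3X + u.
        x = [random.gauss(0, 1) for _ in range(n)]
        y = [2 + 3 * xi + random.gauss(0, 1) for xi in x]
        estimates.append(ols_slope(x, y))
    return statistics.stdev(estimates)

random.seed(0)
spread_small, spread_large = slope_spread(25), slope_spread(400)
print(spread_small > spread_large)   # → True: larger n, tighter estimates
```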