13 Simple Regression

Suppose we have some bivariate data: \(\hat{X}_{i}, \hat{Y}_{i}\). First, we inspect it as in Part I. We then assess the association between variables by fitting a line through the data points using a “regression”.

Code

# Bivariate Data from USArrests
xy <- USArrests[,c('Murder','UrbanPop')]
colnames(xy) <- c('y','x')

# Inspect Dataset
# head(xy)
# summary(xy)
plot(y~x, xy, col=grey(0,.5), pch=16,
    xlab='Population Share in Urban Area',
    ylab='Murder Arrests per 100K')
title('Data from American States, 1975', font.main=1)

reg <- lm(y~x, dat=xy)
abline(reg, lty=2)

13.1 Simple Linear Regression

This refers to fitting a linear model to bivariate data.

Model and Objective.

Specifically, our model is \[\begin{eqnarray} \hat{Y}_{i}=b_{0}+b_{1} \hat{X}_{i}+e_{i}, \end{eqnarray}\] where \(b_{0}\) and \(b_{1}\) are parameters, often referred to as “coefficients”, and \(\hat{X}_{i}, \hat{Y}_{i}\) are data for observation \(i\), and \(e_{i}\) is a residual error term that represents the difference between the model and the data.

We then find the parameters which best-fit the data. Specifically, our objective is to \[\begin{eqnarray} \min_{b_{0}, b_{1}} \sum_{i=1}^{n} \left( e_{i} \right)^2 &=& \min_{b_{0}, b_{1}} \sum_{i=1}^{n} \left( \hat{Y}_{i} - [b_{0}+b_{1} \hat{X}_{i}] \right)^2. \end{eqnarray}\] Minimizing the sum of squared errors then yields two optimal coefficients that can be explicitly solved for using math. For the slope coefficient: \[\begin{eqnarray} 0 &=& \sum_{i=1}^{n} 2\left( \hat{Y}_{i} - [b_{0}+b_{1} \hat{X}_{i}] \right) \hat{X}_{i} \\ \Rightarrow \hat{b}_{1} &=& \frac{\sum_{i}^{n}(\hat{X}_{i}-\hat{M}_{X})(\hat{Y}_{i}-\hat{M}_{Y})}{\sum_{i}^{}(\hat{X}_{i}-\hat{M}_{X})^2} = \frac{\hat{C}_{XY}}{\hat{V}_{X}}, \end{eqnarray}\] the latter term being the estimated covariance between \(X\) and \(Y\) divided by the variance of \(X\). (Recall that \(M_{X}\) and \(M_{Y}\) are the means.) For the intercept coefficient: \[\begin{eqnarray} 0 &=& \sum_{i=1}^{n} 2\left( \hat{Y}_{i} - [b_{0}+b_{1} \hat{X}_{i}] \right)\\ \Rightarrow \hat{b}_{0} &=& \hat{M}_{Y}-\hat{b}_{1}\hat{M}_{X} . \end{eqnarray}\] We could alternatively find the best fitting parameters numerically, by trying out different combinations of \((b_{0}, b_{1})\). You can do that in this example, which can help you understand what is happening. The computer uses math to compute the answer directly, because it is much faster.

Code

# Run a Simple Regression
reg <- lm(y~x, dat=xy)
coef(reg)
## (Intercept)           x 
##  6.41594246  0.02093466

# Manual verification
x <- xy[,'x']
y <- xy[,'y']
b1 <- cov( x,y)/var(x)
b1
## [1] 0.02093466
b0 <- mean(y) - b1*mean(x)
b0
## [1] 6.415942

Once we have the coefficient, we can find the sample predictions \[\begin{eqnarray} \hat{y}_{i} &=& \hat{b}_{0}+\hat{b}_{1}\hat{X}_{i}\\ \hat{e}_i &=& \hat{Y}_{i}-\hat{y}_{i} \end{eqnarray}\] and make out-of-sample predictions at design points \(x\) \[\begin{eqnarray} \hat{y}(x) = \hat{b}_0 + \hat{b}_1 x \end{eqnarray}\]

Code

# Find predicted values and residuals
predict(reg)
resid(reg)

# Manual verification
y_hat <- b0+b1*x
y_hat

e <- y - y_hat
e

Note

Suppose we have a dataset with \(n = 3\) observations: \(\{(1,2), (2,2.5), (3,4)\}\).

We can compute the regression coefficients and model predictions in five steps.

Step 1. Sample Means

\[\begin{eqnarray} \hat{M}_X &=& \frac{1+2+3}{3} = 2 \\ \hat{M}_Y &=& \frac{2 + 2.5 + 4}{3} = \frac{17}{6}. \end{eqnarray}\]

Step 2. Covariance and Variance

\[\begin{array}{c|c|c|c|c|c|c|} i & X_i & Y_i & (X_i - \hat{M}_X) & (Y_i - \hat{M}_Y) & (X_i - \hat{M}_X)(Y_i - \hat{M}_Y) & (X_i - \hat{M}_X)^2 \\ \hline 1 & 1 & 2 & -1 & -\tfrac{5}{6} & \tfrac{5}{6} & 1 \\ 2 & 2 & 2.5 & 0 & -\tfrac{1}{3} & 0 & 0 \\ 3 & 3 & 4 & 1 & \tfrac{7}{6} & \tfrac{7}{6} & 1 \\ \end{array}\]

\[\begin{eqnarray} \hat{C}_{XY} &=& \sum_{i=1}^3 (X_i - \hat{M}_X)(Y_i - \hat{M}_Y) /3 = [\tfrac{5}{6} + \tfrac{7}{6}]/3 = 2/3 \\ \hat{V}_X &=& \sum_{i=1}^3 (X_i - \hat{M}_X)^2 = [1 + 1]/3 = 2/3. \end{eqnarray}\]

Step 3. Slope and Intercept

\[\begin{eqnarray} \hat{b}_1 = \frac{\hat{C}_{XY}}{\hat{V}_X} = \frac{2/3}{2/3} = 1, \end{eqnarray}\]

\[\begin{eqnarray} \hat{b}_0 = \hat{M}_Y - \hat{b}_1 \hat{M}_X = \frac{17}{6} - 2 = \frac{5}{6} \approx 0.83. \end{eqnarray}\]

\[\begin{eqnarray} \hat{y}_i = \hat{b}_0 + \hat{b}_1 X_i = \frac{5}{6} + X_i. \end{eqnarray}\]

Step 4. Fitted Values and Residuals

\[\begin{eqnarray} \begin{array}{c|c|c|c|c} i & X_i & Y_i & \hat{y}_i = \tfrac{5}{6} + X_i & \hat{e}_i = Y_i - \hat{y}_i \\ \hline 1 & 1 & 2 & \tfrac{11}{6} & \tfrac{1}{6} \\ 2 & 2 & 2.5 & \tfrac{17}{6} & -\tfrac{1}{3} \\ 3 & 3 & 4 & \tfrac{23}{6} & \tfrac{1}{6} \\ \end{array} \end{eqnarray}\]

Step 5: Prediction at chosen design point \(x=2.5\) \[\begin{eqnarray} \hat{y}(2) &=& \frac{5}{6} + 2.5 = (5+15)/6 = 20/6 \approx 3.3 \\ \end{eqnarray}\]

Code

## Data
xy0 <- data.frame(
    x=c(1, 2, 3),
    y=c(2, 2.5, 4))

## Simple Linear Regression
reg0 <- lm(y~x, dat=xy0)
coef(reg0)
## (Intercept)           x 
##   0.8333333   1.0000000

## Sample Predictions
predict(reg0)
##        1        2        3 
## 1.833333 2.833333 3.833333
resid(reg0)
##          1          2          3 
##  0.1666667 -0.3333333  0.1666667

## Out-of-Sample predictions
predict(reg0, newdata=data.frame(x=c(2.5)))
##        1 
## 3.333333

Goodness of Fit.

First, we qualitatively analyze the ‘’Goodness of fit’’ of our model by plotting it against the data. For a quantitative summary, we can also compute the linear correlation between the model predictions and the sample data: \(\hat{R}_{yY} = \hat{C}_{yY}/[\hat{S}_{y} \hat{S}_{Y}]\). With linear models, we typically compute the “R-squared” statistic directly, \([\hat{R}_{yY}]^2\) using the sums of squared errors (Total, Explained, and Residual) \[\begin{eqnarray} \underbrace{\sum_{i}(\hat{Y}_{i}-\hat{M}_{Y})^2}_\text{TSS} =\underbrace{\sum_{i}(\hat{y}_i-\hat{M}_{Y})^2}_\text{ESS}+\underbrace{\sum_{i}\hat{e}_{i}^2}_\text{RSS}\\ \hat{R}_{yY}^2 = \frac{\hat{ESS}}{\hat{TSS}}=1-\frac{\hat{RSS}}{\hat{TSS}} \end{eqnarray}\] This is also known as the “coefficient of determination”. Intuitively, it is the percentage of all variation explained by our model, and must theorefore fall in \([0,1]\).

Code

# Manually Compute R2
Ehat <- resid(reg)
RSS  <- sum(Ehat^2)
Y <- xy[,'y']
TSS  <- sum((Y-mean(Y))^2)
R2 <- 1 - RSS/TSS
R2
## [1] 0.00484035

# Check R2
summary(reg)$r.squared
## [1] 0.00484035

# Double Check R2
R <- cor(xy[,'y'], predict(reg))
R^2
## [1] 0.00484035

Note

Suppose you have data on education level and wages. Conduct a linear regression. Then summarize the model relationship and how well it fits the data.

Code

library(wooldridge)
xy <- wage1[,c('educ','wage')]

Tip

We use the dataset \(\{(1,2), (2,2.5), (3,4)\}\) to compute Goodness of fit.

From before, we know that \(\hat{M}_Y = \frac{17}{6}\) and the fitted values and residuals from the regression are \[\begin{eqnarray} \hat{y}_1 = \frac{11}{6},\qquad \hat{y}_2 = \frac{17}{6},\qquad \hat{y}_3 = \frac{23}{6}.\\ \hat{e}_1 = \frac{1}{6},\qquad \hat{e}_2 = -\frac{1}{3},\qquad \hat{e}_3 = \frac{1}{6}. \end{eqnarray}\]

Step 1. Total Sum of Squares (TSS)

\[\begin{eqnarray} \begin{aligned} \hat{Y}_1 - \hat{M}_Y &= 2 - \frac{17}{6} = -\frac{5}{6},\\ \hat{Y}_2 - \hat{M}_Y &= 2.5 - \frac{17}{6} = -\frac{1}{3},\\ \hat{Y}_3 - \hat{M}_Y &= 4 - \frac{17}{6} = \frac{7}{6}. \end{aligned} \end{eqnarray}\]

\[\begin{eqnarray} \begin{aligned} \hat{TSS} &= \sum_{i} (\hat{Y}_i - \hat{M}_Y)^2 \\ &= \left(-\frac{5}{6}\right)^2 + \left(-\frac{1}{3}\right)^2 + \left(\frac{7}{6}\right)^2 \\ &= \frac{25}{36} + \frac{1}{9} + \frac{49}{36} = \frac{78}{36} = \frac{13}{6}. \end{aligned} \end{eqnarray}\]

Step 2. Explained Sum of Squares (ESS)

\[\begin{eqnarray} \hat{y}_1 - \hat{M}_Y = -1,\qquad \hat{y}_2 - \hat{M}_Y = 0,\qquad \hat{y}_3 - \hat{M}_Y = 1. \end{eqnarray}\]

\[\begin{eqnarray} \hat{ESS} = \sum_i (\hat{y}_i - \hat{M}_Y)^2. = (-1)^2 + 0^2 + 1^2 = 2. \end{eqnarray}\]

Step 3. Residual Sum of Squares (RSS)

\[\begin{eqnarray} \hat{RSS} = \sum_{i} \hat{e}_i^2 = \left(\frac{1}{6}\right)^2 + \left(-\frac{1}{3}\right)^2 + \left(\frac{1}{6}\right)^2. \end{eqnarray}\]

\[\begin{eqnarray} \hat{RSS} = \frac{1}{36} + \frac{1}{9} + \frac{1}{36} = \frac{1}{36} + \frac{4}{36} + \frac{1}{36} = \frac{6}{36} = \frac{1}{6}. \end{eqnarray}\]

Step 4. Check the Decomposition

\[\begin{eqnarray} \hat{ESS} + \hat{RSS} = 2 + \frac{1}{6} = \frac{12}{6} + \frac{1}{6} = \frac{13}{6} = \hat{TSS}. \end{eqnarray}\]

Step 5. Coefficient of Determination

\[\begin{eqnarray} \hat{R}_{yY}^2 = \frac{\hat{ESS}}{\hat{TSS}} = \frac{2}{13/6} = \frac{12}{13} \approx 0.9 \end{eqnarray}\]

(Verification) \[\begin{eqnarray} 1 - \frac{\hat{RSS}}{\hat{TSS}} = 1 - \frac{1/6}{13/6} = 1 - \frac{1}{13} = \frac{12}{13}. \end{eqnarray}\]

Code

# Compute R2
Ehat0 <- resid(reg0)
RSS0  <- sum(Ehat0^2)
Y0 <- xy0[,'y']
TSS0  <- sum((Y0-mean(Y0))^2)
R2 <- 1 - RSS0/TSS0

# compare with our calculation: 12/13

# compare with our intuitive benchmark
cor(Y0, predict(reg0))
## [1] 0.9607689

13.2 Variability Estimates

A regression coefficient is a statistic. And, just like all statistics, we can estimate it’s variability using a

standard error: variability across different samples (c.f. standard deviation: variability within a single sample.)
confidence interval: range your statistic varies across different samples.

Note that values reported by your computer do not necessarily satisfy this definition. To calculate these statistics, we will estimate variability using the data-driven methods from https://jadamso.github.io/Rbooks/01_07_Intervals.html. (For some theoretical background, see also https://www.sagepub.com/sites/default/files/upm-binaries/21122_Chapter_21.pdf.)

Jackknife.

We first consider the simplest, the jackknife. In this procedure, we loop through each row of the dataset. And, in each iteration of the loop, we drop that observation from the dataset and reestimate the statistic of interest. We then calculate the standard deviation of the statistic across all subsamples to get jackknife standard errors and, assuming the data are approximately normal, use the quantiles to construct a CI. Alternatively, we can just take the percentiles of the jackknife distribution directly.

Code

# Original OLS coefficient
slope <-  coef(reg)[2]
 
# Jackknife Sampling Distribution for OLS Coefficient
jack_coefs <- vector(length=nrow(xy))
for(i in seq_along(jack_coefs)){
    xy_i <- xy[-i,]
    reg_i <- lm(y~x, dat=xy_i)
    slope_i <- coef(reg_i)[2]
    jack_coefs[i] <- slope_i
}


# Plot Jackknife Sampling Distribution
# + Percentile Confidence Interval
hist(jack_coefs, breaks=25,
    main='Jackknife Distribution with 95% CIs',
    font.main=1, border=NA,
    freq=F,
    xlab=expression(hat(b)[-i]))
jack_ci_percentile <- quantile(jack_coefs, probs=c(.025,.975))
abline(v=jack_ci_percentile, lty=2)

# Normal Approx Confidence Interval
jack_se <- sd(jack_coefs)
jack_ci_normal <- qnorm(c(.025,.975), slope, jack_se)
abline(v=jack_ci_normal, lty=3)

Bootstrap.

There are several resampling techniques. The other main one is the bootstrap, which resamples with replacement for an arbitrary number of iterations. When bootstrapping a dataset with \(n\) observations, you randomly resample all \(n\) rows in your data set \(B\) times. We then calculate the standard deviation of the statistic across all boostrap samples to get bootstrap standard errors and, assuming the data are approximately normal, use the quantiles to construct a CI. Alternatively, we can just take the percentiles of the bootstrap distribution.

Code

# Bootstrap Sampling Distribution
boot_coefs <- vector(length=399)
for(i in seq_along(boot_coefs)){
    b_id <- sample( nrow(xy), replace=T)
    xy_b <- xy[b_id,]
    reg_b <- lm(y~x, dat=xy_b)
    slope_b <- coef(reg_b)[2]
    boot_coefs[i] <- slope_b
}

# Plot Bootstrap Sampling Distribution
# + Percentile CI
hist(boot_coefs, breaks=25,
    main='Bootstrap Distribution with 95% CIs',
    font.main=1, border=NA,
    freq=F,
    xlab=expression(hat(b)[b]))
boot_ci_percentile <- quantile(boot_coefs, probs=c(.025,.975))
abline(v=boot_ci_percentile, lty=2)

# Normal Approx Confidence Intervals
boot_se <- sd(boot_coefs)
jack_ci_normal <- qnorm(c(.025,.975), slope, boot_se)
abline(v=jack_ci_normal, lty=3)

13.3 Hypothesis Tests

We can also bootstrap other statistics, often to test a null hypothesis of “no relationship”. We are rarely interested in computing standard errors and conducting hypothesis tests for simple linear regressions in practice, but work through the ideas with two variables before moving to analyze multiple variables.

Invert a CI.

One main way to conduct hypothesis tests is to examine whether a confidence interval contains a hypothesized value. Does the slope coefficient equal \(0\)? For reasons we won’t go into in this class, we typically normalize the coefficient by its standard error: \(\hat{t} = \frac{\hat{b}}{\hat{s}_{\hat{b}}}\), where \(\hat{s}_{\hat{b}}\) is the estimated standard error of the coefficient.

Code

# Create t-stat with jackknife SE
t_hat <- coef(reg)['x']/jack_se

boot_t <-  vector(length=399)
for(b in seq_along(boot_t)){
    # Bootstrap data
    b_id <- sample( nrow(xy), replace=T)
    xy_b <- xy[b_id,]
    # Redo regression
    reg_b <- lm(y~x, dat=xy_b)
    slope_b <- coef(reg_b)[2]
    # Redo Jackknife SEs
    jack_coefs_b <- vector(length=nrow(xy_b))
    for(i in seq_along(jack_coefs_b)){
        xy_b_i <- xy_b[-i,]
        reg_b_i <- lm(y~x, dat=xy_b_i)
        slope_b_i <- coef(reg_b_i)[2]
        jack_coefs_b[i] <- slope_b_i
    }
    jack_se_b <- sd(jack_coefs_b)
    # Redo t with Jackknife SE
    t_hat_b <- slope_b/jack_se_b
    boot_t[b] <- t_hat_b
}

hist(boot_t, breaks=25,
    main='Bootstrap t with Jackknife SE',
    font.main=1, border=NA, freq=F,
    xlab=expression(hat(t)[b]), 
    xlim=range(c(0, boot_t)) )
abline(v=quantile(boot_t, probs=c(.025,.975)), lty=2)
abline(v=0, col="red", lwd=2)

Note

Suppose you have data on education level and wages. Conduct a linear regression. Then construct a \(95\%\) confidence interval for the slope coefficient and test the hypothesis of no relationship.

Code

library(wooldridge)
xy <- wage1[,c('educ','wage')]

Impose the Null.

We can also compute a null distribution. We focus on the simplest: simulations that each impose the null hypothesis and re-estimate the statistic of interest. Specifically, we compute the distribution of \(t\)-values on data with randomly reshuffled outcomes (imposing the null), and compare how extreme the observed value is. We can sample with replacement (i.e., the null bootstrap) or without (permutation), just as with the correlation statistic.

Code

# Null Distribution for Reg Coef
null_t <-  vector(length=399)
for(b in seq_along(null_t)){
    # Permuted data
    xy_b <- xy
    xy_b[,'y'] <- sample( xy_b[,'y'], replace=F) #Bootstrap: replace=T
    reg_b <- lm(y~x, dat=xy_b)
    # Redo regression
    reg_b <- lm(y~x, dat=xy_b)
    slope_b <- coef(reg_b)[2]
    # Redo Jackknife SEs
    jack_coefs_b <- vector(length=nrow(xy_b))
    for(i in seq_along(jack_coefs_b)){
        xy_b_i <- xy_b[-i,]
        reg_b_i <- lm(y~x, dat=xy_b_i)
        slope_b_i <- coef(reg_b_i)[2]
        jack_coefs_b[i] <- slope_b_i
    }
    # Redo t with Jackknife SE
    t_hat_b <- slope_b/ sd(jack_coefs_b)
    null_t[b] <- t_hat_b
}

# Null Distribution
hist(null_t, breaks=25,
    main='Null Distribution',
    font.main=1, border=NA, freq=F,
    xlab=expression(hat(t)[b]),
    xlim=range(null_t))
# 95% CI
null_ci <- quantile(null_t, probs=c(.025,.975))
abline(v=null_ci, lty=2)
# Observed value
abline(v=t_hat, col="red", lwd=2)

Permutations are common when testing against “no association”. From permuted data, we can calculate a \(p\)-value: the probability you would see something as at least as extreme as your statistic under the null (assuming your null hypothesis was true, see https://jadamso.github.io/Rbooks/01_08_HypothesisTests.html). We calculate the \(p\)-value directly from the null distribution. Smaller \(p\)-values suggest more evidence against the null.

Code

# Two Sided Test for P(t > jack_t or t < -jack_t | Null of t=0)
That_NullDist2 <- ecdf(abs(null_t))
Phat2  <-  1-That_NullDist2( abs(t_hat))
Phat2
## [1] 0.6365915
plot(That_NullDist2, xlim=range(null_t, t_hat),
    xlab=expression( abs(hat(t)[b]) ),
    main='Null Distribution', font.main=1)
abline(v=t_hat, col='red')

Note

Suppose you have data on grades completed and wages. Conduct a linear regression. Then compute a \(p\)-value for the null hypothesis of no relationship between education level and wages.

Code

library(wooldridge)
xy <- wage1[,c('educ','wage')]

Caveats.

We could also use hard decision rules, such as “\(p < 0.05\) is a statistically significant finding”. However, just like confidence intervals, those hard decision rules can be sensitive to somewhat arbitrary choices (Why Jackknife SE’s instead of classical ones? Why test at the \(95\%\) level rather than \(90\%\)?)

“No association” is the most common null hypothesis in practice, but we can also use bootstrapping to test against other specific hypothesis: \(\beta\). To impose the null in this case, you recenter the sampling distribution around the hypothetical value; \(\hat{t} = \frac{\hat{b} - \beta}{\hat{s}_{\hat{b}}}\).¹

Although two-sided hypothesis tests are most common, you can also test one-sided hypotheses.

13.4 Association is not Causation

The same caveats about “correlation is not causation” extend to regression. You may be tempted to use the term “the effect”, but that interpretation of a regression coefficient assumes the linear model is true. If you fit a line to a non-linear relationship, then you will still get back a coefficient even though there is no singular the effect: the true relationship is non-linear! Also consider a classic example, Anscombe’s Quartet, which shows four very different datasets that give the same linear regression coefficient. Notice that you understand the problem because we used scatterplots to visual the data.²

Code

# Anscombe's Quartet 

par(mfrow=c(2,2))
for(i in 1:4){
    xi <- anscombe[,paste0('x',i)]
    yi <- anscombe[,paste0('y',i)]
    plot(xi, yi, ylim=c(4,13), xlim=c(4,20),
        pch=16, col=grey(0,.6))
    reg <- lm(yi~xi)
    b <- round(coef(reg)[2],2)
    p <- round(summary(reg)$coefficients[2,4],4)
    abline(reg, col='orange')
    title(paste0("Slope=", b,', p=',p), font.main=1)
}

Code


## For an even better example, see `Datasaurus Dozen'F
#browseURL(
#'https://bookdown.org/paul/applied-data-visualization/
#why-look-the-datasaurus-dozen.html')

It is true that linear regression “is the best linear predictor of the nonlinear regression function if the mean-squared error is used as the loss function.” But this is not a carte-blanche justification for OLS, as the best of the bad predictors is still a bad predictor. For many economic applications, it is more helpful to think and speak of “dose response curves” instead of “the effect”.

While adding interaction terms or squared terms allows one incorporate heterogeneity and non-linearity, they change several features of the model (most of which are not intended). Often, there are nonsensical predicted values. For example, if the most of your age data are between \([23,65]\), a quadratic term can imply silly things for people aged \(10\) or \(90\).

Nonetheless, linear regression provides an important piece of quantitative information that is understood by many. All models are an approximation, and sometimes only unimportant nuances are missing from a vanilla linear model. Other times, that model can be seriously misleading. (This is especially true if your making policy recommendations based on a universal “the effect”.) As an exploratory tool, linear regession is a good guess but one whose point estimates should not be taken too seriously (in which case, the standard errors are also much less important). Before trying to find a regression specification that makes sense for the entire dataset, explore local relationships.

Under some additional assumptions, the null distribution follows a \(t\)-distribution. (For more on parametric t-testing based on statistical theory, see https://www.econometrics-with-r.org/4-lrwor.html.)↩︎
The same principles holds when comparing two groups: http://www.stat.columbia.edu/~gelman/research/published/causal_quartet_second_revision.pdf ↩︎