13  Simple Regression


Suppose we have some bivariate data: \(\hat{X}_{i}, \hat{Y}_{i}\). First, we inspect it as in Part I.

Code
# Bivariate Data from USArrests
xy <- USArrests[,c('Murder','UrbanPop')]
colnames(xy) <- c('y','x')

# Inspect Dataset
# head(xy)
# summary(xy)
plot(y~x, xy, col=grey(0,.5), pch=16)
title('Murder and Urbanization in America 1975', font.main=1)

Now we will assess the association between variables by fitting a line through the data points using a “regression”.

13.1 Simple Linear Regression

This refers to fitting a linear model to bivariate data. Specifically, our model is \[\begin{eqnarray} \hat{Y}_{i}=b_{0}+b_{1} \hat{X}_{i}+e_{i}, \end{eqnarray}\] where \(b_{0}\) and \(b_{1}\) are parameters, often referred to as “coefficients”; \(\hat{X}_{i}, \hat{Y}_{i}\) are the data for observation \(i\); and \(e_{i}\) is a residual error term that represents the difference between the model and the data.

We then find the parameters that best fit the data. Specifically, our objective function is \[\begin{eqnarray} \min_{b_{0}, b_{1}} \sum_{i=1}^{n} \left( e_{i} \right)^2 &=& \min_{b_{0}, b_{1}} \sum_{i=1}^{n} \left( \hat{Y}_{i} - [b_{0}+b_{1} \hat{X}_{i}] \right)^2. \end{eqnarray}\] Minimizing the sum of squared errors yields two parameter estimates. The first-order condition for \(b_{0}\) gives \[\begin{eqnarray} 0 &=& \sum_{i=1}^{n} 2\left( \hat{Y}_{i} - [b_{0}+b_{1} \hat{X}_{i}] \right)\\ \hat{b}_{0} &=& \hat{M}_{Y}-\hat{b}_{1}\hat{M}_{X}, \end{eqnarray}\] where \(\hat{M}_{Y}\) and \(\hat{M}_{X}\) are the sample means. Similarly, the first-order condition for \(b_{1}\) gives \[\begin{eqnarray} 0 &=& \sum_{i=1}^{n} 2\left( \hat{Y}_{i} - [b_{0}+b_{1} \hat{X}_{i}] \right) \hat{X}_{i} \\ \hat{b}_{1} &=& \frac{\sum_{i=1}^{n}(\hat{X}_{i}-\hat{M}_{X})(\hat{Y}_{i}-\hat{M}_{Y})}{\sum_{i=1}^{n}(\hat{X}_{i}-\hat{M}_{X})^2} = \frac{\hat{C}_{XY}}{\hat{V}_{X}}, \end{eqnarray}\] the latter term being the estimated covariance between \(X\) and \(Y\) divided by the estimated variance of \(X\). This in turn yields the model predictions and residuals \[\begin{eqnarray} \hat{y}_{i} &=& \hat{b}_{0}+\hat{b}_{1}\hat{X}_{i}\\ \hat{e}_{i} &=& \hat{Y}_{i}-\hat{y}_{i}. \end{eqnarray}\]

Code
# Run a Regression
reg <- lm(y~x, dat=xy)
# predict(reg)
# resid(reg)
# coef(reg)
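
To check these closed-form solutions, here is a minimal sketch that computes the coefficients directly from the sample moments; the results should match coef(reg).

Code
# Manually Compute Coefficients from Sample Moments
b1_hat <- cov(xy$x, xy$y)/var(xy$x)       # slope: covariance over variance
b0_hat <- mean(xy$y) - b1_hat*mean(xy$x)  # intercept: from the sample means
c(b0_hat, b1_hat)
coef(reg)  # should match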

Goodness of Fit.

First, we qualitatively analyze the “goodness of fit” of our model by plotting the data alongside the model predictions.

Code
# Plot Data and Predictions
library(plotly)
xy$ID <- rownames(USArrests)
xy$pred <- predict(reg)
xy$resid <- resid(reg)
fig <- plotly::plot_ly(
  xy, x=~x, y=~y,
  mode='markers',
  type='scatter',
  hoverinfo='text',
  marker=list(color=grey(0,.25), size=10),
  text=~paste('<b>', ID, '</b>',
              '<br>Urban  :', x,
              '<br>Murder :', y,
              '<br>Predicted Murder :', round(pred,2),
              '<br>Residual :', round(resid,2)))              
# Add Legend
fig <- plotly::layout(fig,
          showlegend=F,
          title='Crime and Urbanization in America 1975',
          xaxis = list(title='Percent of People in an Urban Area'),
          yaxis = list(title='Homicide Arrests per 100,000 People'))
# Plot Model Predictions
add_trace(fig, x=~x, y=~pred,
    inherit=F, hoverinfo='none',
    mode='lines+markers', type='scatter',
    color=I('black'),
    line=list(width=1/2),
    marker=list(symbol=134, size=5))

For a quantitative summary, we can also compute the linear correlation between the model predictions and the sample data: \(\hat{R}_{yY} = \hat{C}_{yY}/[\hat{S}_{y} \hat{S}_{Y}]\). With linear models, we typically compute the “R-squared” statistic \(\hat{R}_{yY}^2\), also known as the “coefficient of determination”, using the sums of squared errors (Total, Explained, and Residual) \[\begin{eqnarray} \underbrace{\sum_{i}(\hat{Y}_{i}-\hat{M}_{Y})^2}_\text{TSS} =\underbrace{\sum_{i}(\hat{y}_i-\hat{M}_{Y})^2}_\text{ESS}+\underbrace{\sum_{i}\hat{e}_{i}^2}_\text{RSS}\\ \hat{R}_{yY}^2 = \frac{\hat{ESS}}{\hat{TSS}}=1-\frac{\hat{RSS}}{\hat{TSS}} \end{eqnarray}\]

Code
# Manually Compute R2
Ehat <- resid(reg)
RSS  <- sum(Ehat^2)
Y <- xy$y
TSS  <- sum((Y-mean(Y))^2)
R2 <- 1 - RSS/TSS
R2
## [1] 0.00484035

# Check R2
summary(reg)$r.squared
## [1] 0.00484035

# Double Check R2
R <- cor(xy$y, predict(reg))
R^2
## [1] 0.00484035

13.2 Variability Estimates

A regression coefficient is a statistic. And, just as with any other statistic, we can calculate its

  • standard deviation: variability within a single sample.
  • standard error: variability across different samples.
  • confidence interval: the range within which the statistic varies across different samples.

Note that the values reported by your computer do not necessarily satisfy these definitions. To calculate these statistics, we will estimate variability using data-driven methods. (For some theoretical background, see https://www.sagepub.com/sites/default/files/upm-binaries/21122_Chapter_21.pdf.)
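
For reference, the theoretical standard errors that your computer reports by default can be extracted as follows. This is a minimal reference sketch using summary() and vcov(); we will compare these values with the data-driven estimates below.

Code
# Default (theoretical) standard errors reported by R
summary(reg)$coefficients    # estimates, standard errors, t values, p values
sqrt(diag(vcov(reg)))        # same standard errors, via the variance-covariance matrix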

Jackknife.

We first consider the simplest method, the jackknife. In this procedure, we loop through each row of the dataset. In each iteration of the loop, we drop that observation from the dataset and re-estimate the statistic of interest. We then calculate the standard deviation of the statistic across all of these “subsamples”.

Code
# Jackknife Standard Errors for OLS Coefficient
jack_regs <- lapply(1:nrow(xy), function(i){
    xy_i <- xy[-i,]
    reg_i <- lm(y~x, dat=xy_i)
})
jack_coefs <- sapply(jack_regs, coef)['x',]
jack_se <- sd(jack_coefs)
# classic_se <- sqrt(diag(vcov(reg)))[['x']]


# Jackknife Sampling Distribution
hist(jack_coefs, breaks=25,
    main=paste0('SE est. = ', round(jack_se,4)),
    font.main=1, border=NA,
    xlab=expression(hat(b)[-i]))
# Original Estimate
abline(v=coef(reg)['x'], lwd=2)
# Jackknife Confidence Intervals
jack_ci_percentile <- quantile(jack_coefs, probs=c(.025,.975))
abline(v=jack_ci_percentile, lty=2)

Code
# Plot Normal Approximation of the Confidence Interval
# jack_mean <- mean(jack_coefs)
# jack_ci_normal <- jack_mean + c(-1.96, +1.96)*jack_se
# abline(v=jack_ci_normal, col="red", lty=3)

Bootstrap.

There are several resampling techniques. The other main one is the bootstrap, which resamples with replacement for an arbitrary number of iterations. When bootstrapping a dataset with \(n\) observations, you randomly draw \(n\) rows with replacement from your dataset and re-estimate the statistic, repeating this \(B\) times.

Code
# Bootstrap
boot_regs <- lapply(1:399, function(b){
    b_id <- sample( nrow(xy), replace=T)
    xy_b <- xy[b_id,]
    reg_b <- lm(y~x, dat=xy_b)
})
boot_coefs <- sapply(boot_regs, coef)['x',]
boot_se <- sd(boot_coefs)

hist(boot_coefs, breaks=25,
    main=paste0('SE est. = ', round(boot_se,4)),
    font.main=1, border=NA,
    xlab=expression(hat(b)[b]))
boot_ci_percentile <- quantile(boot_coefs, probs=c(.025,.975))
abline(v=boot_ci_percentile, lty=2)
abline(v=coef(reg)['x'], lwd=2)

Subsampling.

Random subsampling is one of many hybrid approaches that try to combine the best of the two core methods, the bootstrap and the jackknife, as summarized in the table below.

Method              Sample Size per Iteration    Number of Iterations    Resample
Bootstrap           \(n\)                        \(B\)                   With Replacement
Jackknife           \(n-1\)                      \(n\)                   Without Replacement
Random Subsample    \(m < n\)                    \(B\)                   Without Replacement
Code
# Random Subsamples
rs_regs <- lapply(1:399, function(b){
    b_id <- sample( nrow(xy), nrow(xy)-10, replace=F)
    xy_b <- xy[b_id,]
    reg_b <- lm(y~x, dat=xy_b)
})
rs_coefs <- sapply(rs_regs, coef)['x',]
rs_se <- sd(rs_coefs)

hist(rs_coefs, breaks=25,
    main=paste0('SE est. = ', round(rs_se,4)),
    font.main=1, border=NA,
    xlab=expression(hat(b)[b]))
abline(v=coef(reg)['x'], lwd=2)
rs_ci_percentile <- quantile(rs_coefs, probs=c(.025,.975))
abline(v=rs_ci_percentile, lty=2)

We can also bootstrap other statistics, such as \(\hat{t}\) or \(\hat{R}^2\). We do such things to test a null hypothesis, which is often “no relationship”. We are rarely interested in computing standard errors and conducting hypothesis tests for simple linear regressions, but we work through the ideas with two variables before moving on to analyze multiple variables.
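
For example, here is a minimal sketch that bootstraps \(\hat{R}^2\), reusing the resampling pattern from above (399 iterations, as before, is an arbitrary choice).

Code
# Bootstrap Distribution of R-squared
boot_R2 <- sapply(1:399, function(b){
    b_id <- sample( nrow(xy), replace=T)   # resample rows with replacement
    xy_b <- xy[b_id,]
    reg_b <- lm(y~x, data=xy_b)
    summary(reg_b)$r.squared               # statistic of interest
})
quantile(boot_R2, probs=c(.025,.975))      # percentile interval for R-squared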

13.3 Hypothesis Tests

Invert a CI.

One main way to conduct hypothesis tests is to examine whether a confidence interval contains a hypothesized value. Does the slope coefficient equal \(0\)? For reasons we won’t go into in this class, we typically normalize the coefficient by its standard error: \(\hat{t} = \frac{\hat{b}}{\hat{s}_{\hat{b}}}\), where \(\hat{s}_{\hat{b}}\) is the estimated standard error of the coefficient.
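
For example, using the jackknife percentile interval computed above, checking the null hypothesis \(b_{1}=0\) amounts to asking whether \(0\) lies inside the interval (a minimal sketch).

Code
# Invert the Jackknife CI: does it contain the hypothesized value 0?
jack_ci_percentile
(0 >= jack_ci_percentile[1]) & (0 <= jack_ci_percentile[2])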

Code
tvalue <- coef(reg)['x']/jack_se

jack_t <- sapply(jack_regs, function(reg_b){
    # Data
    xy_b <- reg_b$model
    # Coefficient
    coef_b <- coef(reg_b)[['x']]
    t_hat_b <- coef_b/jack_se
    return(t_hat_b)
})

hist(jack_t, breaks=25,
    main='Jackknife t Density',
    font.main=1, border=NA,
    xlab=expression(hat(t)[b]), 
    xlim=range(c(0, jack_t)) )
abline(v=quantile(jack_t, probs=c(.025,.975)), lty=2)
abline(v=0, col="red", lwd=2)

Impose the Null.

We can also compute a null distribution. We focus on the simplest approach: bootstrap simulations that each impose the null hypothesis and re-estimate the statistic of interest. Specifically, we compute the distribution of t-values on data whose outcomes have been randomly resampled (imposing the null of no relationship), and compare how extreme the observed value is relative to this distribution.

Code
# Null Distribution for Reg Coef
boot_t0 <- sapply( 1:399, function(b){
    xy_b <- xy
    xy_b$y <- sample( xy_b$y, replace=T)
    reg_b <- lm(y~x, dat=xy_b)
    coef_b <- coef(reg_b)[['x']]
    t_hat_b <- coef_b/jack_se
    return(t_hat_b)
})

# Null Bootstrap Distribution
boot_ci_percentile0 <- quantile(boot_t0, probs=c(.025,.975))
hist(boot_t0, breaks=25,
    main='Null Bootstrap Density',
    font.main=1, border=NA,
    xlab=expression(hat(t)[b]),
    xlim=range(boot_t0))
abline(v=boot_ci_percentile0, lty=2)
abline(v=tvalue, col="red", lwd=2)

Alternatively, you can impose the null by recentering the sampling distribution around the theoretical value: \(\hat{t} = \frac{\hat{b} - \beta }{\hat{s}_{\hat{b}}}\).1 In any case, we can calculate a p-value: the probability of seeing something as extreme as your statistic if the null hypothesis were true. We can always calculate a p-value from an explicit null distribution.

Code
# One Sided Test: P(T > t | Null) = 1 - P(T < t | Null), evaluated at each jackknife t-value
That_NullDist1 <- ecdf(boot_t0)
Phat1  <- 1-That_NullDist1(jack_t)

# Two Sided Test: P(T > |t| or T < -|t| | Null), evaluated at the observed t-value
That_NullDist2 <- ecdf(abs(boot_t0))
Phat2  <- 1-That_NullDist2( abs(tvalue) )
Phat2
## [1] 0.6516291
plot(That_NullDist2, xlim=range(boot_t0, jack_t),
    xlab=expression( abs(hat(t)[b]) ),
    main='Null Bootstrap Distribution', font.main=1)
abline(v=tvalue, col='red')
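
As an alternative, here is a minimal sketch of the recentering approach mentioned above: we shift the bootstrap distribution of the coefficient so that it is centered at the null value \(\beta = 0\), and recompute a two-sided p-value (the object names below are new here).

Code
# Recenter the bootstrap distribution around the null value (beta = 0)
boot_t_recentered <- (boot_coefs - coef(reg)['x'])/jack_se
# Two-sided p-value: how extreme is the observed t-value?
That_NullDist_rc <- ecdf( abs(boot_t_recentered) )
1 - That_NullDist_rc( abs(tvalue) )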


  1. Under some assumptions, the null distribution follows a t-distribution. (For more on parametric t-testing based on statistical theory, see https://www.econometrics-with-r.org/4-lrwor.html.)