15  Inference


Often, we are interested in gradients: how \(Y\) changes with \(X\). The linear model depicts this as a single constant, \(\hat{b}_{1}\), whereas other models allow the gradient to vary with \(X\). A good first step for assessing gradients is to plot the predicted values against the explanatory values. A good second step is to compute the gradients themselves and summarize them. In either case, we compute confidence intervals to account for variability across samples.
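As a minimal sketch of this contrast, consider R's built-in `cars` data (an assumption here, purely for illustration): a linear fit implies one constant gradient, while even a simple quadratic fit implies a gradient that changes with \(X\).

```r
## Constant gradient: one slope coefficient for all values of speed
reg_lin <- lm(dist ~ speed, data=cars)
coef(reg_lin)[2]

## Varying gradient: d yhat/dx = b1 + 2*b2*x changes with speed
reg_quad <- lm(dist ~ speed + I(speed^2), data=cars)
b <- coef(reg_quad)
b[2] + 2*b[3]*range(cars$speed)  # gradient at the lowest and highest speeds
```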

15.1 Applied Methods

LOESS

An important extension of locally linear regression is called loess, which uses adaptive bandwidths in order to have a similar number of data points around each design point. This is especially useful when \(X_{i}\) is not uniformly distributed.

Code
## Data
library(Ecdat)
dat <- Wages1[order(Wages1[, 'school']), c('wage', 'school')]

# Adaptive-width subsamples with non-uniform weights
plot(wage ~ school, pch=16, col=grey(0, .1), data=dat,
    main=NA, xlab='School', ylab='Wage')

reg_lo4 <- loess(wage ~ school, data=dat, span=.4)
reg_lo8 <- loess(wage ~ school, data=dat, span=.8)

cols <- hcl.colors(3, alpha=.75)[-3]
lines(dat[, 'school'], predict(reg_lo4),
    col=cols[1], type='o', pch=2)
lines(dat[, 'school'], predict(reg_lo8),
    col=cols[2], type='o', pch=2)

legend('topleft', title='Loess',
    legend=c('span=.4', 'span=.8'),
    lty=1, col=cols, cex=.8)

Confidence Bands.

We can construct confidence bands using the jackknife or bootstrap.

Code
# Loess
reg_lo <- loess(wage ~ school, data=dat, span=.6)

# Predictions at design points
pred_design <- data.frame(school=unique(dat[, 'school']))
preds_lo <- predict(reg_lo, newdata=pred_design)

# Plot
lo_col <- hcl.colors(3, alpha=.5)[2]
plot(wage ~ school, pch=16, col=grey(0, .1), data=dat,
    main=NA, xlab='School', ylab='Wage')
lines(pred_design[, 'school'], preds_lo,
    col=lo_col,
    type='o', pch=2)

# Boot CI
boot_lo <- matrix(NA, nrow=nrow(pred_design), ncol=399)
for (b in 1:399) {
    # xy_i <- xy[-i, ] #jackknife for i in 1:nrow(xy)
    xy_b <- dat[sample(nrow(dat), replace=T), ]
    reg_b <- loess(wage ~ school, data=xy_b, span=.6)
    boot_lo[, b] <- predict(reg_b, newdata=pred_design)
}
boot_cb <- apply(boot_lo, 1, quantile,
    probs=c(.025, .975), na.rm=T)

# Plot CI
polygon(
    c(pred_design[[1]], rev(pred_design[[1]])),
    c(boot_cb[1, ], rev(boot_cb[2, ])),
    col=lo_col,
    border=NA)

Construct a bootstrap confidence band for the following loess regression.

Code
# Adaptive-width subsamples with non-uniform weights
xy <- USArrests[, c('UrbanPop', 'Murder')]
xy0 <- xy[order(xy[, 'UrbanPop']), ]
names(xy0) <- c('x', 'y')

plot(y ~ x, pch=16, col=grey(0, .5), data=xy0,
    main=NA, xlab='Urban Population', ylab='Murder Arrests')
reg_lo2 <- loess(y ~ x, data=xy0, span=.6)

red_col <- rgb(1, 0, 0, .5)
lines(xy0[, 'x'], predict(reg_lo2),
    col=red_col, type='o', pch=2)

Gradients.

There are three ways to summarize gradients:

  1. For all methods, including regressograms, you can approximate gradients with small finite differences. For some small difference \(d\), we can manually compute \[\begin{eqnarray} \hat{b}_{1}(x) &=& \frac{ \hat{y}(x+\frac{d}{2}) - \hat{y}(x-\frac{d}{2})}{d}, \end{eqnarray}\]

  2. When using split-sample regressions or local linear regressions, you can use the estimated slope coefficients \(\hat{b}_{1}(x)\) as a gradient estimate.1

  3. More sophisticated methods, which are beyond the scope of this class.
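The first method can be sketched as a small helper that evaluates a fitted curve at \(x \pm d/2\) and divides by \(d\). It is illustrated here on R's built-in `cars` data (an assumption, to keep the example self-contained), but the same idea applies to any fit with a `predict` method, such as the `reg_lo4` and `reg_lo8` fits above.

```r
## Central finite-difference gradient of a fitted curve at x
fd_grad <- function(fit, x, d=.5){
    y_hi <- predict(fit, newdata=data.frame(speed=x + d/2))
    y_lo <- predict(fit, newdata=data.frame(speed=x - d/2))
    (y_hi - y_lo)/d
}
reg_cars <- loess(dist ~ speed, data=cars, span=.8)
fd_grad(reg_cars, x=15)  # estimated d yhat/dx at speed=15
```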

After computing gradients, you can summarize them in various plots: histograms and scatterplots. The confidence band only shows variability across samples, whereas these plots show variability within a sample. You can also plot all gradients with their CIs (Chaudhuri1999?; HendersonEtAl2012?).

Code
## Gradients (finite differences at the data points)
pred_lo <- predict(reg_lo)
grad_dx <- diff(dat[, 'school'])  # zero for tied values of school (gives NaN below)
grad_dy <- diff(pred_lo)
grad_lo <- grad_dy/grad_dx

## Visual Summary
par(mfrow=c(1, 2))
hist(grad_lo,  breaks=20,
    border=NA, freq=F,
    col=lo_col,
    xlab=expression(d~hat(y)/dx),
    main=NA) ## Distributional Summary
  
## Visual Summary 2
grad_x <- dat[, 'school'][-1]       # right endpoint of each difference
plot(grad_x - grad_dx/2, grad_lo,   # gradients plotted at interval midpoints
    xlab='x', ylab=expression(d~hat(y)/dx),
    col=lo_col, pch=16, main=NA) ## Diminishing Returns?

You may also be interested in a particular gradient or a single summary statistic. For example, a bivariate regressogram can estimate the “marginal effect at the mean”: \(\hat{b}_{1}( x=\hat{M}_{X} )\). More typically, you would be interested in the “mean of the gradients”, sometimes simply called the “average effect”, which averages the gradients over all datapoints in the dataset: \(\frac{1}{n} \sum_{i}^{n} \hat{b}_{1}(x=\hat{X}_{i})\). Alternatively, you may be interested in the median of the gradients, or in measures of “effect heterogeneity” such as the interquartile range or standard deviation of the gradients. Such statistics are single numbers that can be presented in tabular form: “mean gradient (sd gradient)”. You can also report standard errors: “mean gradient (estimated se), sd gradient (estimated se)”.

Code
## Tabular Summary
tab_stats <- c(
    Mean=mean(grad_lo, na.rm=T),
    SD=sd(grad_lo, na.rm=T))

## Use bootstrap to approximate sampling dist
boot_stats <- matrix(NA, nrow=299, ncol=2)
colnames(boot_stats) <- c('Mean SE', 'SD SE')
for(b in 1:nrow(boot_stats)){
    xy_b <- dat[sample(1:nrow(dat), replace=T), ]
    reg_b <- loess(wage ~ school, data=xy_b, span=.6)  # avoid overwriting reg_lo
    pred_b <- predict(reg_b)
    grad_b <- diff(pred_b)/diff(xy_b[, 'school'])
    dydx_mean <- mean(grad_b, na.rm=T)
    dydx_sd <- sd(grad_b, na.rm=T)
    boot_stats[b, 1] <- dydx_mean
    boot_stats[b, 2] <- dydx_sd
}
## SEs and CIs
boot_se <- apply(boot_stats, 2, sd)
boot_quants <- apply(boot_stats, 2, quantile, probs=c(0.025, 0.975))
boot_quants <- apply( round(boot_quants, 3), 2, paste0, collapse=', ')

## Printing
tab_regstyle <- data.frame(
  Estimate  = round(tab_stats, 3),
  SE = paste0('(', round(boot_se, 3), ')'),
  CI_95 = paste0('[', boot_quants, ']')
)
tab_regstyle
##      Estimate      SE          CI_95
## Mean    0.133 (0.044) [0.487, 0.655]
## SD      0.647 (0.065) [0.166, 0.433]

Hypothesis Testing.

We can test whether the mean gradient is statistically different from zero, computing the p-value directly from the bootstrap distribution. Under \(H_0\), the mean gradient equals zero, so we center the bootstrap distribution at zero and ask how often it produces values as extreme as the observed mean gradient.

Code
## P-value via null bootstrap ECDF
## Center bootstrap means at zero (impose H0)
boot_centered <- boot_stats[, 'Mean SE'] - mean(boot_stats[, 'Mean SE'])

## ECDF of centered bootstrap distribution
boot_ecdf <- ecdf(boot_centered)

## Two-sided p-value: P(|centered mean| >= |observed mean|)
p_boot <- 1 - boot_ecdf(abs(tab_stats['Mean'])) +
              boot_ecdf(-abs(tab_stats['Mean']))

cat('Bootstrap p-value (ECDF):', format.pval(p_boot, digits=3), '\n')
## Bootstrap p-value (ECDF): 0.00669

The small p-value indicates that the mean gradient is statistically distinguishable from zero: on average, wages increase with schooling.

We can also use a normal approximation. Under the null hypothesis \(H_0\): the mean gradient equals zero, we use the test statistic \[t = \frac{\text{mean gradient}}{\text{SE(mean gradient)}}\] and the p-value comes from the standard normal distribution.

Code
## P-value for mean gradient
## Standard Normal Approximation
t_stat <- tab_stats['Mean'] / boot_se['Mean SE']
p_val  <- 2 * pnorm(-abs(t_stat))

cat('Mean gradient:', round(tab_stats['Mean'], 3), '\n')
## Mean gradient: 0.133
cat('Bootstrap SE: ', round(boot_se['Mean SE'], 3), '\n')
## Bootstrap SE:  0.044
cat('t-statistic:  ', round(t_stat, 3), '\n')
## t-statistic:   3.036
cat('p-value:      ', format.pval(p_val, digits=3), '\n')
## p-value:       0.0024

Compute the t-statistic and p-value using jackknife standard errors instead of the bootstrap.

Code
## Jackknife SE for mean gradient
n <- nrow(dat)
jack_means <- numeric(n)
for (i in 1:n) {
    dat_i  <- dat[-i, ]
    reg_i  <- loess(wage ~ school, data=dat_i, span=.6)
    pred_i <- predict(reg_i)
    grad_i <- diff(pred_i) / diff(dat_i[, 'school'])
    jack_means[i] <- mean(grad_i, na.rm=T)
}

## Jackknife SE
jack_se <- sqrt((n - 1) / n * sum((jack_means - mean(jack_means))^2))

## t-statistic and p-value
t_jack <- tab_stats['Mean'] / jack_se
p_jack <- 2 * pnorm(-abs(t_jack))

cat('Jackknife SE: ', round(jack_se, 3), '\n')
## Jackknife SE:  0.12
cat('t-statistic:  ', round(t_jack, 3), '\n')
## t-statistic:   1.114
cat('p-value:      ', format.pval(p_jack, digits=3), '\n')
## p-value:       0.265

15.2 Theory

To better understand different models, we use simulations where the true data-generating process is known.

Sampling Distributions.

Bootstrap confidence bands are approximations: they estimate how much predicted values vary from sample to sample. The simulation below compares a bootstrap CI constructed from one sample to the true sampling variation observed across many independent samples.

Code
## Ages
Xmx <- 70
Xmn <- 15

## Generate Sample Data
dat_sim <- function(n=1000){
    X <- seq(Xmn, Xmx, length.out=n)
    ## Random Productivity
    e <- runif(n, 0, 1E6)
    beta <-  1E-10*exp(1.4*X -.015*X^2)
    Y    <-  (beta*X + e)/10
    return(data.frame(Y, X))
}
dat0 <- dat_sim(1000)
dat0 <- dat0[order(dat0[, 'X']), ]

## Data from one sample
plot(Y ~ X, data=dat0, pch=16, col=grey(0, .05),
    main=NA, ylab='Yearly Productivity ($)', xlab='Age' )
reg_lo <- loess(Y ~ X, data=dat0, span=.8)

## Plot Bootstrap CI for Single Sample
pred_design <- data.frame(X=seq(Xmn, Xmx))
preds_lo <- predict(reg_lo, newdata=pred_design)
boot_lo <- matrix(NA, nrow=nrow(pred_design), ncol=399)
for (b in 1:399) {
    dat0_i <- dat0[sample(nrow(dat0), replace=T), ]
    reg_i <- loess(Y ~ X, data=dat0_i, span=.8)
    boot_lo[, b] <- predict(reg_i, newdata=pred_design)
}
boot_cb <- apply(boot_lo, 1, quantile,
    probs=c(.025, .975), na.rm=T)
polygon(
    c(pred_design[[1]], rev(pred_design[[1]])),
    c(boot_cb[1, ], rev(boot_cb[2, ])),
    col=hcl.colors(3, alpha=.25)[2],
    border=NA)

# Construct CI across Multiple Samples
sample_lo <- matrix(NA, nrow=nrow(pred_design), ncol=399)
for (b in 1:399) {
    xy_b <- dat_sim(1000) #Entirely new sample
    reg_b <- loess(Y ~ X, data=xy_b, span=.8)
    sample_lo[, b] <- predict(reg_b, newdata=pred_design)
}
ci_lo <- apply(sample_lo, 1, quantile,
    probs=c(.025, .975), na.rm=T)
polygon(
    c(pred_design[[1]], rev(pred_design[[1]])),
    c(ci_lo[1, ], rev(ci_lo[2, ])),
    col=grey(0, alpha=.25),
    border=NA)

Bias.

The plot below shows the average prediction across \(B\) samples against the true mean. Notice that LOESS tracks the true mean closely, whereas a linear model does not.

Code
set.seed(123)

## Shared parameters used by Bias and Variance
B        <- 300
n_bias   <- 200
Nseq     <- seq(50, 500, by=50)
span_lo  <- 0.5
true_m   <- function(x) (1E-10*exp(1.4*x - .015*x^2)*x + 5E5) / 10
x0       <- seq(Xmn, Xmx, length.out=120)
m0       <- true_m(x0)
pred_design <- data.frame(X=x0)

## Storage for Bias predictions (filled inside the loop when n == n_bias)
pred_lm_bias <- matrix(NA_real_, nrow=length(x0), ncol=B)
pred_lo_bias <- matrix(NA_real_, nrow=length(x0), ncol=B)

## One sweep: gradient SEs across all n, plus bias predictions at n_bias
SE <- matrix(NA, nrow=2, ncol=length(Nseq))
for (ni in seq_along(Nseq)) {
    n <- Nseq[ni]
    stats <- matrix(NA, nrow=2, ncol=B)
    for (b in 1:B) {
        dat_b  <- dat_sim(n)
        fit_lm <- lm(Y ~ X, data=dat_b)
        fit_lo <- loess(Y ~ X, data=dat_b, span=0.8)
        grad_lo <- diff(predict(fit_lo))/diff(dat_b[, 'X'])
        if(n == n_bias){
            pred_lm_bias[, b] <- predict(fit_lm, newdata=pred_design)
            fit_lo_b <- loess(Y ~ X, data=dat_b, span=span_lo)
            pred_lo_bias[, b] <- predict(fit_lo_b, newdata=pred_design)
        }
        stats[, b] <- c(coef(fit_lm)[2], mean(grad_lo, na.rm=T))
    }
    SE[, ni] <- apply(stats, 1, sd)
}
Code
m_lm <- rowMeans(pred_lm_bias, na.rm=TRUE)
m_lo <- rowMeans(pred_lo_bias, na.rm=TRUE)

y_rng <- range(c(m0, m_lm, m_lo), na.rm=TRUE)
plot(x0, m0, type='l', lwd=2, col=rgb(0, 0, 0, .8), ylim=y_rng,
  xlab='Age', ylab=' Mean Productivity ($)', main=NA)
lines(x0[is.finite(m_lm)], m_lm[is.finite(m_lm)], lwd=2, col=rgb(1, 0, 0, .8))
lines(x0[is.finite(m_lo)], m_lo[is.finite(m_lo)], lwd=2, col=rgb(0, 0, 1, .8))
legend('topright', lty=1, lwd=2, col=c(rgb(0, 0, 0, .8), rgb(1, 0, 0, .8), rgb(0, 0, 1, .8)),
  legend=c('True mean',
    'Linear',
    paste0('Loess(', span_lo, ')')
))

Variance.

The model estimates also vary from sample to sample. Notice that the OLS slope coefficient generally varies less than the LOESS mean gradient, even though it is biased.

Also notice that there are diminishing returns to larger sample sizes: both the OLS slope coefficient and the loess mean gradient vary less from sample to sample as \(n\) grows, making hypothesis tests more powerful.

Code
cols <- c(rgb(1, 0, 0, .8), rgb(0, 0, 1, .8))
matplot(Nseq, t(SE), type='b', pch=16,
    lty=1, lwd=2, col=cols,
    ylab='standard error', xlab='sample size',
    main=NA)
legend('topright',
    lty=1, lwd=2, col=cols,
    legend=c('OLS slope', 'Loess mean gradient'))

Bias-Variance Tradeoff.

Models face a bias-variance tradeoff.

  • High-bias / low-variance models. E.g., strict linear models are stable across samples but can miss curvature.
  • Low-bias / high-variance models. E.g., very flexible local linear models can capture curvature but may overfit noise.
Code
# Bias functions
mean_pred <- function(P) rowMeans(P, na.rm=TRUE)
avg_bias2 <- function(P) mean( (mean_pred(P) - m0)^2, na.rm=TRUE)

# Variance
var_pred <- function(P) apply(P, 1, var, na.rm=TRUE)
avg_var   <- function(P) mean(var_pred(P), na.rm=TRUE)

metrics <- rbind(
  c(avg_bias2(pred_lm_bias),  avg_var(pred_lm_bias)),
  c(avg_bias2(pred_lo_bias),  avg_var(pred_lo_bias))
)
colnames(metrics) <- c('Avg Bias^2', 'Avg Variance')
rownames(metrics) <- c('Linear', 'Loess')
round(metrics, 4)
##        Avg Bias^2 Avg Variance
## Linear  574480726      7770453
## Loess     5171493     25966091

Consistency.

The bias-variance tradeoff raises a question: can both be reduced simultaneously as \(n\) grows? For local regression, the answer is yes — by shrinking the bandwidth with \(n\). The neighborhood shrinks (reducing bias) while the local sample size still grows (reducing variance). This is often written as bandwidth conditions: \[\begin{eqnarray} h_n \to 0 \quad\text{and}\quad n h_n \to \infty. \end{eqnarray}\] In LOESS language, this corresponds to the span shrinking with \(n\), but not too fast.

The simulation below illustrates this at one target point \(x_0\): absolute error in estimating \(m(x_0)\) tends to decrease with larger \(n\).

Code
set.seed(42)

x0_target <- 40  # mid-career age
n_grid <- c(60, 120, 240, 480)
R <- 120

avg_abs_err <- numeric(length(n_grid))
for (ni in seq_along(n_grid)) {
  n <- n_grid[ni]
  span_n <- min(0.9, 1.8*n^(-1/4)) # shrinks with n
  errs <- replicate(R, {
    dat_b <- dat_sim(n)
    fit <- loess(Y ~ X, data=dat_b, span=span_n, degree=1)
    mhat <- predict(fit, newdata=data.frame(X=x0_target))
    abs(mhat - true_m(x0_target))
  })
  avg_abs_err[ni] <- mean(errs, na.rm=TRUE)
}

plot(n_grid, avg_abs_err, type='b', pch=16,
     xlab='Sample size (n)',
     ylab='Average absolute error',
     main=NA)

A similar result holds for OLS when the true relationship is linear. Even with unlimited data, however, a misspecified model cannot recover the true conditional mean.
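A quick check of that claim, using a simulated linear DGP (an assumption, chosen for illustration): the OLS slope error shrinks as \(n\) grows when the truth is linear, but when a linear model is fit to a curved mean, the fit error does not vanish even at large \(n\).

```r
set.seed(1)

## Truth linear: OLS slope error shrinks with n
err_lin <- sapply(c(100, 10000), function(n){
    x <- runif(n, 0, 10)
    y <- 2 + 3*x + rnorm(n)
    abs(coef(lm(y ~ x))[2] - 3)       # absolute slope error
})

## Truth curved: misspecified linear fit keeps a bias
err_mis <- sapply(c(100, 10000), function(n){
    x <- runif(n, 0, 10)
    y <- x^2 + rnorm(n)
    mean(abs(fitted(lm(y ~ x)) - x^2)) # in-sample fit error vs. true mean
})

err_lin  # decreases with n
err_mis  # stays roughly constant: misspecification bias remains
```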

15.3 Exercises

  1. A linear regression has low variance but can have high bias when the true relationship is nonlinear. A loess regression with a small span has low bias but high variance. Explain why increasing the sample size \(n\) helps reduce the standard error for both models, but only helps reduce bias for loess (with an appropriate bandwidth rule).

  2. Using the USArrests dataset, fit a loess regression of Murder on UrbanPop with span = 0.6. Compute the finite-difference gradients \(\hat{b}_{1}(x)\) and report their mean and standard deviation. Compare these to the OLS slope \(\hat{b}_{1}\) from a simple linear regression.

  3. Using the USArrests dataset and the loess fit from the previous question, write R code to construct a bootstrap confidence band with \(B = 399\) replicates. Plot the scatterplot, the loess curve, and the shaded 95% confidence band.

Further Reading.


  1. One benefit of LLLS is that it is theoretically motivated: assuming that \(Y_{i}=m(X_{i}) + \epsilon_{i}\), we can then take a Taylor approximation: \(m(X_{i}) + \epsilon_{i} \approx m(x) + m'(x)[X_{i}-x] + \epsilon_{i} = [m(x) - m'(x)x ] + m'(x)X_{i} + \epsilon_{i} = b_{0}(x,h) + b_{1}(x,h) X_{i}\).↩︎
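The footnote's Taylor motivation can be sketched as a kernel-weighted regression at a point \(x_0\): the fitted slope is the local gradient estimate \(\hat{b}_{1}(x_0, h)\). The Gaussian kernel and bandwidth here are illustrative assumptions (loess itself uses a tricube kernel with an adaptive span), again on the built-in `cars` data.

```r
## Local linear least squares at a point x0: weighted lm
llls <- function(x0, x, y, h){
    w <- dnorm((x - x0)/h)   # kernel weights, largest near x0
    coef(lm(y ~ x, weights=w))  # returns b0(x0,h) and b1(x0,h)
}
llls(x0=15, x=cars$speed, y=cars$dist, h=2)  # slope estimates m'(15)
```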