Suppose we have some bivariate data. First, we inspect it as in Part I.
Code
# Bivariate Data from USArrests
xy <- USArrests[,c('Murder','UrbanPop')]
colnames(xy) <- c('y','x')

# Inspect Dataset
# head(xy)
# summary(xy)
plot(y~x, xy, col=grey(0,.5), pch=16)
title('Murder and Urbanization in America 1975', font.main=1)
Now we will assess the association between variables by fitting a line through the data points using a “regression”.
10.1 Simple Linear Regression
This refers to fitting a linear model to bivariate data. Specifically, our model is \[
y_i=\beta_{0}+\beta_{1} x_i+\epsilon_{i}
\] and our objective function is \[
\min_{\beta_{0}, \beta_{1}} \sum_{i=1}^{N} \left( \epsilon_{i} \right)^2 = \min_{\beta_{0}, \beta_{1}} \sum_{i=1}^{N} \left( y_i - [\beta_{0}+\beta_{1} x_i] \right)^2.
\] Minimizing the sum of squared errors yields parameter estimates \[
\hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x} \\
\hat{\beta}_{1}=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i}(x_i-\bar{x})^2} = \frac{C_{XY}}{V_{X}}
\] and predictions \[
\hat{y}_i=\hat{\beta}_{0}+\hat{\beta}_{1}x_i\\
\hat{\epsilon}_i=y_i-\hat{y}_i
\]
Code
# Run a Regression
reg <- lm(y~x, data=xy)

# Coefficients, Predictions, Residuals
# predict(reg)
# resid(reg)
# coef(reg)
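To connect the formulas above to the regression output, here is a minimal sketch (not part of the original code) that computes the slope and intercept directly from the sample covariance and variance and compares them with coef(reg):

Code
# OLS estimates by hand (sketch)
b1_hat <- cov(xy$x, xy$y)/var(xy$x)       # slope = C_XY/V_X
b0_hat <- mean(xy$y) - b1_hat*mean(xy$x)  # intercept = ybar - slope*xbar
c(b0_hat, b1_hat)
# coef(reg) # should match, up to rounding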
Goodness of Fit.
First, we analyze the 'goodness of fit' of our model qualitatively by plotting our predictions alongside the data.
Code
# Plot Data and Predictions
library(plotly)

xy$ID <- rownames(USArrests)
xy$pred <- predict(reg)
xy$resid <- resid(reg)

fig <- plotly::plot_ly(
    xy, x=~x, y=~y,
    mode='markers', type='scatter',
    hoverinfo='text',
    marker=list(color=grey(0,.25), size=10),
    text=~paste('<b>', ID, '</b>',
        '<br>Urban :', x,
        '<br>Murder :', y,
        '<br>Predicted Murder :', round(pred,2),
        '<br>Residual :', round(resid,2)))

# Add Legend
fig <- plotly::layout(fig,
    showlegend=F,
    title='Crime and Urbanization in America 1975',
    xaxis = list(title='Percent of People in an Urban Area'),
    yaxis = list(title='Homicide Arrests per 100,000 People'))

# Plot Model Predictions
add_trace(fig, x=~x, y=~pred,
    inherit=F, hoverinfo='none',
    mode='lines+markers', type='scatter',
    color=I('black'),
    line=list(width=1/2),
    marker=list(symbol=134, size=5))
For a quantitative summary, we can also compute the linear correlation between the predictions and the data \[
R = Cor( \hat{y}_i, y_i)
\] With linear models, we typically compute \(R^2\), known as the “coefficient of determination”, using the sums of squared errors (Total, Explained, and Residual) \[
\underbrace{\sum_{i}(y_i-\bar{y})^2}_\text{TSS}=\underbrace{\sum_{i}(\hat{y}_i-\bar{y})^2}_\text{ESS}+\underbrace{\sum_{i}\hat{\epsilon}_{i}^{2}}_\text{RSS}\\
R^2 = \frac{ESS}{TSS}=1-\frac{RSS}{TSS}
\]
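To verify these formulas numerically, here is a small sketch (assuming the reg object estimated above) that computes \(R^2\) from the sums of squares and compares it with the squared correlation between predictions and data:

Code
# R^2 by hand (sketch)
TSS <- sum( (xy$y - mean(xy$y))^2 )          # total sum of squares
ESS <- sum( (predict(reg) - mean(xy$y))^2 )  # explained sum of squares
RSS <- sum( resid(reg)^2 )                   # residual sum of squares
c(ESS/TSS, 1 - RSS/TSS, cor(predict(reg), xy$y)^2)
# summary(reg)$r.squared # should match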
A regression coefficient is a statistic. And, just like all statistics, we can calculate
standard deviation: variability within a single sample.
standard error: variability across different samples.
confidence interval: the range within which your statistic varies across different samples.
Note that the values reported by your computer do not necessarily satisfy these definitions. To calculate these statistics, we will estimate variability using data-driven methods. (For some theoretical background, see, e.g., https://www.sagepub.com/sites/default/files/upm-binaries/21122_Chapter_21.pdf.)
Jackknife.
We first consider the simplest such method, the jackknife. In this procedure, we loop through each row of the dataset. In each iteration of the loop, we drop that observation from the dataset and re-estimate the statistic of interest. We then calculate the standard deviation of the statistic across all 'subsamples'.
Code
# Jackknife Standard Errors for OLS Coefficient
jack_regs <- lapply(1:nrow(xy), function(i){
    xy_i <- xy[-i,]
    reg_i <- lm(y~x, data=xy_i)
})
jack_coefs <- sapply(jack_regs, coef)['x',]
jack_se <- sd(jack_coefs)
# classic_se <- sqrt(diag(vcov(reg)))[['x']]

# Jackknife Sampling Distribution
hist(jack_coefs, breaks=25,
    main=paste0('SE est. = ', round(jack_se,4)),
    font.main=1, border=NA,
    xlab=expression(beta[-i]))
# Original Estimate
abline(v=coef(reg)['x'], lwd=2)

# Jackknife Confidence Intervals
jack_ci_percentile <- quantile(jack_coefs, probs=c(.025,.975))
abline(v=jack_ci_percentile, lty=2)
There are several resampling techniques. The other main one is the bootstrap, which resamples with replacement for an arbitrary number of iterations. When bootstrapping a dataset with \(n\) observations, you randomly resample all \(n\) rows in your data set \(B\) times, as sketched below. Random subsampling is one of many hybrid approaches that try to combine the best of both worlds.
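Here is a minimal bootstrap sketch that parallels the jackknife code above; the number of iterations (\(B=399\)) and the set.seed call are arbitrary choices, not the original code:

Code
# Bootstrap Standard Errors for OLS Coefficient (sketch)
set.seed(1)
boot_regs <- lapply(1:399, function(b){
    b_id <- sample(nrow(xy), replace=TRUE) # resample rows with replacement
    lm(y~x, data=xy[b_id,])
})
boot_coefs <- sapply(boot_regs, coef)['x',]
boot_se <- sd(boot_coefs)

# Bootstrap Sampling Distribution
hist(boot_coefs, breaks=25,
    main=paste0('SE est. = ', round(boot_se,4)),
    font.main=1, border=NA,
    xlab=expression(beta[b]))
# Original Estimate
abline(v=coef(reg)['x'], lwd=2)
# Bootstrap Confidence Intervals
boot_ci_percentile <- quantile(boot_coefs, probs=c(.025,.975))
abline(v=boot_ci_percentile, lty=2)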
We can also bootstrap other statistics, such as a t-statistic or \(R^2\) (see the sketch below). We do such things to test a null hypothesis, which is often 'no relationship'. We are rarely interested in computing standard errors and conducting hypothesis tests for only two variables, but we work through the ideas in the two-variable case to better understand the multi-variable case.
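For example, the same resampling loop can be reused for \(R^2\); this sketch assumes the boot_regs list created in the previous chunk:

Code
# Bootstrap Distribution of R^2 (sketch, reusing boot_regs)
boot_r2 <- sapply(boot_regs, function(reg_b){ summary(reg_b)$r.squared })
hist(boot_r2, breaks=25, border=NA, font.main=1,
    main='Bootstrap Distribution', xlab=expression(R^2))
# Original Estimate
abline(v=summary(reg)$r.squared, lwd=2)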
10.3 Hypothesis Tests
Invert a CI.
One main way to conduct hypothesis tests is to examine whether a confidence interval contains a hypothesized value. Does the slope coefficient equal \(0\)? For reasons we won’t go into in this class, we typically normalize the coefficient by its standard error: \[ \hat{t} = \frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}} \]
We can also compute a null distribution directly. We focus on the simplest approach: bootstrap simulations that each impose the null hypothesis and re-estimate the statistic of interest. Specifically, we compute the distribution of t-values on data with randomly reshuffled outcomes (imposing the null) and compare how extreme the observed value is, as sketched below.
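Here is a sketch of that procedure. It also defines the objects (tvalue, jack_t, boot_t0) used in the chunk below; the variable names, the 399 reshuffles, and the set.seed call are assumptions rather than the original code:

Code
# Observed t-values (sketch)
tvalue <- summary(reg)$coefficients['x','t value'] # classical t
jack_t <- coef(reg)['x']/jack_se                   # t using the jackknife SE

# Null Distribution: t-values from bootstraps with reshuffled outcomes
set.seed(2)
boot_t0 <- sapply(1:399, function(b){
    xy_b <- xy
    xy_b$y <- sample(xy_b$y, replace=TRUE) # break any x-y association
    reg_b <- lm(y~x, data=xy_b)
    summary(reg_b)$coefficients['x','t value']
})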
Alternatively, you can impose the null by recentering the sampling distribution around the theoretical value; \[\hat{t} = \frac{\hat{\beta} - \beta_{0} }{\hat{\sigma}_{\hat{\beta}}}.\] Under some assumptions, the null distribution follows a t-distribution. (For more on parametric t-testing based on statistical theory, see https://www.econometrics-with-r.org/4-lrwor.html.)
In any case, we can calculate a p-value: the probability of seeing something at least as extreme as your statistic if the null hypothesis were true. We can always calculate a p-value from an explicit null distribution.
Code
# One Sided Test for P(t > jack_t | Null) = 1 - P(t < jack_t | Null)
That_NullDist1 <- ecdf(boot_t0)
Phat1 <- 1 - That_NullDist1(jack_t)

# Two Sided Test for P(t > jack_t or t < -jack_t | Null)
That_NullDist2 <- ecdf(abs(boot_t0))
Phat2 <- 1 - That_NullDist2(abs(jack_t))

plot(That_NullDist2, xlim=range(boot_t0, jack_t),
    xlab=expression( abs(hat(t)[b]) ),
    main='Null Bootstrap Distribution', font.main=1)
abline(v=tvalue, col='red')
It is generally safe to assume that the data you are analyzing may contain nonlinear relationships. Here, our model can be represented as \[\begin{eqnarray}
y_{i} = m(x_{i}) + e_{i},
\end{eqnarray}\] with \(m\) being some unknown but smooth function. In such cases, linear regressions can still be useful.
Piecewise Regression.
The simplest case is segmented/piecewise regression, which runs a separate regression for different subsets of the data.
Code
# Globally Linear
reg <- lm(y~x, data=xy)

# Diagnose Fit
# plot( fitted(reg), resid(reg), pch=16, col=grey(0,.5))
# plot( xy$x, resid(reg), pch=16, col=grey(0,.5))

# Linear in 2 Pieces (subsets)
xcut2 <- cut(xy$x, 2)
xy_list2 <- split(xy, xcut2)
regs2 <- lapply(xy_list2, function(xy_s){
    lm(y~x, data=xy_s)
})
sapply(regs2, coef)
##             (31.9,61.5] (61.5,91.1]
## (Intercept)  -0.2836303  4.15337509
## x             0.1628157  0.04760783

# Linear in 3 Pieces (subsets or bins)
xcut3 <- cut(xy$x, seq(32,92,by=20)) # Finer Bins
xy_list3 <- split(xy, xcut3)
regs3 <- lapply(xy_list3, function(xy_s){
    lm(y~x, data=xy_s)
})
sapply(regs3, coef)
##                (32,52]    (52,72]      (72,92]
## (Intercept) 4.60313390 2.36291848  8.653829140
## x           0.08233618 0.08132841 -0.007174454
A more flexible approach is local linear regression, which conducts a separate linear regression for each data point, using only a subsample of the data around it.
Code
# 'Naive' Smoother
pred_fun <- function(x0, h, xy){
    # Assign equal weight to observations within h distance to x0
    # 0 weight for all other observations
    ki <- dunif(xy$x, x0-h, x0+h)
    llls <- lm(y~x, data=xy, weights=ki)
    yhat_i <- predict(llls, newdata=data.frame(x=x0))
}

X0 <- sort(unique(xy$x))
pred_lo1 <- sapply(X0, pred_fun, h=2, xy=xy)
pred_lo2 <- sapply(X0, pred_fun, h=20, xy=xy)

plot(y~x, pch=16, data=xy, col=grey(0,.5),
    ylab='Murder Rate', xlab='Population Density')
cols <- c(rgb(.8,0,0,.5), rgb(0,0,.8,.5))
lines(X0, pred_lo1, col=cols[1], lwd=1, type='o')
lines(X0, pred_lo2, col=cols[2], lwd=1, type='o')
legend('topleft', title='Locally Linear',
    legend=c('h=2 ', 'h=20'),
    lty=1, col=cols, cex=.8)
Note that there are more complex versions of local linear regressions (see https://shinyserv.es/shiny/kreg/ for a nice illustration). An even more complex (and more powerful) version is loess, which uses adaptive bandwidths in order to have a similar number of data points in each subsample (especially useful when \(X\) is not uniformly distributed).
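As a sketch of how this looks in base R, you can fit a loess curve to the same data; the span and degree values here are arbitrary choices:

Code
# Loess: local regression with adaptive bandwidth (sketch)
reg_lo <- loess(y~x, data=xy, span=.75, degree=1)
X0 <- sort(unique(xy$x))
pred_lo <- predict(reg_lo, newdata=data.frame(x=X0))

plot(y~x, pch=16, data=xy, col=grey(0,.5),
    ylab='Murder Rate', xlab='Population Density')
lines(X0, pred_lo, col=rgb(0,.5,0,.5), lwd=2)
legend('topleft', title='Loess',
    legend='span=.75', lty=1, col=rgb(0,.5,0,.5), cex=.8)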