Scatterplots are a great and simplest plot for bivariate data that simply plots each observation. There are many extensions and similar tools. The example below helps understand how both the central tendency and dispersion change.
Code
# Local relationship: wages and educationlibrary(wooldridge)## Plot 1plot(wage~educ, data=wage1, pch=16, col=grey(0,.1))educ_means <-aggregate(wage~educ, data=wage1, mean)points(wage~educ, data=educ_means, pch=17, col='blue', type='b')title("Grouped Means and Scatterplot", font.main=1)
Code
## Plot 2 (Alternative for big datasets)# boxplot(wage~educ, data=wage1, # pch=16, col=grey(0,.1), varwidth=T)#title("Boxplots", font.main=1)## Plot 3 (Less informative!)#barplot(wage~educ, data=educ_means)#title("Bar Plot of Grouped Means", font.main=1)
Tip
Examine the local relationship between ‘Murder’ and ‘Urbanization’ in the USArrests dataset. Use custom bins.
To move beyond descriptive statistics, we will use models. We previously covered linear models, but you could be analyzing data with nonlinear relationships. So now we consider a general relationship: \[\begin{eqnarray}
\hat{Y}_{i} = m(\hat{X}_{i}) + e_{i},
\end{eqnarray}\] where \(m\) is an unknown function and \(e_{i}\) is white noise, which we estimate via different models. After minimizing the sum of squared errors in a the regressions below, our estimated models \(\hat{m}\) will also predict local averages.
14.2 Regressograms
Just as we use histograms to describe the distribution of a random variable, we can use a regressogram for conditional relationships. Specifically, we can use dummies for exclusive intervals or “bins” to estimate the average value of \(\hat{Y}_{i}\) for data inside the bin.
Code
# Dataplot(wage~educ, data=wage1, pch=16, col=grey(0,.1))dat <- wage1[order(wage1[,'educ']), c('wage','educ')]## Simple Regressionreg <-lm(wage~educ, data=dat) ## OLS# Regressogram: Course Age Binsdat[,'xcc'] <-cut(dat[,'educ'], 2)rgram_c <-lm(wage~xcc, data=dat)# Regressogram: Fine Age Binsdat[,'xcf'] <-cut(dat[,'educ'], 3)rgram_f <-lm(wage~xcf, data=dat)# Regressogram (Means for each level)dat[,'xd'] <-as.factor(dat[,'educ'])rgram_d <-lm(wage~xd, data=dat)## Compare Models (Only 2 for simplicity)lines( dat[,'educ'], predict(reg), lwd=2, col=2)lines( dat[,'educ'], predict(rgram_f), lwd=2, col=4, type='s')legend('topleft',legend=c('Linear Regression','Regressogram (3)'),col=c(2,4),lty=1, cex=.8)
To conduct a regressogram, first divide \(X\) into \(1,...L\) exclusive bins of half-width \(h\). Each bin has a midpoint, \(x\), and each observation has an associated dummy variable \(\hat{D}_{i}(x,h) = \mathbf{1}\left(\hat{X}_{i} \in \left(x-h,x+h\right] \right)\). Then we conduct a regression with this model: \[\begin{eqnarray}
\hat{Y}_{i} &=& \sum_{x \in \{x_{1}, ..., x_{L} \}} b_{0}(x,h) \hat{D}_{i}(x,h) + e_{i},
\end{eqnarray}\] where each bin has a coefficient \(b_{0}(x,h)\).
Local Averages.
When minimizing the sum of squared errors, the optimal coefficients are denoted as \(\hat{b}_{0}(x,h)\) and we can find them analytically. To do so, notice that each bin has \(n(x,h) = \sum_{i}^{n}\hat{D}_{i}(x,h)\) observations. This means we can split the dataset into parts associated with each bin \[\begin{eqnarray}
\label{eqn:regressogram1}
\sum_{i}^{n}\left[e_{i}\right]^2
&=& \sum_{i}^{n}\left[\hat{Y}_{i}- \sum_{x \in \{x_{1}, ..., x_{L} \}} b_{0}(x,h) \hat{D}_{i}(x,h) \right]^2 \\
&=& \sum_{i}^{n(x_{1},h)}\left[\hat{Y}_{i}- \sum_{x \in \{x_{1}, ..., x_{L} \}} b_{0}(x,h) \hat{D}_{i}(x,h) \right]^2
+ ... \sum_{i}^{n(x_{L},h)}\left[\hat{Y}_{i}- \sum_{x \in \{x_{1}, ..., x_{L} \}} b_{0}(x,h) \hat{D}_{i}(x,h) \right]^2 \\
&=& \sum_{i}^{n(x_{1},h)}\left[\hat{Y}_{i}- b_{0}\left(x_1,h\right) \right]^2 + ... \sum_{i}^{n(x_L,h)}\left[\hat{Y}_{i}-b_{0}\left(x_L,h\right) \right]^2 % +~ (N-1)\sum_{i}\hat{Y}_{i}.
\end{eqnarray}\] This separation allows us to analytically optimize for each bin separately \[\begin{eqnarray}
\label{eqn:regressogram2}
\min_{ \left\{ b_{0}(x,h) \right\} } \sum_{i}^{n}\left[e_{i}\right]^2
&=& \min_{ \left\{ b_{0}(x,h) \right\} } \sum_{i}^{n(x,h)}\left[\hat{Y}_{i}- b_{0}\left(x,h\right) \right]^2,
\end{eqnarray}\] In any case, minimizing yields the optimal coefficient as follows \[\begin{eqnarray}
0 &=& -2 \sum_{i}^{n(x,h)}\left[ \hat{Y}_{i} - b_{0}(x,h) \right] \\
\hat{b}_{0}(x,h) &=& \frac{\sum_{i}^{n(x,h)} \hat{Y}_{i}}{ n(x,h) } = \hat{M}_{Y}(x,h).
\end{eqnarray}\] As such, the OLS regression yields coefficients that are interpreted as the conditional mean: \(\hat{M}_{Y}(x,h)\). We can directly compute the same statistic directly by simply taking the average value of \(\hat{Y}_{i}\) for all \(i\) observations in a particular bin.
In any case, the values predicted by the model are found as \[\begin{eqnarray}
\hat{y}_{i} = \sum_{x} \hat{b}_{0}(x,h) \hat{D}_{i}(x,h) = \sum_{x} \hat{M}_{Y}(x,h) \hat{D}_{i}(x,h).
\end{eqnarray}\] I.e., the predicted value of observation \(i\) is the the average value for its bin.
Note
Consider this two-bin regressogram example of how age affects wage for people with \(\leq 9\) years of school complete vs \(> 9\). \[\begin{eqnarray}
\text{Wage}_{i} &=& b_{0}(x=4.5, h=9/2) \mathbf{1}\left(\text{Educ}_{i} \in (0,9]\right) + b_{0}(x=13.5, h=9/2) \mathbf{1}\left(\text{Age}_{i} \in (9,18] \right) + e_{i}.
\end{eqnarray}\]
Here is a simple example with three data points \((\hat{X}_{i}, \hat{Y}_{i}) \in \{ (1,2), (4,4), (12,3) \}\), we can easily compute \[\begin{eqnarray}
n(x=4.5,h=9/2) &=& 2\\
\hat{b}_{0}(x=4.5, h=9/2) &=& [2 + 4] / 2 \\
n(x=13.5,h=9/2) &=& 1\\
\hat{b}_{0}(x=13.5, h=9/2) &=& 3 / 1
\end{eqnarray}\]
The regressogram depicts locally constant relationships. We can also included slope terms within each bin to allow for locally linear relationships. This is often called segmented/piecewise regression, which runs a separate regression for different subsets of the data.
The model is \[\begin{eqnarray}
\hat{Y}_{i} &=& \sum_{x} \left[b_{0}(x,h) + b_{1}(x,h)\hat{X}_{i} \right] \hat{D}_{i}(x,h) + e_{i}.
\end{eqnarray}\] This same separation as above allows us to analytically optimize for each bin separately. I.e. we run separate regressions on the split samples. From the previous chapter on simple linear regression, we know the solutions are \[\begin{eqnarray}
\hat{b}_{0}(x,h) &=& \hat{M}_{Y}(x,h)-\hat{b}_{1}\hat{M}_{X}(x,h)\\
\hat{b}_{1}(x,h) &=& \frac{\sum_{i}^{n(x,h)}(\hat{X}_{i}-\hat{M}_{X}(x,h))(\hat{Y}_{i}-\hat{M}_{Y}(x,h))}{\sum_{i}^{n(x,h)}(\hat{X}_{i}-\hat{M}_{X}(x,h))^2} = \frac{\hat{C}_{XY}(x,h)}{\hat{V}_{X}(x,h)},
\end{eqnarray}\]
Note
Code
# See that two methods give the same predictions# Piecewise: Course Age Binsdat[,'xcc'] <-cut(dat[,'educ'], 2)preg_c <-lm(wage~xcc*educ, data=dat)pred2 <-predict(preg_c)## Split Sample Regressionsdat2 <-split( dat, dat[,'xcc'])pred2B <-vector("list", length(dat2))names(pred2B) <-names(dat2)for (i inseq_along(dat2)) { reg2 <-lm(wage~educ, dat2[[i]]) pred2B[[i]] <-predict(preg_c)}# Any differences?all( pred2 -unlist(pred2B) <1e-10 )## [1] TRUE
Tip
Compare a simple regression to a regressogram and a piecewise regression. Examine the relationship between ‘Murder’ and ‘Urbanization’ in the USArrests dataset.
Try to do so on your own before looking at the code below, which compares two piecewise regressions to a simple linear regression.
Code
# Diagnose Fit#plot( fitted(reg), resid(reg), pch=16, col=grey(0,.5))#plot( xy[,'x'], resid(reg), pch=16, col=grey(0,.5))# Linear in 2 Pieces (subsets)xcut2 <-cut(xy[,'x'],2)xy_list2 <-split(xy, xcut2)regs2 <-vector("list", length(xy_list2))names(regs2) <-names(xy_list2)for (i inseq_along(xy_list2)) { regs2[[i]] <-lm(y~x, data=xy_list2[[i]])}# Linear in 3 Pieces (subsets or bins)xcut3 <-cut(xy[,'x'], seq(32,92,by=20)) # Finer Binsxy_list3 <-split(xy, xcut3)regs3 <-vector("list", length(xy_list3))names(regs3) <-names(xy_list3)for (i inseq_along(xy_list3)) { regs3[[i]] <-lm(y~x, data=xy_list3[[i]])}## Make Predictionspred1 <-data.frame(yhat=predict(reg),x=reg[['model']][,'x'])pred1 <- pred1[order(pred1[,'x']),]pred2 <-vector("list", length(regs2))for (i inseq_along(regs2)) { pred2[[i]] <-data.frame(yhat=predict(regs2[[i]]),x=regs2[[i]][['model']][,'x'])}pred2 <-do.call(rbind,pred2)pred2 <- pred2[order(pred2[,'x']),]pred3 <-vector("list", length(regs3))for (i inseq_along(regs3)) { pred3[[i]] <-data.frame(yhat=predict(regs3[[i]]),x=regs3[[i]][['model']][,'x'])}pred3 <-do.call(rbind,pred3)pred3 <- pred3[order(pred3[,'x']),]# Compare Predictionsplot(y ~ x, pch=16, col=grey(0,.5), dat=xy)lines(yhat~x, pred1, lwd=2, col=2)lines(yhat~x, pred2, lwd=2, col=4)lines(yhat~x, pred3, lwd=2, col=3)legend('topleft',legend=c('Globally Linear','Piecewise Linear (2)','Piecewise Linear (3)'),lty=1, col=c(2,4,3), cex=.8)
For many things, a simple linear regression, regressograms, or piecewise regression is “good enough”. Simple linear regressions struggle with nonlinear relationships but are very easy to run with a computer. Regressograms and piecewise regressions are intuitive ways to capture nonlinear relationships that are computationally efficient but have obvious problems where the bins change. Sometimes we want smoother predictions or to estimate derivatives (gradients). To cover more advanced regression methods that do those things, we will need to first learn about kernel density estimation.
Weighted Regression.
Interestingly, we can obtain the same statistics from weighted least squares regression. For some specific design point, \(x\), we can find \(\hat{b}(x, h)\) by minimizing \[\begin{eqnarray}
\sum_{i}^{n}\left[ e_{i} \right]^2 \hat{D}_{i}(x,h) &=& \sum_{i}^{n}\left[ \hat{Y}_{i}- b_{0}(x,h) - b_{1}(x,h) \hat{X}_{i} \right]^2 \hat{D}_{i}(x,h) \\
&=& \sum_{i}^{n(x_{1},h)}\left[ \hat{Y}_{i}- b_{0}(x_{1},h) - b_{1}(x_{1},h) \hat{X}_{i} \right]^2 \hat{D}_{i}(x_{1},h) + ... \sum_{i}^{n(x_{L},h)}\left[ \hat{Y}_{i}- b_{0}(x_{L},h) - b_{1}(x_{L},h) \hat{X}_{i} \right]^2 \hat{D}_{i}(x_{L},h) \\
&=& \sum_{i}^{n(x,h)}\left[\hat{Y}_{i}- b_{0}\left(x,h\right) - b_{1}(x,h) \hat{X}_{i} \right]^2
\end{eqnarray}\]
We get nearly identical results if we instead use “uniform weights” with half-width \(h\), \[\begin{eqnarray}
k_{U}\left( \hat{X}_{i}, x, h \right)
&=& \frac{\mathbf{1}\left(\frac{|\hat{X}_{i}-x|}{h} \leq 1\right)}{2}
= \frac{\mathbf{1}\left( \hat{X}_{i} \in \left[ x-h, x + h\right] \right) }{2} \\
\end{eqnarray}\] with is nearly identical to \(\hat{D}_{i}(x,h)/ 2\), but uses intervals \([]\) instead of \((]\). As such we can see that \[\begin{eqnarray}
\sum_{i}^{n}\left[ e_{i} \right]^2 k_{U}\left( \hat{X}_{i}, x, h \right)
&\approx& \sum_{i}^{n}\left[ e_{i} \right]^2 \hat{D}_{i}(x,h) / 2 \\
&=& \sum_{i}^{n(x,h)}\left[\hat{Y}_{i}- b_{0}\left(x,h\right) - b_{1}(x,h) \hat{X}_{i} \right]^2 / 2,
\end{eqnarray}\] The constant term \(1/2\) is irrelevant to finding the optimal solution (you can check the math yourself).
Tip
Here is an example going into the details of the weights.
#library(wooldridge) # Data from before#dat <- wage1[order(wage1[,'educ']), c('wage','educ')]## Weighted Regressionx <-4.5h <-9.036/2#makes bin large enough to include 0: (-0.018, 9]k <-abs(dat[,'educ']-x) /hk_weights <-dunif(k) / (2*h)preg_k <-lm(wage~educ, data=dat, weights=k_weights)predict(preg_k, newdata=data.frame(educ=x))## 1 ## 3.911792# Compare to Piecewisedat[,'xcc'] <-cut(dat[,'educ'], 2)preg_c <-lm(wage~xcc*educ, data=dat)predict(preg_c, newdata=data.frame(educ=x, xcc='(-0.018,9]'))## 1 ## 3.911792
14.4 Locally Constant Regression
Consider a point \(x\) and model where values are constant around \(x\); \(\hat{Y}_{i} = b_{0}(x, h) + e_{i}\). Then notice a weighted OLS estimator with uniform kernel weights yields \[\begin{eqnarray} \label{eqn:lcls}
& & \min_{b_{0}(x,h)}~ \sum_{i}^{n}\left[e_{i} \right]^2 k_{U}\left( \hat{X}_{i}, x, h \right) \\
\Rightarrow & & -2 \sum_{i}^{n}\left[\hat{Y}_{i}- b_{0}(x, h) \right] k_{U}\left(\hat{X}_{i}, x, h\right) = 0\\
\label{eqn:lcls1}
\Rightarrow & & \hat{b}_{0}(x)
= \frac{\sum_{i} \hat{Y}_{i} k_{U} \left( \hat{X}_{i}, x, h \right) }{ \sum_{i} k_{U}\left( \hat{X}_{i}, x, h \right) }
= \sum_{i} \hat{Y}_{i} \left[ \frac{ k_{U} \left( \hat{X}_{i}, x, h \right) }{ \sum_{i} k_{U}\left( \hat{X}_{i}, x, h \right)} \right]
= \sum_{i} \hat{Y}_{i} w_{i}(x, h),
\end{eqnarray}\] where weights are determined from \[\begin{eqnarray}
\sum_{i}^{n} k_{U} \left( \hat{X}_{i}, x, h \right) &=& \sum_{i}^{n(x,h)} k_{U} \left( \hat{X}_{i}, x, h \right) + \sum_{i}^{n - n(x,h)} 0 = \frac{n(x, h)}{2}\\
w_{i}(x, h) &=& \left[ \frac{ k_{U} \left( \hat{X}_{i}, x, h \right) }{ \sum_{i} k_{U}\left( \hat{X}_{i}, x, h \right)} \right]
= \frac{\mathbf{1}\left( |\hat{X}_{i} - x| \leq h \right)}{n(x, h)}
\end{eqnarray}\] So locally constant kernel regression recovers the weighted mean of \(Y_{i}\) around design point \(x\). If we use exclusive bins, then we are essentially running a regressogram. As such, the regressogram is more crude but can be estimated more easily.
Note
See how the results are nearly identical below. Then consider different design points.
When \(n\) is small, \(\hat{b}_{U}(x, h)\) is typically estimated for each unique observed value: \(x \in \{ x_{1},...x_{n} \}\). For large datasets, you can select a subset or evenly spaced values of \(x\) for which to make predictions.
If \(\hat{X}_{i}\) represents time, then the local constant regressions is also called a moving average. We can create a weighted moving average by using a different distribution. I.e. using the Normal dnorm instead of the Uniform dunif.
The basic idea also generalizes other kernels. As such, a kernel regression using uniform weights is often called a “naive kernel regression”. Typically, kernel regressions use kernels that weight nearby observations more heavily.
Bias
The regressogram has a bias at the edges of the bins, which is addressed by the local constant regression. Still, the local constant regression has a bias near the edges of the dataset. (The kernel divides by \(2h\) assuming there is data on both sides of the design point). This can be addressed by renormalizing weights.
Tip
Here is an example going into the details of the weights.
A less simple case is a local linear regression which conducts a linear regression for each data point using a subsample of data around it. Consider a point \(x\) and model \(\hat{Y}_{i} = b_{0}(x,h) + b_{1}(x) \hat{X}_{i} + e_{i}\) for data near \(x\). The weighted OLS estimator with kernel weights is \[\begin{eqnarray}
& & \min_{b_{0}(x, h),b_{1}(x, h)}~ \sum_{i}^{n}\left[\hat{Y}_{i}- b_{0}(x, h) - b_{1}(x,h) \hat{X}_{i} \right]^2 k_{U}\left( \hat{X}_{i}, x, h\right)
\end{eqnarray}\] Deriving the optimal values \(\hat{b}_{0}(x, h)\) and \(\hat{b}_{1}(x,h)\) for \(k_{U}\) is left as a homework exercise.
Code
# ``Naive" Smootherpred_fun <-function(x0, h){# Assign equal weight to observations within h distance to x0# 0 weight for all other observations ki <-abs(dat[,'educ']-x0)/h ki <-dunif(ki)/(2*h) ## Could change, e.g. dnorm(ki)/h wi <- ki/sum(ki, na.rm=T) # always sum to 1 (for edge-effects)# run regression with weighted data llls_i <-lm(wage~educ, data=dat, weights=wi) yhat_i <-predict(llls_i, newdata=data.frame(educ=x0))}X0 <-seq(0,18, by=0.5) # Design points# Fine Binspred_lo1 <-vector(length=length(X0))for(i inseq_along(X0)){ pred_lo1[i] <-pred_fun(x0=X0[i], h=2)}# Course Binspred_lo2 <-vector(length=length(X0))for(i inseq_along(X0)){ pred_lo2[i] <-pred_fun(x0=X0[i], h=6)}# Plotplot(wage~educ, pch=16, data=wage1, col=grey(0,.1),ylab='Murder Rate', xlab='Population Density')cols <-c(rgb(.8,0,0,.5), rgb(0,0,.8,.5))lines(X0, pred_lo1, col=cols[1], lwd=1, type='o')lines(X0, pred_lo2, col=cols[2], lwd=1, type='o')legend('topleft', title='Locally Linear',legend=c('h=2', 'h=6'),lty=1, col=cols, cex=.8)
Note
Compare local linear and local constant regressions using https://shinyserv.es/shiny/kreg/, with degree \(0\) and \(1\). Also try different kernels and different datasets.
Tip
Examine the local relationship between ‘Murder’ and ‘Urbanization’ in the USArrests dataset using LLLS.
An important extension of locally linear regressions is called loess, which uses adaptive bandwidths in order to have a similar number of data points around each design point. This is especially useful when \(\hat{X}_{i}\) is not uniform.