Scatterplots are a great and simplest plot for bivariate data that simply plots each observation. There are many extensions and similar tools. The example below helps understand how both the central tendency and dispersion change.
Code
# Local relationship: wages and educationlibrary(Ecdat)## Plot 1plot(wage ~ school, data=Wages1, pch=16, col=grey(0, .1),main=NA, xlab='School', ylab='Wage')school_means <-aggregate(wage ~ school, data=Wages1, mean)points(wage ~ school, data=school_means, pch=17, col=rgb(0, 0, 1, .8), type='b')title('Grouped Means and Scatterplot', font.main=1)
Code
## Plot 2 (Alternative for big datasets)# boxplot(wage ~ school, data=Wages1,# pch=16, col=grey(0, .1), varwidth=T)#title('Boxplots', font.main=1)## Plot 3 (Less informative!)#barplot(wage ~ school, data=school_means)#title('Bar Plot of Grouped Means', font.main=1)
TipTest Yourself
Examine the local relationship between ‘Murder’ and ‘Urbanization’ in the USArrests dataset. Use custom bins.
To move beyond descriptive statistics, we will use models. We previously covered linear models, but you could be analyzing data with nonlinear relationships. So now we consider a general relationship: \[\begin{eqnarray}
\hat{Y}_{i} = m(\hat{X}_{i}) + e_{i},
\end{eqnarray}\] where \(m\) is an unknown function and \(e_{i}\) is white noise, which we estimate via different models. After minimizing the sum of squared errors in a the regressions below, our estimated models \(\hat{m}\) will also predict local averages.
14.2 Regressograms
Just as we use histograms to describe the distribution of a random variable, we can use a regressogram for conditional relationships. Specifically, we can use dummies for exclusive intervals or “bins” to estimate the average value of \(\hat{Y}_{i}\) for data inside the bin.
To conduct a regressogram, first divide \(X\) into \(1,...L\) exclusive bins of half-width \(h\). Each bin has a midpoint, \(x\), and each observation has an associated dummy variable \(\hat{D}_{i}(x,h) = \mathbf{1}\left(\hat{X}_{i} \in \left(x-h,x+h\right] \right)\). Then we conduct a regression with this model: \[\begin{eqnarray}
\hat{Y}_{i} &=& \sum_{x \in \{x_{1}, ..., x_{L} \}} b_{0}(x,h) \hat{D}_{i}(x,h) + e_{i},
\end{eqnarray}\] where each bin has a coefficient \(b_{0}(x,h)\).
Local Averages.
When minimizing the sum of squared errors, the optimal coefficients are denoted as \(\hat{b}_{0}(x,h)\) and we can find them analytically. To do so, notice that each bin has \(n(x,h) = \sum_{i}^{n}\hat{D}_{i}(x,h)\) observations. This means we can split the dataset into parts associated with each bin \[\begin{eqnarray}
\label{eqn:regressogram1}
\sum_{i}^{n}\left[e_{i}\right]^2
&=& \sum_{i}^{n}\left[\hat{Y}_{i}- \sum_{x \in \{x_{1}, ..., x_{L} \}} b_{0}(x,h) \hat{D}_{i}(x,h) \right]^2 \\
&=& \sum_{i}^{n(x_{1},h)}\left[\hat{Y}_{i}- \sum_{x \in \{x_{1}, ..., x_{L} \}} b_{0}(x,h) \hat{D}_{i}(x,h) \right]^2
+ ... \sum_{i}^{n(x_{L},h)}\left[\hat{Y}_{i}- \sum_{x \in \{x_{1}, ..., x_{L} \}} b_{0}(x,h) \hat{D}_{i}(x,h) \right]^2 \\
&=& \sum_{i}^{n(x_{1},h)}\left[\hat{Y}_{i}- b_{0}\left(x_1,h\right) \right]^2 + ... \sum_{i}^{n(x_L,h)}\left[\hat{Y}_{i}-b_{0}\left(x_L,h\right) \right]^2 % +~ (N-1)\sum_{i}\hat{Y}_{i}.
\end{eqnarray}\] This separation allows us to analytically optimize for each bin separately \[\begin{eqnarray}
\label{eqn:regressogram2}
\min_{ \left\{ b_{0}(x,h) \right\} } \sum_{i}^{n}\left[e_{i}\right]^2
&=& \min_{ \left\{ b_{0}(x,h) \right\} } \sum_{i}^{n(x,h)}\left[\hat{Y}_{i}- b_{0}\left(x,h\right) \right]^2,
\end{eqnarray}\] In any case, minimizing yields the optimal coefficient as follows \[\begin{eqnarray}
0 &=& -2 \sum_{i}^{n(x,h)}\left[ \hat{Y}_{i} - b_{0}(x,h) \right] \\
\hat{b}_{0}(x,h) &=& \frac{\sum_{i}^{n(x,h)} \hat{Y}_{i}}{ n(x,h) } = \hat{M}_{Y}(x,h).
\end{eqnarray}\] As such, the OLS regression yields coefficients that are interpreted as the conditional mean: \(\hat{M}_{Y}(x,h)\). We can directly compute the same statistic directly by simply taking the average value of \(\hat{Y}_{i}\) for all \(i\) observations in a particular bin.
In any case, the values predicted by the model are found as \[\begin{eqnarray}
\hat{y}_{i} = \sum_{x} \hat{b}_{0}(x,h) \hat{D}_{i}(x,h) = \sum_{x} \hat{M}_{Y}(x,h) \hat{D}_{i}(x,h).
\end{eqnarray}\] I.e., the predicted value of observation \(i\) is the the average value for its bin.
NoteMust Know
Consider this two-bin regressogram example of how age affects wage for people with \(\leq 9\) years of school complete vs \(> 9\). \[\begin{eqnarray}
\text{Wage}_{i} &=& b_{0}(x=4.5, h=9/2) \mathbf{1}\left(\text{Educ}_{i} \in (0,9]\right) + b_{0}(x=13.5, h=9/2) \mathbf{1}\left(\text{Age}_{i} \in (9,18] \right) + e_{i}.
\end{eqnarray}\]
Here is a simple example with three data points \((\hat{X}_{i}, \hat{Y}_{i}) \in \{ (1,2), (4,4), (12,3) \}\), we can easily compute \[\begin{eqnarray}
n(x=4.5,h=9/2) &=& 2\\
\hat{b}_{0}(x=4.5, h=9/2) &=& [2 + 4] / 2 \\
n(x=13.5,h=9/2) &=& 1\\
\hat{b}_{0}(x=13.5, h=9/2) &=& 3 / 1
\end{eqnarray}\]
The regressogram depicts locally constant relationships. We can also included slope terms within each bin to allow for locally linear relationships. This is often called segmented/piecewise regression, which runs a separate regression for different subsets of the data.
The model is \[\begin{eqnarray}
\hat{Y}_{i} &=& \sum_{x} \left[b_{0}(x,h) + b_{1}(x,h)\hat{X}_{i} \right] \hat{D}_{i}(x,h) + e_{i}.
\end{eqnarray}\] This same separation as above allows us to analytically optimize for each bin separately. I.e. we run separate regressions on the split samples. From the previous chapter on simple linear regression, we know the solutions are \[\begin{eqnarray}
\hat{b}_{0}(x,h) &=& \hat{M}_{Y}(x,h)-\hat{b}_{1}\hat{M}_{X}(x,h)\\
\hat{b}_{1}(x,h) &=& \frac{\sum_{i}^{n(x,h)}(\hat{X}_{i}-\hat{M}_{X}(x,h))(\hat{Y}_{i}-\hat{M}_{Y}(x,h))}{\sum_{i}^{n(x,h)}(\hat{X}_{i}-\hat{M}_{X}(x,h))^2} = \frac{\hat{C}_{XY}(x,h)}{\hat{V}_{X}(x,h)},
\end{eqnarray}\]
NoteMust Know
Code
# See that two methods give the same predictions# Piecewise: Course Age Binsdat[, 'xcc'] <-cut(dat[, 'school'], 2)preg_c <-lm(wage ~ xcc*school, data=dat)pred2 <-predict(preg_c)## Split Sample Regressionsdat2 <-split( dat, dat[, 'xcc'])pred2B <-vector('list', length(dat2))names(pred2B) <-names(dat2)for (i inseq_along(dat2)) { reg2 <-lm(wage ~ school, dat2[[i]]) pred2B[[i]] <-predict(preg_c)}# Any differences?all( pred2 -unlist(pred2B) <1e-10 )## [1] TRUE
TipTest Yourself
Compare a simple regression to a regressogram and a piecewise regression. Examine the relationship between ‘Murder’ and ‘Urbanization’ in the USArrests dataset.
For many things, a simple linear regression, regressograms, or piecewise regression is “good enough”. Simple linear regressions struggle with nonlinear relationships but are very easy to run with a computer. Regressograms and piecewise regressions are intuitive ways to capture nonlinear relationships that are computationally efficient but have obvious problems where the bins change. Sometimes we want smoother predictions or to estimate derivatives (gradients). To cover more advanced regression methods that do those things, we will need to first learn about kernel density estimation.
Weighted Regression.
Interestingly, we can obtain the same statistics from weighted least squares regression. For some specific design point, \(x\), we can find \(\hat{b}(x, h)\) by minimizing \[\begin{eqnarray}
\sum_{i}^{n}\left[ e_{i} \right]^2 \hat{D}_{i}(x,h) &=& \sum_{i}^{n}\left[ \hat{Y}_{i}- b_{0}(x,h) - b_{1}(x,h) \hat{X}_{i} \right]^2 \hat{D}_{i}(x,h) \\
&=& \sum_{i}^{n(x_{1},h)}\left[ \hat{Y}_{i}- b_{0}(x_{1},h) - b_{1}(x_{1},h) \hat{X}_{i} \right]^2 \hat{D}_{i}(x_{1},h) + ... \sum_{i}^{n(x_{L},h)}\left[ \hat{Y}_{i}- b_{0}(x_{L},h) - b_{1}(x_{L},h) \hat{X}_{i} \right]^2 \hat{D}_{i}(x_{L},h) \\
&=& \sum_{i}^{n(x,h)}\left[\hat{Y}_{i}- b_{0}\left(x,h\right) - b_{1}(x,h) \hat{X}_{i} \right]^2
\end{eqnarray}\]
We get nearly identical results if we instead use “uniform weights” with half-width \(h\), \[\begin{eqnarray}
k_{U}\left( \hat{X}_{i}, x, h \right)
&=& \frac{\mathbf{1}\left(\frac{|\hat{X}_{i}-x|}{h} \leq 1\right)}{2}
= \frac{\mathbf{1}\left( \hat{X}_{i} \in \left[ x-h, x + h\right] \right) }{2} \\
\end{eqnarray}\] with is nearly identical to \(\hat{D}_{i}(x,h)/ 2\), but uses intervals \([]\) instead of \((]\). As such we can see that \[\begin{eqnarray}
\sum_{i}^{n}\left[ e_{i} \right]^2 k_{U}\left( \hat{X}_{i}, x, h \right)
&\approx& \sum_{i}^{n}\left[ e_{i} \right]^2 \hat{D}_{i}(x,h) / 2 \\
&=& \sum_{i}^{n(x,h)}\left[\hat{Y}_{i}- b_{0}\left(x,h\right) - b_{1}(x,h) \hat{X}_{i} \right]^2 / 2,
\end{eqnarray}\] The constant term \(1/2\) is irrelevant to finding the optimal solution (you can check the math yourself).
TipTest Yourself
Here is an example going into the details of the weights.
library(Ecdat) # Data from beforedat <- Wages1[order(Wages1[, 'school']), c('wage', 'school')]## Weighted Regressionx <-4.5h <-9.036/2#makes bin large enough to include 0: (-0.018, 9]k <-abs(dat[, 'school']-x) /hk_weights <-dunif(k) / (2*h)preg_k <-lm(wage ~ school, data=dat, weights=k_weights)predict(preg_k, newdata=data.frame(school=x))## 1 ## 4.551132# Compare to Piecewisedat[, 'xcc'] <-cut(dat[, 'school'], 2)preg_c <-lm(wage ~ xcc*school, data=dat)predict(preg_c, newdata=data.frame(school=x, xcc='(9.5,16]'))## 1 ## 1.090272
14.4 Locally Constant Regression
Consider a point \(x\) and model where values are constant around \(x\); \(\hat{Y}_{i} = b_{0}(x, h) + e_{i}\). Then notice a weighted OLS estimator with uniform kernel weights yields \[\begin{eqnarray} \label{eqn:lcls}
& & \min_{b_{0}(x,h)}~ \sum_{i}^{n}\left[e_{i} \right]^2 k_{U}\left( \hat{X}_{i}, x, h \right) \\
\Rightarrow & & -2 \sum_{i}^{n}\left[\hat{Y}_{i}- b_{0}(x, h) \right] k_{U}\left(\hat{X}_{i}, x, h\right) = 0\\
\label{eqn:lcls1}
\Rightarrow & & \hat{b}_{0}(x)
= \frac{\sum_{i} \hat{Y}_{i} k_{U} \left( \hat{X}_{i}, x, h \right) }{ \sum_{i} k_{U}\left( \hat{X}_{i}, x, h \right) }
= \sum_{i} \hat{Y}_{i} \left[ \frac{ k_{U} \left( \hat{X}_{i}, x, h \right) }{ \sum_{i} k_{U}\left( \hat{X}_{i}, x, h \right)} \right]
= \sum_{i} \hat{Y}_{i} w_{i}(x, h),
\end{eqnarray}\] where weights are determined from \[\begin{eqnarray}
\sum_{i}^{n} k_{U} \left( \hat{X}_{i}, x, h \right) &=& \sum_{i}^{n(x,h)} k_{U} \left( \hat{X}_{i}, x, h \right) + \sum_{i}^{n - n(x,h)} 0 = \frac{n(x, h)}{2}\\
w_{i}(x, h) &=& \left[ \frac{ k_{U} \left( \hat{X}_{i}, x, h \right) }{ \sum_{i} k_{U}\left( \hat{X}_{i}, x, h \right)} \right]
= \frac{\mathbf{1}\left( |\hat{X}_{i} - x| \leq h \right)}{n(x, h)}
\end{eqnarray}\] So locally constant kernel regression recovers the weighted mean of \(Y_{i}\) around design point \(x\). If we use exclusive bins, then we are essentially running a regressogram. As such, the regressogram is more crude but can be estimated more easily.
NoteMust Know
See how the results are nearly identical below. Then consider different design points.
When \(n\) is small, \(\hat{b}_{U}(x, h)\) is typically estimated for each unique observed value: \(x \in \{ x_{1},...x_{n} \}\). For large datasets, you can select a subset or evenly spaced values of \(x\) for which to make predictions.
If \(\hat{X}_{i}\) represents time, then the local constant regressions is also called a moving average. We can create a weighted moving average by using a different distribution. I.e. using the Normal dnorm instead of the Uniform dunif.
The basic idea also generalizes other kernels. As such, a kernel regression using uniform weights is often called a “naive kernel regression”. Typically, kernel regressions use kernels that weight nearby observations more heavily.
Bias
The regressogram has a bias at the edges of the bins, which is addressed by the local constant regression. Still, the local constant regression has a bias near the edges of the dataset. (The kernel divides by \(2h\) assuming there is data on both sides of the design point). This can be addressed by renormalizing weights.
TipTest Yourself
Here is an example going into the details of the weights.
A less simple case is a local linear regression which conducts a linear regression for each data point using a subsample of data around it. Consider a point \(x\) and model \(\hat{Y}_{i} = b_{0}(x,h) + b_{1}(x) \hat{X}_{i} + e_{i}\) for data near \(x\). The weighted OLS estimator with kernel weights is \[\begin{eqnarray}
& & \min_{b_{0}(x, h),b_{1}(x, h)}~ \sum_{i}^{n}\left[\hat{Y}_{i}- b_{0}(x, h) - b_{1}(x,h) \hat{X}_{i} \right]^2 k_{U}\left( \hat{X}_{i}, x, h\right)
\end{eqnarray}\] Deriving the optimal values \(\hat{b}_{0}(x, h)\) and \(\hat{b}_{1}(x,h)\) for \(k_{U}\) is left as a homework exercise.
Code
# ``Naive" Smootherpred_fun <-function(x0, h){# Assign equal weight to observations within h distance to x0# 0 weight for all other observations ki <-abs(dat[, 'school']-x0)/h ki <-dunif(ki)/(2*h) ## Could change, e.g. dnorm(ki)/h wi <- ki/sum(ki, na.rm=T) # always sum to 1 (for edge-effects)# run regression with weighted data llls_i <-lm(wage ~ school, data=dat, weights=wi) yhat_i <-predict(llls_i, newdata=data.frame(school=x0))}X0 <-seq(4, 16, by=1) # Design points# Fine Binspred_lo1 <-rep(NA, length(X0))for(i inseq_along(X0)){ pred_lo1[i] <-pred_fun(x0=X0[i], h=2)}# Course Binspred_lo2 <-rep(NA, length(X0))for(i inseq_along(X0)){ pred_lo2[i] <-pred_fun(x0=X0[i], h=6)}# Plotplot(wage ~ school, pch=16, data=Wages1, col=grey(0, .1),main=NA, ylab='Wage', xlab='School')cols <-c(rgb(.8, 0, 0, .5), rgb(0, 0, .8, .5))lines(X0, pred_lo1, col=cols[1], lwd=1, type='o')lines(X0, pred_lo2, col=cols[2], lwd=1, type='o')legend('topleft', title='Locally Linear',legend=c('h=2', 'h=6'),lty=1, col=cols, cex=.8)
NoteMust Know
Compare local linear and local constant regressions using https://shinyserv.es/shiny/kreg/, with degree \(0\) and \(1\). Also try different kernels and different datasets.
TipTest Yourself
Examine the local relationship between ‘Murder’ and ‘Urbanization’ in the USArrests dataset using LLLS.
Explain the difference between a regressogram and a local linear regression. In what sense does increasing the number of bins in a regressogram create a bias-variance tradeoff?
Using the Wages1 dataset from the Ecdat package, fit a regressogram with 4 bins and a piecewise regression with 4 bins for the relationship between wage and school. Compute the predicted value at school = 10 for each model.
Using the USArrests dataset, write R code to estimate a local linear regression of Murder on UrbanPop using uniform kernel weights with bandwidth \(h = 10\). Compute predictions at design points X0 <- seq(40, 80, by = 10) and plot them over the scatterplot.