6 (Re)Sampling
6.1 Sample Distributions
The sampling distribution of a statistic shows us how much a statistic varies from sample to sample.
For example, see how the mean varies across three different samples.
# Three Sample Example
par(mfrow=c(1,3))
sapply(1:3, function(i){
x <- runif(100)
m <- mean(x)
hist(x,
breaks=seq(0,1,by=.1), #for comparability
main=NA, border=NA)
abline(v=m, col=2, lwd=2)
title(paste0('mean= ', round(m,2)), font.main=1)
return(m)
})
## [1] 0.5425521 0.4771145 0.5038657
Now examine the sampling distribution of the mean.
sample_means <- sapply(1:1000, function(i){
m <- mean(runif(100))
return(m)
})
hist(sample_means, breaks=50, border=NA,
col=2, font.main=1,
main='Sampling Distribution of the mean')
This is one of the most profound results in statistics, known as the central limit theorem: the sampling distribution of the mean is approximately normal.
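As a quick visual check, we can overlay the normal density with the same mean and standard error on the simulated sampling distribution. (This is a sketch reusing the sample_means computed above; for a Uniform(0,1) variable the mean is 1/2 and the standard deviation is \(\sqrt{1/12}\), so the standard error of the mean of 100 draws is \(\sqrt{1/12}/10\).)
par(mfrow=c(1,1))
hist(sample_means, breaks=50, border=NA, col=2,
freq=FALSE, font.main=1,
main='Sampling Distribution of the mean, with normal approximation')
# Overlay the normal density implied by the central limit theorem
curve(dnorm(x, mean=1/2, sd=sqrt(1/12)/sqrt(100)), add=TRUE, lwd=2)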
There are actually many variants of the central limit theorem, as it applies more generally: the sampling distributions of many statistics are approximately normal. For example, examine the sampling distribution of the standard deviation.
sample_sds <- sapply(1:1000, function(i){
s <- sd(runif(100))
return(s)
})
hist(sample_sds, breaks=50, border=NA,
col=4, font.main=1,
main='Sampling Distribution of the sd')
It is beyond this class to prove this result, but you should know that not all sampling distributions are normal. For example, examine the sampling distributions of three basic “order statistics”: the minimum, the median, and the maximum.
# Create a 1000 x 300 matrix of random uniform variables
x <- sapply(1:300, function(i) runif(1000) )
# Treat each row as a new sample of size 300
length(x[1,])
## [1] 300
# Median looks normal; Maximum and Minimum do not!
xmin <- apply(x,1,quantile, probs=0)
xmed <- apply(x,1,quantile, probs=.5)
xmax <- apply(x,1,quantile, probs=1)
par(mfrow=c(1,3))
hist(xmin, breaks=100, border=NA, main='Min', font.main=1)
hist(xmed, breaks=100, border=NA, main='Med', font.main=1)
hist(xmax, breaks=100, border=NA, main='Max', font.main=1)
title('Sampling Distributions', outer=T, line=-1)
# To explore, try any function!
fun_of_rv <- function(f, n=100){
x <- runif(n)
y <- f(x)
return(y)
}
fun_of_rv( f=mean )
## [1] 0.4634983
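For example, here are two other statistics you might try (illustrative choices; any function of the sample works, and the output varies from run to run):
fun_of_rv( f=median )
fun_of_rv( f=function(x){ max(x) - min(x) } ) # the sample range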
6.2 Intervals
Using the sampling distribution of a statistic (or a bootstrap or jackknife distribution, discussed in the next section), we can calculate
- confidence interval: the range in which your statistic varies across different samples.
- standard error: the standard deviation of your statistic across different samples.
Note that in some cases (not the focus here), you can use an estimate of the standard error to construct a confidence interval, as in the sketch below.
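As a concrete sketch of both quantities, reusing the sample_means simulated in the previous section (the 1.96 multiplier comes from a normal approximation):
# Standard error: standard deviation of the sampling distribution
se <- sd(sample_means)
se
# One way to use it: a normal-approximation 95% interval for the mean
mean(sample_means) + c(-1.96, 1.96)*se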
Confidence Interval. Compute the upper and lower quantiles of the sampling distribution.
bks <- seq(.4,.6,by=.001)
hist(sample_means, breaks=bks, border=NA,
col=rgb(0,0,0,.25), font.main=1,
main='Confidence Interval for the mean')
# Middle 90%
mq <- quantile(sample_means, probs=c(.05,.95))
abline(v=mq)
## [1] "we are 90% confident that the mean is between 0.47 and 0.53"
sample_quants <- apply(x,1,quantile, probs=.99)
bks <- seq(.92,1,by=.001)
hist(sample_quants, breaks=bks, border=NA,
col=rgb(0,0,0,.25), font.main=1,
main='Confidence Interval for the 99th percentile')
# Middle 95%
mq <- quantile(sample_quants, probs=c(.025,.975))
abline(v=mq)
paste0('we are 95% confident that the upper percentile is between ',
round(mq[1],2), ' and ', round(mq[2],2) )
## [1] "we are 95% confident that the upper percentile is between 0.97 and 1"
(See also https://online.stat.psu.edu/stat200/lesson/4/4.4/4.4.2)
Prediction Interval. Compute the upper and lower quantiles of the data values themselves (rather than of a statistic).
# Middle 90% of values
xq0 <- quantile(x, probs=c(.05,.95))
bks <- seq(0,1,by=.01)
hist(x, breaks=bks, border=NA,
main='Prediction Interval', font.main=1)
abline(v=xq0)
paste0('we are 90% confident that a future data point will be between ',
round(xq0[1],2), ' and ', round(xq0[2],2) )
## [1] "we are 90% confident that a future data point will be between 0.05 and 0.95"
Advanced Intervals. In many cases, we want an X% interval to mean that X% of the intervals we generate will contain the mean (confidence interval) or new observations (prediction interval). E.g., a 50% CI means that half of the intervals we create contain the true mean.
# Interval estimate for each sample: mean +/- one standard deviation
xq <- apply(x,1, function(r){ mean(r) + c(-1,1)*sd(r) })
# First 4 interval estimates
xq[,1:4]
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.1935914 0.2085894 0.2303030 0.1979923
## [2,] 0.8023608 0.7771637 0.7939178 0.7716116
# Frequency each point was in an interval
bks <- seq(0,1,by=.01)
xcov <- sapply(bks, function(b){
bl <- b >= xq[1,]
bu <- b <= xq[2,]
mean( bl & bu )
})
plot.new()
plot.window(xlim=c(0,1), ylim=c(0,1))
polygon( c(bks, rev(bks)), c(xcov, xcov*0), col=grey(.5,.5), border=NA)
mtext('Frequency each value was in an interval',2, line=2.5)
axis(1)
axis(2)
# 50% Coverage
c_ul <- range(bks[xcov>=.5])
abline(h=.5, lwd=2)
segments(c_ul,0,c_ul,.5, lty=2)
c_ul # 50% confidence interval
## [1] 0.22 0.78
6.3 Resampling
Often, we only have one sample. For example, consider the murder arrest rates in the USArrests dataset.
mean(USArrests$Murder)
## [1] 7.788
How then can we estimate the sampling distribution of a statistic? We can “resample” our data. Hesterberg (2015) provides a nice illustration of the idea. The two most basic versions are the jackknife, which recomputes the statistic leaving out one observation at a time, and the bootstrap, which recomputes it on samples drawn with replacement from the original data. Both are demonstrated below.
sample_dat <- USArrests$Murder
sample_mean <- mean(sample_dat)
# Jackknife Estimates
n <- length(sample_dat)
Jmeans <- sapply(1:n, function(i){
dati <- sample_dat[-i]
mean(dati)
})
hist(Jmeans, breaks=25, border=NA,
main='', xlab=expression(bar(X)[-i]))
abline(v=sample_mean, col='red', lty=2)
# Bootstrap estimates
set.seed(2)
Bmeans <- sapply(1:10^4, function(i) {
dat_b <- sample(sample_dat, replace=T) # c.f. jackknife
mean(dat_b)
})
hist(Bmeans, breaks=25, border=NA,
main='', xlab=expression(bar(X)[b]))
abline(v=sample_mean, col='red', lty=2)
Caveat. Note that we do not use the mean of the bootstrap or jackknife statistics as a replacement for the original estimate. This is because the bootstrap and jackknife distributions are centered at the observed statistic, not at the population parameter. (The bootstrapped mean is centered at the sample mean, not the population mean.) This means that we cannot use the bootstrap to improve on \(\overline{x}\), no matter how many bootstrap samples we take. We can, however, use the jackknife and bootstrap to estimate sampling variability.
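A quick check of this point, reusing the Bmeans and sample_mean objects computed above (exact numbers vary with the seed and number of draws):
mean(Bmeans) - sample_mean # essentially zero: the bootstrap recenters on the sample mean
sd(Bmeans) # but the spread is a useful estimate of sampling variability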
Intervals. Note that both methods provide imperfect estimates, and can give different numbers. Until you know more, a conservative approach is to take the larger estimate. For example, compare the 95% intervals implied by the bootstrap and jackknife distributions of the mean.
# Bootstrap 95% interval
quantile(Bmeans, probs=c(.025, .975))
##    2.5%   97.5%
## 6.58200 8.97005
# Jackknife 95% interval
quantile(Jmeans, probs=c(.025, .975))
##     2.5%    97.5%
## 7.621582 7.904082
Also note that the standard deviation measures variability within a single sample, and is hence different from the standard error, which measures the variability of a statistic across samples. Nonetheless, both the jackknife and the bootstrap can be used to estimate that variability.
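For instance, a sketch of the distinction using the objects defined above (sample_dat, n, and Bmeans):
sd(sample_dat) # standard deviation: spread of the data within the sample
sd(sample_dat)/sqrt(n) # classic standard-error estimate for the mean
sd(Bmeans) # bootstrap standard-error estimate for the mean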
6.4 Value of More Data
Each additional data point you have provides more information, which ultimately decreases the standard error of your estimates. However, it does so at a decreasing rate (known in economics as diminishing returns): for many statistics, the standard error shrinks roughly in proportion to \(1/\sqrt{n}\).
Nseq <- seq(1,100, by=1) # Sample sizes
B <- 1000 # Number of simulated samples per sample size
SE <- sapply(Nseq, function(n){
sample_statistics <- sapply(1:B, function(b){
x <- rnorm(n) # Sample of size N
quantile(x,probs=.4) # Statistic
})
sd(sample_statistics)
})
par(mfrow=c(1,2))
plot(Nseq, SE, pch=16, col=grey(0,.5),
main='Absolute Gain', font.main=1,
ylab='standard error', xlab='sample size')
plot(Nseq[-1], abs(diff(SE)), pch=16, col=grey(0,.5),
main='Marginal Gain', font.main=1,
ylab='decrease in standard error', xlab='sample size')