# Simple random sample (no duplicates, equal probability)
x <- c(1,2,3,4) # population
sample(x, 2, replace=F) #sample
## [1] 3 4
A sample is a subset of the population. A simple random sample is a sample where each possible sample of size \(n\) has the same probability of being selected.
Often, we think of the population as being infinitely large. This is an approximation that makes mathematical and computational work much simpler.
# All possible samples of two from a bag of numbers {1,2,3,4} with replacement
# {1,1} {1,2} {1,3} {1,4}
# {2,2} {2,3} {2,4}
# {3,3} {3,4}
# {4,4}
# Simple random sample (duplicates allowed, equal probability)
sample(x, 2, replace=T)
## [1] 3 3
Intuition for infinite populations: imagine drawing names from a giant urn. If the urn has only \(10\) names, then removing one name slightly changes the composition of the urn, and the probabilities shift for the next name you draw. Now imagine the urn has \(100\) billion names, so that removing one makes no noticeable difference. We can pretend the composition never changes: each draw is essentially identical and independent (iid). We can actually guarantee the names are iid by putting any names drawn back into the urn (sampling with replacement).
The sampling distribution of a statistic shows us how much a statistic varies from sample to sample.
For example, the sampling distribution of the mean shows how the sample mean varies from sample to sample. The sampling distribution of the mean can also be referred to as the probability distribution of the sample mean.
# Three Sample Example w/ Visual
par(mfrow=c(1,3))
for(b in 1:3){
    x <- runif(100)
    m <- mean(x)
    hist(x,
        breaks=seq(0,1,by=.1), # for comparability
        freq=F, main=NA, border=NA)
    abline(v=m, col=2, lwd=2)
    title(paste0('mean= ', round(m,2)), font.main=1)
}
Examine the sampling distribution of the mean across many samples.
# Many sample example
sample_means <- vector(length=500)
for(i in seq_along(sample_means)){
    x <- runif(1000)
    m <- mean(x)
    sample_means[i] <- m
}
hist(sample_means,
    breaks=seq(0.45,0.55,by=.001),
    border=NA, freq=F,
    col=2, font.main=1,
    xlab=expression(hat(M)),
    main='Sampling Distribution of the mean')
In this figure, you see two of the most profound results known in statistics.
There are different variants of the Law of Large Numbers (LLN), but they all say some version of “the sample mean becomes more tightly centered around the true mean as we get more data”.
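For example, here is a minimal sketch of the LLN in action: the running mean of uniform draws settles toward the true mean of \(0.5\) as the sample size grows.
# LLN sketch: running mean of uniform draws approaches the true mean (0.5)
x <- runif(10000)
running_mean <- cumsum(x)/seq_along(x)
plot(running_mean, type='l', col=2,
    xlab='sample size', ylab='running mean')
abline(h=0.5, lty=2)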
There are different variants of the central limit theorem (CLT), but they all say some version of “the sampling distribution of the mean is approximately normal”. For example, the sampling distribution of the mean, shown above, is approximately normal.
Many statistics have an approximately Normal sampling distribution.
It is beyond this class to prove this result mathematically, but you should know that not all sampling distributions are normal. The CLT approximation is better for “large \(n\)” datasets with “well behaved” variances. The CLT also does not apply to “extreme” statistics, such as the sample minimum or maximum.
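For instance, consider the sample maximum of uniform data: its sampling distribution piles up near \(1\) rather than looking bell-shaped.
# The sampling distribution of an "extreme" statistic (the max) is not normal
sample_maxs <- vector(length=500)
for(i in seq_along(sample_maxs)){
    sample_maxs[i] <- max(runif(1000))
}
hist(sample_maxs, breaks=25, border=NA, freq=F,
    main='Sampling distribution of the max', font.main=1)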
The Law of Large Numbers generalizes to many other statistics, like the median or standard deviation.
In fact, the Glivenko-Cantelli Theorem (GCT) shows that the entire empirical distribution converges: the ECDF gets increasingly close to the CDF as the sample size grows. This result is often termed the Fundamental Theorem of Statistics.
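A minimal sketch of this convergence: overlay the ECDF of uniform samples of increasing size on the true CDF.
# ECDF approaches the true CDF as the sample size grows
par(mfrow=c(1,3))
for(n in c(10, 100, 1000)){
    x <- runif(n)
    plot(ecdf(x), main=paste0('n = ', n), font.main=1)
    curve(punif(x), add=TRUE, col=2, lwd=2) # true CDF
}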
Often, we only have one sample. How then can we estimate the sampling distribution of a statistic?
sample_dat <- USArrests[,'Murder']
sample_mean <- mean(sample_dat)
sample_mean
## [1] 7.788
We can “resample” our data. Hesterberg (2015) provides a nice illustration of the idea. The two most basic versions are the jackknife and the bootstrap, which are discussed below.

Note that we do not use the mean of the resampled statistics as a replacement for the original estimate. This is because the resampled distributions are centered at the observed statistic, not the population parameter. (The bootstrapped mean is centered at the sample mean, for example, not the population mean.) This means that we do not use resampling to improve on \(\hat{M}\). We use resampling to estimate sampling variability.
Here, we compute all “leave-one-out” estimates. Specifically, for a dataset with \(n\) observations, each jackknife subsample \(i\) drops observation \(i\) and uses the remaining \(n-1\) observations.
sample_dat <- USArrests[,'Murder']
sample_mean <- mean(sample_dat)
# Jackknife Estimates
n <- length(sample_dat)
jackknife_means <- vector(length=n)
for(i in seq_along(jackknife_means)){
    dat_noti <- sample_dat[-i]
    mean_noti <- mean(dat_noti)
    jackknife_means[i] <- mean_noti
}
hist(jackknife_means, breaks=25,
    border=NA, freq=F,
    main='', xlab=expression(hat(M)[-i]))
abline(v=sample_mean, col='red', lty=2)
Here, we draw \(n\) observations with replacement from the original data to create a bootstrap sample, and we calculate the statistic of interest on that sample. We repeat this for each bootstrap sample \(b=1...B\), with a large \(B\) (say \(B=9999\)), to estimate the sampling distribution.
# Bootstrap estimates
bootstrap_means <- vector(length=9999)
for(b in seq_along(bootstrap_means)){
    boot_id <- sample(n, replace=T)
    dat_b <- sample_dat[boot_id] # c.f. jackknife
    mean_b <- mean(dat_b)
    bootstrap_means[b] <- mean_b
}
hist(bootstrap_means, breaks=25,
    border=NA, freq=F,
    main='', xlab=expression(hat(M)[b]))
abline(v=sample_mean, col='red', lty=2)
Why does this work? The sample \(\{X_{1}, X_{2}, ..., X_{n}\}\) is drawn from a CDF \(F\). Each bootstrap sample \(\{X_{1}^{(b)}, X_{2}^{(b)}, ..., X_{n}^{(b)}\}\) is drawn from the ECDF \(\hat{F}\). With \(\hat{F} \approx F\), each bootstrap sample is approximately a random sample from \(F\). So when we compute a statistic on each bootstrap sample, we approximate the sampling distribution of that statistic.
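To see this concretely, here is a small simulation sketch with an assumed Exponential population (mean \(2\)): the bootstrap means cluster around the sample mean rather than the population mean, echoing the earlier note that resampling estimates variability rather than improving the original estimate.
# Simulation sketch: bootstrap means center on the sample mean (red), not the population mean (blue)
population_mean <- 2 # known mean of an Exponential(rate=1/2) population
x0 <- rexp(100, rate=1/2) # one observed sample
boot_means <- vector(length=9999)
for(b in seq_along(boot_means)){
    boot_means[b] <- mean( x0[sample(100, replace=T)] )
}
hist(boot_means, breaks=25, border=NA, freq=F, main='', xlab='bootstrap means')
abline(v=c(mean(x0), population_mean), col=c('red','blue'), lty=2)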
Note that both the jackknife and the bootstrap provide imperfect estimates of the sampling distribution, and they can give different numbers.
The above procedure works for many different statistics, such as the median.
med <- quantile(sample_dat, prob=0.5)
# Bootstrap estimates
bootstrap_stat <- vector(length=9999)
for(b in seq_along(bootstrap_stat)){
    boot_id <- sample(n, replace=T)
    dat_b <- sample_dat[boot_id] # c.f. jackknife
    stat_b <- quantile(dat_b, prob=0.5)
    bootstrap_stat[b] <- stat_b
}
hist(bootstrap_stat, breaks=25,
    border=NA, freq=F,
    main='', xlab=expression(Med[b]))
abline(v=med, col='red', lty=2)
Using either the bootstrap or jackknife distribution, we typically communicate the variability of a statistic with its Standard Error: the standard deviation of its sampling distribution. In either case, this differs from the standard deviation of the data within your sample.
There are “corrections” that can improve the basic jackknife SEs. Alternatively, we can compute bootstrap SEs.
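As a sketch of the standard formulas, using the jackknife_means and bootstrap_means computed above: the bootstrap SE is simply the standard deviation of the bootstrapped statistics, while the basic jackknife SE rescales the spread of the leave-one-out estimates.
# Bootstrap SE: standard deviation of the bootstrapped statistics
se_boot <- sd(bootstrap_means)
# Basic jackknife SE: rescaled spread of the leave-one-out estimates
se_jack <- sqrt( (n-1)/n * sum( (jackknife_means - mean(jackknife_means))^2 ) )
c(boot=se_boot, jack=se_jack)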
Also note that each additional data point provides more information, which ultimately decreases the standard error of your estimates. This is why statisticians will often recommend that you get more data. However, the standard error decreases at a diminishing rate: each additional observation improves your estimate less than the previous one did. In economics, this is known as diminishing returns, and it is why economists may recommend that you do not get more data.
B <- 999 # number of bootstrap samples
Nseq <- seq(1, length(sample_dat), by=1) # different resample sizes
n_id <- seq_along(sample_dat) # who to potentially resample
## For each sample size, compute the bootstrap SE
SE <- vector(length=length(Nseq))
for(n in seq_along(Nseq)){
    sample_stats_n <- vector(length=B)
    for(b in seq(1,B)){
        b_id <- sample(n_id, size=n, replace=T)
        x_b <- sample_dat[b_id]
        x_stat_b <- mean(x_b) # statistic of interest
        sample_stats_n[b] <- x_stat_b
    }
    se_n <- sd(sample_stats_n) # how much the statistic varies across samples
    SE[n] <- se_n
}
plot(Nseq, SE, pch=16, col=grey(0,.5),
ylab='standard error', xlab='sample size')
To generate a random variable from a known distribution, you can use some type of physical machine. E.g., you can roll a fair die to generate Discrete Uniform data, or roll a weighted die to generate Categorical data.
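In R, we can mimic those physical machines with sample(); the weights below are just illustrative.
# "Rolling" dice computationally
sample(1:6, 10, replace=TRUE) # fair die: Discrete Uniform
sample(1:6, 10, replace=TRUE, prob=c(.1,.1,.1,.1,.1,.5)) # weighted die: Categorical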
There are also several ways to computationally generate random variables from a probability distribution. Perhaps the most common is "inverse sampling": first sample \(p\) from a uniform distribution and then find the associated quantile \(F^{-1}(p)\).
You can generate a random variable from a known empirical distribution. In this case, inverse sampling with the empirical quantile function \(\hat{F}^{-1}\) amounts to randomly selecting observations from the dataset with equal probabilities. To implement this, we can simply resample the observed data with replacement.
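Here is a minimal sketch using the Murder data from above; with type=1, quantile() is the exact inverse of the ECDF, so inverse sampling and resampling with replacement are equivalent.
# Inverse sampling from the empirical distribution
p <- runif(5) # random probabilities
quantile(sample_dat, probs=p, type=1) # \hat{F}^{-1}(p)
# ...equivalent to resampling observations with equal probability
sample(sample_dat, 5, replace=TRUE)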
If you know the distribution function that generates the data, then you can derive the quantile function and do inverse sampling. That is how computers generate random data from a distribution.
# 4 random data points from 3 different distributions (via inverse sampling)
p <- runif(4) # random probabilities
qunif(p)
qexp(p)
qnorm(p)
Here is an in-depth example of the Dagum distribution. The distribution function is \(F(x)=(1+(x/b)^{-a})^{-c}\). For a given probability \(p\), we can then solve for the quantile as \(F^{-1}(p)=\frac{ b p^{\frac{1}{ac}} }{(1-p^{1/c})^{1/a}}\). We then sample \(p\) from a uniform distribution and find the associated quantile.
# Theoretical Quantile Function (from VGAM::qdagum)
qdagum <- function(p, scale.b=1, shape1.a, shape2.c) {
    # Quantile function (theoretically derived from the CDF)
    ans <- scale.b * (expm1(-log(p) / shape2.c))^(-1 / shape1.a)
    # Special known cases
    ans[p == 0] <- 0
    ans[p == 1] <- Inf
    # Safety checks
    ans[p < 0] <- NaN
    ans[p > 1] <- NaN
    if(scale.b <= 0 | shape1.a <= 0 | shape2.c <= 0){ ans <- ans*NaN }
    # Return
    return(ans)
}
# Generate Random Variables (VGAM::rdagum)
rdagum <- function(n, scale.b=1, shape1.a, shape2.c){
    p <- runif(n) # generate random probabilities
    x <- qdagum(p, scale.b=scale.b, shape1.a=shape1.a, shape2.c=shape2.c) # find the inverses
    return(x)
}
# Example
set.seed(123)
X <- rdagum(3000,1,3,1)
X[c(1,2,3)]
## [1] 0.7390476 1.5499868 0.8845006
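As a final check (defining a small pdagum helper from the CDF formula above), the ECDF of the generated sample should track the theoretical CDF, as the Glivenko-Cantelli Theorem suggests.
# Compare the ECDF of the generated data to the theoretical Dagum CDF
pdagum <- function(x, scale.b=1, shape1.a, shape2.c){
    (1 + (x/scale.b)^(-shape1.a))^(-shape2.c) # F(x) = (1+(x/b)^{-a})^{-c}
}
plot(ecdf(X), main='', xlab='x')
curve(pdagum(x, scale.b=1, shape1.a=3, shape2.c=1), add=TRUE, col=2, lwd=2)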