A sample is a subset of the population. A simple random sample is a sample where each possible sample of size \(n\) has the same probability of being selected.
Note
Code
# Simple random sample (no duplicates, equal probability)
x <- c(1,2,3,4) # population
sample(x, 2, replace=F) # sample
## [1] 3 2
Tip
Factorials are used for counting permutations: the different ways of rearranging \(n\) distinct objects into a sequence. The factorial is denoted as \(n! = 1\times2\times3\times\cdots\times(n-2)\times(n-1)\times n\).
E.g., how many ways are there to order the numbers \(\{1, 2, 3\}\)?
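A quick check in R: the factorial function counts the orderings directly (here \(3!=6\): 123, 132, 213, 231, 312, 321).
Code
# Number of ways to order three distinct numbers
factorial(3)
## [1] 6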
The binomial coefficient counts the subsets of \(k\) elements from a set with \(n\) elements, and is mathematically defined as \(\tbinom{n}{k}=\frac{n!}{k!(n-k)!}\).
For example, how many subsets with \(k=2\) are there for the set \(\{1,2,3,4\}\)?
Code
# Ways to draw k=2 from a set with n=4
# {1,2} {1,3} {1,4}
#       {2,3} {2,4}
#             {3,4}
choose(4,2)
## [1] 6
How many possible samples of two are there from a population with this data: \(\{7,10,122,55\}\)?
Often, we think of the population as being infinitely large. This is an approximation that makes mathematical and computational work much simpler.
Code
# All possible samples of two from an enormous bag of numbers {1,2,3,4}
# {1,1} {1,2} {1,3} {1,4}
#       {2,2} {2,3} {2,4}
#             {3,3} {3,4}
#                   {4,4}
# Simple random sample (duplicates, equal probability)
sample(x, 2, replace=T)
## [1] 2 4
Intuition for infinite populations: imagine drawing names from a giant urn. If the urn has only \(10\) names, then removing one name slightly changes the composition of the urn, and the probabilities shift for the next name you draw. Now imagine the urn has \(100\) billion names, so that removing one makes no noticeable difference. We can pretend the composition never changes: each draw is essentially identical and independent (iid). We can actually guarantee the names are iid by putting any names drawn back into the urn (sampling with replacement).
In R, you can sample from a continuous population using the rXXX functions, where XXX is the population type (unif, exp, norm).
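For instance, a minimal sketch drawing five observations from each type:
Code
# Five draws from each continuous population type
runif(5) # uniform on [0,1]
rexp(5)  # exponential with rate 1
rnorm(5) # standard normal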
The sampling distribution of a statistic shows us how much a statistic varies from sample to sample.
For example, the sampling distribution of the mean shows how the sample mean varies from sample to sample. The sampling distribution of the mean can also be referred to as the probability distribution of the sample mean.
Note
Given ages for a population of \(4\) students, compute the sampling distribution for the mean with samples of \(n=2\).
Now compute the sampling distribution for the median with samples of \(n=3\).
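A sketch of the first exercise, assuming a hypothetical set of ages (the actual ages are not shown here); the same pattern works for the median with \(n=3\).
Code
# Hypothetical ages for the 4 students
ages <- c(18, 19, 20, 21)

# All samples of n=2 (without replacement), one per column
samples_2 <- combn(ages, 2)
sample_means <- apply(samples_2, 2, mean)
table(sample_means) / ncol(samples_2) # sampling distribution of the mean

# All samples of n=3 and their medians
samples_3 <- combn(ages, 3)
sample_medians <- apply(samples_3, 2, median)
table(sample_medians) / ncol(samples_3) # sampling distribution of the median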
Note
To approximate an infinite population, expand the loop below to 3000 samples and then make a histogram.
Code
# Three Sample Example from infinite population
x1 <- runif(100)
m1 <- mean(x1)
x2 <- runif(100)
m2 <- mean(x2)
x3 <- runif(100)
m3 <- mean(x3)
sample_means <- c(m1, m2, m3)
sample_means
## [1] 0.4999462 0.5447282 0.5097730

# An Equivalent Approach: fill vector in a loop
sample_means <- vector(length=3)
for(i in seq(sample_means)){
    x <- runif(100)
    m <- mean(x)
    sample_means[i] <- m
}
sample_means
## [1] 0.4941434 0.5083138 0.5060991
# Three Sample Example w/ Visual
par(mfrow=c(1,3))
for(b in 1:3){
    x <- runif(100)
    m <- mean(x)
    hist(x,
        breaks=seq(0,1,by=.1), # for comparability
        freq=F, main=NA, border=NA)
    abline(v=m, col=2, lwd=2)
    title(paste0('mean= ', round(m,2)), font.main=1)
}
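One way to complete the exercise above, expanding the loop to 3000 samples:
Code
# 3000 samples from an (approximately) infinite population
sample_means <- vector(length=3000)
for(i in seq_along(sample_means)){
    x <- runif(100)
    sample_means[i] <- mean(x)
}
hist(sample_means, border=NA, freq=F,
    xlab=expression(hat(M)), main='3000 sample means')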
Examine the sampling distribution of the mean
Code
# Many sample example
sample_means <- vector(length=500)
for(i in seq_along(sample_means)){
    x <- runif(1000)
    m <- mean(x)
    sample_means[i] <- m
}
hist(sample_means, breaks=seq(0.45,0.55,by=.001),
    border=NA, freq=F,
    col=2, font.main=1, xlab=expression(hat(M)),
    main='Sampling Distribution of the mean')
In this figure, you see two of the most profound results in statistics:
Law of Large Numbers: the sample mean is centered around the true mean, and more tightly centered with more data
Central Limit Theorem: the sampling distribution of the mean is approximately Normal.
Law of Large Numbers.
There are different variants of the Law of Large Numbers (LLN), but they all say some version of “the sample mean is centered around the true mean, and more tightly centered with more data”.
Note
Notice where the sampling distribution is centered
par(mfrow=c(1,3))
for(n in c(5,50,500)){
    sample_means_n <- vector(length=299)
    for(i in seq_along(sample_means_n)){
        x <- runif(n)
        m <- mean(x)
        sample_means_n[i] <- m
    }
    hist(sample_means_n, breaks=seq(0,1,by=.01),
        border=NA, freq=F,
        col=2, font.main=1, xlab=expression(hat(M)),
        main=paste0('n=',n) )
}
Tip
Plot the variability of the sample mean as a function of sample size
Code
n_seq <- seq(1, 40)
sd_seq <- vector(length=length(n_seq))
for(n in seq_along(sd_seq)){
    sample_means_n <- vector(length=499)
    for(i in seq_along(sample_means_n)){
        x <- runif(n)
        m <- mean(x)
        sample_means_n[i] <- m
    }
    sd_seq[n] <- sd(sample_means_n)
}
plot(n_seq, sd_seq, pch=16, col=grey(0,0.5),
    xlab='n', ylab='sd of sample means', main='')
Central Limit Theorem.
There are different variants of the central limit theorem (CLT), but they all say some version of “the sampling distribution of the mean is approximately normal”. For example, the sampling distribution of the mean, shown above, is approximately normal.
Note
Code
hist(sample_means,
    breaks=seq(0.45,0.55,by=.001),
    border=NA, freq=F,
    col=2, font.main=1,
    xlab=expression(hat(M)),
    main='Sampling Distribution of the mean')

## Approximately normal?
mu <- mean(sample_means)
mu_sd <- sd(sample_means)
x <- seq(0.1, 0.9, by=0.001)
fx <- dnorm(x, mu, mu_sd)
lines(x, fx, col='red')
Tip
For an example with another statistic, let’s examine the sampling distribution of the standard deviation.
Code
# CLT example of the "sd" statistic
sample_sds <- vector(length=1000)
for(i in seq_along(sample_sds)){
    x <- runif(100) # same distribution
    s <- sd(x) # different statistic
    sample_sds[i] <- s
}
hist(sample_sds,
    breaks=seq(0.2,0.4,by=.01),
    border=NA, freq=F,
    col=4, font.main=1,
    xlab=expression(hat(S)),
    main='Sampling Distribution of the sd')

## Approximately normal?
mu <- mean(sample_sds)
mu_sd <- sd(sample_sds)
x <- seq(0.1, 0.9, by=0.001)
fx <- dnorm(x, mu, mu_sd)
lines(x, fx, col='blue')
Code
# try another random variable, such as rexp(100) instead of runif(100)
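For example, a sketch of the same loop with exponential draws instead of uniform:
Code
# Same loop as above, but with exponential data
sample_sds_exp <- vector(length=1000)
for(i in seq_along(sample_sds_exp)){
    x <- rexp(100)
    sample_sds_exp[i] <- sd(x)
}
hist(sample_sds_exp, border=NA, freq=F, font.main=1,
    xlab=expression(hat(S)), main='Sampling Distribution of the sd (exponential data)')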
It is beyond this class to prove this result mathematically, but you should know that not all sampling distributions are approximately normal. The CLT approximation is better for “large \(n\)” datasets with “well behaved” variances. The CLT also does not apply to “extreme” statistics.
Note
For an example of “extreme” statistics, examine the sampling distributions of the min and max.
Code
# Create 300 samples, each with 1000 random uniform variables
x_samples <- matrix(nrow=300, ncol=1000)
for(i in seq(1, nrow(x_samples))){
    x_samples[i,] <- runif(1000)
}
# Each row is a new sample
length(x_samples[1,])
## [1] 1000

# Compute min and max for each sample
x_mins <- apply(x_samples, 1, quantile, probs=0)
x_maxs <- apply(x_samples, 1, quantile, probs=1)

# Plot the sampling distributions of the min and max
# Unlike the median, the minimum and maximum do not look normal!
par(mfrow=c(1,2))
hist(x_mins, breaks=100, main='Min', font.main=1,
    xlab=expression(hat(Q)[0]), border=NA, freq=F)
hist(x_maxs, breaks=100, main='Max', font.main=1,
    xlab=expression(hat(Q)[1]), border=NA, freq=F)
title('Sampling Distributions', outer=T, line=-1, adj=0)
Explore another function, such as
my_function <- function(x){ diff(range(x)) }
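A sketch of its sampling distribution, reusing the x_samples matrix from above:
Code
# Sampling distribution of the range (max minus min), one value per sample (row)
sample_ranges <- apply(x_samples, 1, my_function)
hist(sample_ranges, breaks=100, border=NA, freq=F, font.main=1,
    xlab='Sample Range', main='Sampling Distribution of the range')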
Tip
Here is an example where the variance is not “well behaved”.
Code
sample_means <- vector(length=999)
for(i in seq_along(sample_means)){
    x <- rcauchy(1000,0,10)
    m <- mean(x)
    sample_means[i] <- m
}
hist(sample_means, breaks=100,
    main='', border=NA, freq=F) # Tails look too "fat"
The Fundamental Theorem of Statistics.
The Law of Large Numbers generalizes to many other statistics, like median or sd.1
Note
Here is a sampling distribution for a quantile, for three different sample sizes
Code
par(mfrow=c(1,3))
for(n in c(5,50,500)){
    sample_quants_n <- vector(length=299)
    for(i in seq_along(sample_quants_n)){
        x <- runif(n)
        m <- quantile(x, probs=0.75) # upper quartile
        sample_quants_n[i] <- m
    }
    hist(sample_quants_n, breaks=seq(0,1,by=.01),
        border=NA, freq=F,
        col=2, font.main=1, xlab="Sample Quantile",
        main=paste0('n=',n) )
}
Here is a sampling distribution for a proportion, for three different sample sizes
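A sketch of that example, assuming the proportion of interest is the share of draws below 0.5:
Code
par(mfrow=c(1,3))
for(n in c(5,50,500)){
    sample_props_n <- vector(length=299)
    for(i in seq_along(sample_props_n)){
        x <- runif(n)
        sample_props_n[i] <- mean(x < 0.5) # sample proportion below 0.5
    }
    hist(sample_props_n, breaks=seq(0,1,by=.01),
        border=NA, freq=F,
        col=2, font.main=1, xlab="Sample Proportion",
        main=paste0('n=',n) )
}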
In fact, the Glivenko-Cantelli Theorem (GCT) shows that the entire empirical distribution converges: the ECDF gets increasingly close to the CDF as the sample size grows. This result is often termed the Fundamental Theorem of Statistics.
Note
Code
par(mfrow = c(1, 3))
for (n in c(50, 500, 5000)) {
    x <- runif(n)
    Fx <- ecdf(x)
    plot(Fx)
}
5.1 Resampling
Often, we only have one sample. How then can we estimate the sampling distribution of a statistic?
We can “resample” our data. Hesterberg (2015) provides a nice illustration of the idea. The two most basic versions are the jackknife and the bootstrap, which are discussed below.
Note that we do not use the mean of the resampled statistics as a replacement for the original estimate. This is because the resampled distributions are centered at the observed statistic, not the population parameter. (The bootstrapped mean is centered at the sample mean, for example, not the population mean.) This means that we do not use resampling to improve on \(\hat{M}\). We use resampling to estimate sampling variability.
Jackknife Distribution.
Here, we compute all “leave-one-out” estimates. Specifically, for a dataset with \(n\) observations, each of the \(n\) unique subsamples uses the \(n-1\) observations other than observation \(i\).
mean of the jackknife: \(\bar{\hat{M}}^{\text{jack}}=\frac{1}{n} \sum_{i=1}^{n} \hat{M}_{i}^{\text{jack}}\).
standard deviation of the jackknife: \(\hat{SE}^{\text{jack}}= \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left[\hat{M}_{i}^{\text{jack}} - \bar{\hat{M}}^{\text{jack}} \right]^2 }\).
Given the sample \(\{1,6,7,22\}\), compute the jackknife estimate of the median. Show the result mathematically by hand and also with the computer.
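A sketch of the computation. By hand, the four leave-one-out medians are 7, 7, 6, and 6, so the jackknife mean is 6.5 and the jackknife SE (using the definition above) is 0.5.
Code
# Jackknife distribution of the median for the sample {1, 6, 7, 22}
x_obs <- c(1, 6, 7, 22)
n <- length(x_obs)
jack_medians <- vector(length=n)
for(i in seq(n)){
    jack_medians[i] <- median(x_obs[-i]) # leave observation i out
}
jack_medians
## [1] 7 7 6 6
mean(jack_medians) # mean of the jackknife
## [1] 6.5
sqrt(mean((jack_medians - mean(jack_medians))^2)) # jackknife SE, as defined above
## [1] 0.5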
Bootstrap Distribution.
Here, we draw \(n\) observations with replacement from the original data to create a bootstrap sample and calculate a statistic. Each bootstrap sample \(b=1,...,B\) is a random resample of the observations, from which we compute a statistic. We repeat this many times, say \(B=9999\), to estimate the sampling distribution.
Note
Taking the mean, for example, we have
bootstrapped estimate: \(\hat{M}_{b}^{\text{boot}}= \frac{1}{n} \sum_{i=1}^{n} \hat{X}_{i}^{(b)}\), for resampled data \(\hat{X}_{i}^{(b)}\)
mean of the bootstrap: \(\bar{\hat{M}}^{\text{boot}}= \frac{1}{B} \sum_{b=1}^{B} \hat{M}_{b}^{\text{boot}}\).
standard deviation of the bootstrap: \(\hat{SE}^{\text{boot}}= \sqrt{ \frac{1}{B} \sum_{b=1}^{B} \left[\hat{M}_{b}^{\text{boot}} - \bar{\hat{M}}^{\text{boot}} \right]^2}\).
Given the sample \(\{1,6,7,22\}\), compute the bootstrap estimate of the median, with \(B=8\). Show the result with the computer and then show what is happening in each sample by hand.
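A sketch of the computation; each boot_medians value comes from one resample of size four drawn with replacement, so your numbers will vary.
Code
# Bootstrap distribution of the median for the sample {1, 6, 7, 22}, with B=8
x_obs <- c(1, 6, 7, 22)
B <- 8
boot_medians <- vector(length=B)
for(b in seq(B)){
    x_b <- sample(x_obs, length(x_obs), replace=TRUE) # resample with replacement
    boot_medians[b] <- median(x_b)
}
boot_medians
mean(boot_medians) # mean of the bootstrap
sqrt(mean((boot_medians - mean(boot_medians))^2)) # bootstrap SE, as defined above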
Why does this work? The sample \(\{\hat{X}_{1}, \hat{X}_{2}, ... \hat{X}_{n}\}\) is drawn from a CDF \(F\). Each bootstrap sample \(\{\hat{X}_{1}^{(b)}, \hat{X}_{2}^{(b)}, ... \hat{X}_{n}^{(b)}\}\) is drawn from the ECDF \(\hat{F}\). With \(\hat{F} \approx F\), each bootstrap sample is approximately a random sample. So when we compute a statistic on each bootstrap sample, we approximate the sampling distribution of the statistic.
Tip
Here is an intuitive example with Bernoulli random variables (unfair coin flips)
Code
# theoretical probabilities
x <- c(0,1)
x_probs <- c(1/4, 3/4)

# sample draws
coin_sample <- sample(x, prob=x_probs, 1000, replace=T)
Fhat <- ecdf(coin_sample)
x_probs_boot <- c(Fhat(0), 1-Fhat(0))
x_probs_boot # approximately the theoretical value
## [1] 0.239 0.761

coin_resample <- sample(x, prob=x_probs_boot, 999, replace=T)
# any draw from here is almost the same as the original process
Standard Errors.
Using either the bootstrap or jackknife distribution, we can estimate the variability of the sampling distribution of a statistic via the Standard Error. In either case, this differs from the standard deviation of the data within your sample.
sample standard deviation: variability of individual observations within a single sample.
sample standard error: variability of a statistic across repeated samples.
Code
sd(sample_dat) # standard deviation
## [1] 4.35551
sd(bootstrap_means) # standard error
## [1] 0.6040716
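Since sample_dat and bootstrap_means come from earlier code, here is a self-contained sketch of the same comparison with hypothetical data (the numbers will differ from the output above):
Code
# Hypothetical data: one sample of n=100 observations
sample_dat <- rnorm(100, mean=10, sd=4)

# Bootstrap the mean
bootstrap_means <- vector(length=9999)
for(b in seq_along(bootstrap_means)){
    x_b <- sample(sample_dat, length(sample_dat), replace=TRUE)
    bootstrap_means[b] <- mean(x_b)
}

sd(sample_dat)      # standard deviation: spread of the observations
sd(bootstrap_means) # standard error: spread of the sample mean across resamples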
Note that both jackknife and bootstrap resampling methods provide imperfect estimates, and they can give different numbers. Jackknife resamples are often less variable than they should be, and they use \(n-1\) observations instead of \(n\). Bootstrap resamples have the right \(n\) but often contain duplicated data. Until you know more, a conservative rule-of-thumb is to take the larger estimate (often the bootstrap). The same rule-of-thumb applies when comparing against theoretically derived confidence intervals.
Also note that each additional data point provides more information, which ultimately decreases the standard error of your estimates. This is why statisticians will often recommend that you get more data. However, the standard error improves at a diminishing rate. In economics, this is known as diminishing returns, and it is why economists may recommend that you do not get more data.
Code
B <- 1000 # number of simulated samples per sample size
Nseq <- seq(1,100, by=1) # different sample sizes
SE <- vector(length=length(Nseq))
for(n in Nseq){
    sample_statistics_n <- vector(length=B)
    for(b in seq(1,B)){
        x_b <- rnorm(n) # Sample of size n
        x_stat_b <- quantile(x_b, probs=.4) # Stat of interest
        sample_statistics_n[b] <- x_stat_b
    }
    se_n <- sd(sample_statistics_n) # How much the stat varies across samples
    SE[n] <- se_n
}
plot(Nseq, SE, pch=16, col=grey(0,.5),
    ylab='standard error', xlab='sample size')
When a statistic converges in probability to the quantity it is meant to estimate, it is called consistent.↩︎