17  Statistical Theory


17.1 Theoretical Statistics

Conditional Expectation.

The conditional expectation function \[\begin{eqnarray} m(x) = \mathbb{E}[Y_i|X_i=x] \end{eqnarray}\] is the average value of \(Y_i\) among observations with \(X_i=x\). In empirical work, this is the population object that regressions try to approximate.

For discrete \(X_i\), it is a weighted average over conditional probabilities: \[\begin{eqnarray} \mathbb{E}[Y_i|X_i=x] = \sum_y y \cdot Prob(Y_i=y|X_i=x). \end{eqnarray}\] For continuous \(X_i\), we interpret \(m(x)\) as a smooth curve indexed by \(x\).

The key property is that conditional expectation is the best mean-squared predictor: \[\begin{eqnarray} m(x) = \arg\min_{a(x)} \mathbb{E}\left[(Y_i-a(X_i))^2\right]. \end{eqnarray}\] This is why local and global least-squares methods are central: both are trying to estimate \(m(x)\) under different shape restrictions.

Example (discrete):

Joint probabilities \(Prob(Y_i=y, X_i=x)\):

| | \(x=0\) | \(x=1\) |
|---|---|---|
| \(y=0\) | 0.30 | 0.10 |
| \(y=1\) | 0.10 | 0.20 |
| \(y=2\) | 0.10 | 0.20 |

From the table, \(Prob(X_i=0)=0.5\) and \(Prob(X_i=1)=0.5\). Then \[\begin{eqnarray} \mathbb{E}[Y_i|X_i=0] &=& 0\cdot 0.6 + 1\cdot 0.2 + 2\cdot 0.2 = 0.6,\\ \mathbb{E}[Y_i|X_i=1] &=& 0\cdot 0.2 + 1\cdot 0.4 + 2\cdot 0.4 = 1.2. \end{eqnarray}\] So moving from \(x=0\) to \(x=1\) increases the conditional mean by \(0.6\).

Code
# Joint probabilities: rows are y, columns are x
y_vals <- c(0,1,2)
x_vals <- c(0,1)
P_yx <- matrix(c(
  0.30, 0.10,
  0.10, 0.20,
  0.10, 0.20
), nrow=3, byrow=TRUE)

# Marginal Prob(X_i=x)
P_x <- colSums(P_yx)

# Conditional probabilities Prob(Y_i=y | X_i=x)
P_y_given_x <- sweep(P_yx, 2, P_x, "/")

# Conditional expectation m(x)=E[Y_i|X_i=x]
m_x <- colSums(P_y_given_x * y_vals)
m_x
## [1] 0.6 1.2
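
The best-predictor property stated earlier can be checked numerically on the same joint table. The sketch below is only an illustration: `mse` is a hypothetical helper name (not from the text) for the expected squared error of a predictor with values \(a(0)\) and \(a(1)\), and `optim` is used as a generic numerical minimizer.

```r
# Joint probabilities from the example: rows are y, columns are x
y_vals <- c(0, 1, 2)
P_yx <- matrix(c(
  0.30, 0.10,
  0.10, 0.20,
  0.10, 0.20
), nrow = 3, byrow = TRUE)

# Expected squared error E[(Y_i - a(X_i))^2] for a = (a(0), a(1))
# (mse is a hypothetical helper, introduced only for this check)
mse <- function(a) {
  sum(P_yx * outer(y_vals, a, "-")^2)
}

# Numerically minimize over both values of the predictor
opt <- optim(c(0, 0), mse, method = "BFGS")
round(opt$par, 3)

# Conditional means computed directly from the table, for comparison
m_x <- colSums(P_yx * y_vals) / colSums(P_yx)
m_x
```

The numerical minimizer should agree with the conditional means up to optimizer tolerance, and any other choice of \(a(0), a(1)\) gives a larger expected squared error.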

Consistency of Local Regression (LLLS/LOESS).

Local least squares methods (LLLS, LOESS) estimate \(m(x)=\mathbb{E}[Y_i|X_i=x]\) using nearby points. Their consistency comes from two conditions as sample size grows:

  • neighborhoods shrink, so local bias decreases
  • local sample size still grows, so local variance decreases

For kernel/local methods this is often written as bandwidth conditions: \[\begin{eqnarray} h_n \to 0 \quad\text{and}\quad n h_n \to \infty. \end{eqnarray}\] In LOESS language, this corresponds to the span shrinking with \(n\), but not too fast.

The simulation below illustrates this at one target point \(x_0\): absolute error in estimating \(m(x_0)\) tends to decrease with larger \(n\).

Code
set.seed(42)

true_m <- function(x) sin(2*x) + 0.5*x
x0 <- 0.25
n_grid <- c(60, 120, 240, 480)
R <- 120

avg_abs_err <- sapply(n_grid, function(n){
  span_n <- min(0.9, 1.8*n^(-1/4)) # shrinks with n
  errs <- replicate(R, {
    x <- runif(n, -1.5, 1.5)
    y <- true_m(x) + rnorm(n, sd=0.35)
    fit <- loess(y~x, span=span_n, degree=1)
    mhat <- predict(fit, newdata=data.frame(x=x0))
    abs(mhat - true_m(x0))
  })
  mean(errs, na.rm=TRUE)
})

plot(n_grid, avg_abs_err, type='b', pch=16, col=2,
     xlab='Sample size (n)',
     ylab='Average |mhat(x0)-m(x0)|',
     main='LLLS/LOESS consistency illustration',
     font.main=1)
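
The bandwidth conditions can also be seen in a hand-rolled local linear estimator. The sketch below makes several assumptions that are not in the text: `local_linear` is a hypothetical helper, the kernel is Gaussian, and the bandwidth rate \(h_n = n^{-1/5}\) is one common choice that satisfies \(h_n \to 0\) and \(n h_n \to \infty\).

```r
set.seed(1)

# Hypothetical helper: local linear estimate of m(x0) with Gaussian kernel weights
local_linear <- function(x, y, x0, h) {
  w <- dnorm((x - x0) / h)               # weights are largest near x0
  fit <- lm(y ~ I(x - x0), weights = w)  # weighted least squares, centered at x0
  unname(coef(fit)[1])                   # intercept is the fitted value at x0
}

true_m <- function(x) sin(2*x) + 0.5*x
x0 <- 0.25

res <- t(sapply(c(100, 1000, 10000), function(n) {
  h <- n^(-1/5)                          # assumed rate: h shrinks, n*h grows
  x <- runif(n, -1.5, 1.5)
  y <- true_m(x) + rnorm(n, sd = 0.35)
  c(n = n, h = h, nh = n * h,
    abs_err = abs(local_linear(x, y, x0, h) - true_m(x0)))
}))
res
```

The `h` column shrinks while the `nh` column grows, which is the pair of conditions above. The error column comes from a single draw per sample size, so it is noisy and need not fall monotonically.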

Bayes’ Theorem.

Bayes’ theorem maps predictive statements into inferential statements. In bivariate form, \[\begin{eqnarray} Prob(X_i=x \mid Y_i=y) &=& \frac{Prob(Y_i=y \mid X_i=x)Prob(X_i=x)}{Prob(Y_i=y)}. \end{eqnarray}\]

Interpretation:

  • \(Prob(X_i=x)\) is the prior probability for \(X_i=x\).
  • \(Prob(Y_i=y|X_i=x)\) is the likelihood of seeing \(Y_i=y\) if \(X_i=x\) is true.
  • \(Prob(X_i=x|Y_i=y)\) is the posterior, your updated probability after seeing \(Y_i=y\).

A useful way to remember this is \[\begin{eqnarray} \text{Posterior} \propto \text{Likelihood} \times \text{Prior}. \end{eqnarray}\] For two states, posterior odds are prior odds times a likelihood ratio.

For a concrete example, suppose a screening test has sensitivity 0.90 and false positive rate 0.08, while prevalence is 0.12: \[\begin{eqnarray} Prob(Y_i=1|X_i=1)=0.90,\quad Prob(Y_i=1|X_i=0)=0.08,\quad Prob(X_i=1)=0.12, \end{eqnarray}\] where \(X_i=1\) means “condition present” and \(Y_i=1\) means “test positive”.

Then \[\begin{eqnarray} Prob(X_i=1|Y_i=1) &=& \frac{0.90\times0.12}{0.90\times0.12 + 0.08\times0.88} \approx 0.605. \end{eqnarray}\] Even with a good test, posterior probability depends strongly on prevalence.

Code
# States: X in {0,1}, signal Y in {0,1}
# Prior Prob(X_i=1)
p_x1 <- 0.12

# Test characteristics
p_y1_x1 <- 0.90
p_y1_x0 <- 0.08

# Law of total probability for Prob(Y_i=1)
p_y1 <- p_y1_x1 * p_x1 + p_y1_x0 * (1 - p_x1)

# Bayes posterior Prob(X_i=1 | Y_i=1)
p_x1_y1 <- (p_y1_x1 * p_x1) / p_y1
p_x1_y1
## [1] 0.6053812

# Also compute Prob(X_i=1 | Y_i=0)
p_y0_x1 <- 1 - p_y1_x1
p_y0_x0 <- 1 - p_y1_x0
p_y0 <- p_y0_x1 * p_x1 + p_y0_x0 * (1 - p_x1)
p_x1_y0 <- (p_y0_x1 * p_x1) / p_y0
p_x1_y0
## [1] 0.01460565
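
The odds form mentioned above (posterior odds equal prior odds times a likelihood ratio) can be checked with the same screening numbers. This is a minimal sketch; `posterior` is a hypothetical helper name, and the extra prevalence values 0.01 and 0.50 are illustrative assumptions rather than values from the example.

```r
# Screening example: sensitivity and false positive rate from the text
p_y1_x1 <- 0.90
p_y1_x0 <- 0.08

# Hypothetical helper: posterior Prob(X_i=1 | Y_i=1) via the odds form
posterior <- function(prevalence) {
  prior_odds <- prevalence / (1 - prevalence)
  lik_ratio  <- p_y1_x1 / p_y1_x0        # likelihood ratio for a positive test
  post_odds  <- prior_odds * lik_ratio   # posterior odds = prior odds x LR
  post_odds / (1 + post_odds)            # convert odds back to a probability
}

# Prevalence from the text
posterior(0.12)

# Illustrative (assumed) prevalences to show how strongly the posterior moves
sapply(c(0.01, 0.12, 0.50), posterior)
```

The first call reproduces the posterior computed from Bayes' theorem directly, since the two forms are algebraically equivalent.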

17.2 Testing Theory

Theoretical Distributions.

Just as with one-sample tests, we can compute a standardized difference, where the difference between the two group means is converted into a statistic. Note, however, that we have to compute the standard error for the difference statistic, which is a bit more complicated. Under the assumption that both populations are independently distributed, we can analytically derive the sampling distribution for the difference between two groups.

In particular, the \(t\)-statistic is used to compare two groups: \[\begin{eqnarray} \hat{t} = \frac{ \hat{M}_{Y1} - \hat{M}_{Y2} }{ \sqrt{\hat{S}_{Y1}^2+\hat{S}_{Y2}^2}/\sqrt{n} }, \end{eqnarray}\] where \(\hat{M}_{Y1}\) and \(\hat{M}_{Y2}\) are the group sample means, \(\hat{S}_{Y1}\) and \(\hat{S}_{Y2}\) are the group sample standard deviations, and both groups have \(n\) observations. With normally distributed means, this statistic follows Student’s t-distribution. Welch’s \(t\)-statistic is an adjustment for two normally distributed populations with potentially unequal variances or sample sizes. With the above assumptions, one can conduct hypothesis tests entirely using math.

Code
# Sample 1 (e.g., males)
n1 <- 100
Y1 <- rnorm(n1, 0, 2)
#hist(Y1, freq=F, main='Sample 1')

# Sample 2 (e.g., females)
n2 <- 80
Y2 <- rnorm(n2, 1, 1)
#hist(Y2, freq=F, main='Sample 2')

t.test(Y1, Y2, var.equal=F)
## 
##  Welch Two Sample t-test
## 
## data:  Y1 and Y2
## t = -4.6719, df = 141.64, p-value = 6.869e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4444412 -0.5855021
## sample estimates:
##  mean of x  mean of y 
## -0.0303432  0.9846285
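
To see what `t.test` computes here, the Welch statistic can be built by hand. The sketch below uses the standard Welch formulas (the standard error \(\sqrt{s_1^2/n_1 + s_2^2/n_2}\) and the Welch-Satterthwaite degrees of freedom). It draws its own seeded samples, so the numbers will differ from the output above.

```r
set.seed(1)
Y1 <- rnorm(100, 0, 2)
Y2 <- rnorm(80, 1, 1)
n1 <- length(Y1)
n2 <- length(Y2)

# Variance of each sample mean
v1 <- var(Y1) / n1
v2 <- var(Y2) / n2

# Welch t-statistic: difference in means over its standard error
t_hat <- (mean(Y1) - mean(Y2)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_hat <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value from the t-distribution
p_val <- 2 * pt(-abs(t_hat), df = df_hat)

# Compare with the built-in test
ref <- t.test(Y1, Y2, var.equal = FALSE)
c(t_hat, df_hat, p_val)
c(ref$statistic, ref$parameter, ref$p.value)
```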

If we want to test for differences in medians across groups with independent observations, we can also use notches in the boxplot. If the notches of two boxes do not overlap, then there is rough evidence that the difference in medians is statistically significant. The width of each box is also drawn proportional to the square root of its sample size. See footnote 1 for how the notches are computed.

Code
Y3 <- rnorm(n1, 3, 3)

boxplot(Y1, Y2, Y3,
    col=c(
        rgb(1,0,0,.5),
        rgb(0,1,0,.5),
        rgb(0,0,1,.5)),
    notch=T,
    varwidth=T)
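
The notch interval described in the footnote can be reproduced by hand. As a sketch: R's `boxplot.stats` documents its notches as the median \(\pm 1.58 \cdot IQR/\sqrt{n}\), with the IQR taken from the hinges returned by `fivenum`. The constant 1.58 is close to the footnote's \(1.7 \times 1.25 / 1.35\). The sample `Y` below is simulated only for this check.

```r
set.seed(1)
Y <- rnorm(100, 3, 3)

# Five-number summary: min, lower hinge, median, upper hinge, max
h <- fivenum(Y)
iqr_h <- h[4] - h[2]                      # hinge-based interquartile range

# Notch limits as documented for boxplot.stats
notch <- h[3] + c(-1.58, 1.58) * iqr_h / sqrt(length(Y))
notch

# Built-in notch limits, for comparison
boxplot.stats(Y)$conf

# Constant implied by the footnote's formula
1.7 * 1.25 / 1.35
```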

There are also theoretical results for distributional comparisons across several groups, such as the Kruskal-Wallis rank test.

Code
library(Ecdat)
data(Caschool)
Caschool[,'stratio'] <- Caschool[,'enrltot']/Caschool[,'teachers']
kruskal.test(Caschool[,'stratio'], Caschool[,'county'])

# Multiple pairwise tests
# pairwise.wilcox.test(Caschool[,'stratio'], Caschool[,'county'])

17.3 Type II Errors

When we test a hypothesis, we start with a claim called the null hypothesis \(H_0\) and an alternative claim \(H_A\). Because we base conclusions on sample data, which has variability, mistakes are possible. There are two types of errors:

  • Type I Error: Rejecting a true null hypothesis (False Positive).
  • Type II Error: Failing to reject a false null hypothesis (False Negative).

| True Situation | Decision: Fail to Reject \(H_0\) | Decision: Reject \(H_0\) |
|---|---|---|
| \(H_0\) is True | Correct (no detection) | Type I Error (False Positive) |
| \(H_0\) is False | Type II Error (False Negative; missed detection) | Correct (effect detected) |

Here is a courtroom example: someone suspected of committing a crime is at trial, and they are either guilty or not (a Bernoulli random variable). You hypothesize that the suspect is innocent, and a jury can either convict them (decide guilty) or free them (decide not-guilty). Recall that failing to reject a hypothesis does not mean accepting it, so deciding not-guilty does not necessarily mean innocent.

| True Situation | Decision: Free | Decision: Convict |
|---|---|---|
| Suspect Innocent | Correctly Freed | Falsely Convicted |
| Suspect Guilty | Falsely Freed | Correctly Convicted |

Statistical Power.

The probability of a Type I Error is called the significance level and denoted by \(Prob(\text{Type I Error}) = \alpha\). The probability of correctly rejecting a false null is called power and denoted by \(\text{Power} = 1 - \beta = 1 - Prob(\text{Type II Error})\).

Significance is often chosen by statistical analysts to be \(\alpha=0.05\). Power is less often chosen, instead following from a decision about significance.

The code below runs a small simulation to estimate power using a shifted, nonparametric bootstrap. It uses a two-sided test with a studentized statistic for \(H_0: \mu = 0\), while the data are generated with a true mean of 0.2.

Code
# Power for Two-sided test;
# nonparametric bootstrap, studentized statistic
n <- 25
mu <- 0
alpha <- 0.05
B <- 299

sim_reps <- 100

p_values <- vector(length=sim_reps)
for (i in seq(p_values)) {
    # Generate data
    X <- rnorm(n, mean=0.2, sd=1)
    # Observed statistic
    X_bar <- mean(X)
    T_obs <-  (X_bar - mu) / (sd(X)/ sqrt(n)) ##studentized
    # Bootstrap null distribution of the statistic
    T_boot <- vector(length=B)
    X_null <- X - X_bar + mu # Impose the null by recentering
    for (b in seq(T_boot)) {
      X_b <- sample(X_null, size = n, replace = TRUE)
      T_b <- (mean(X_b) - mu) / (sd(X_b)/sqrt(n))
      T_boot[b] <- T_b
    }
    # Two-sided bootstrap p-value
    pval <- mean(abs(T_boot) >= abs(T_obs))
    p_values[i] <- pval
}
power <- mean(p_values < alpha)
power

There is an important trade-off for fixed sample sizes: lowering the significance level \(\alpha\) (fewer false positives) often lowers power (more false negatives). Generally, power depends on the effect size and sample size: bigger true effects and larger \(n\) make it easier to detect real differences (higher power, lower \(\beta\)).
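
The trade-off can be sketched with `power.t.test` from base R's stats package, using the same effect size (0.2) and standard deviation (1) as the simulation. Note the assumptions: this is the normal-theory one-sample t-test, not the bootstrap test above, so it serves as a rough benchmark only. `pw` is a hypothetical wrapper name.

```r
# Hypothetical wrapper: analytic power of a two-sided one-sample t-test
pw <- function(n, alpha) {
  power.t.test(n = n, delta = 0.2, sd = 1, sig.level = alpha,
               type = "one.sample", alternative = "two.sided")$power
}

# Benchmark for the simulation design (n = 25, alpha = 0.05)
pw(25, 0.05)

# Fixed n: a stricter (smaller) alpha gives lower power
by_alpha <- sapply(c(0.01, 0.05, 0.10), function(a) pw(25, a))
by_alpha

# Fixed alpha: a larger n gives higher power
by_n <- sapply(c(25, 100, 400), function(n) pw(n, 0.05))
by_n
```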

17.4 Further Reading

Many introductory econometrics textbooks have a good appendix on probability and statistics. There are many useful statistical texts online too.

See the Further reading about Probability Theory in the Statistics chapter.


  1. Let each group \(g\) have median \(\tilde{M}_{g}\), interquartile range \(\hat{IQR}_{g}\), and \(n_{g}\) observations. We can compute the standard deviation of the median as \(\tilde{S}_{g}= \frac{1.25 \hat{IQR}_{g}}{1.35 \sqrt{n_{g}}}\). As a rough guide, the interval \(\tilde{M}_{g} \pm 1.7 \tilde{S}_{g}\) is the historical default and is displayed as a notch in the boxplot. See also https://www.tandfonline.com/doi/abs/10.1080/00031305.1978.10479236.