20  Comparing Multiple Groups


With multiple groups, you will want to begin with a summary figure (such as a boxplot). We can also test the equality of all means or all distributions (whether at least one group is different).

A practical workflow for multiple groups has three steps:

  1. Visualize all groups together.
  2. Run one global test.
  3. Run post-hoc pairwise tests with multiplicity correction.
Code
# USArrests grouped by US Census region
dat <- USArrests
dat$Region <- state.region

# Step 1: Visual summary
boxplot(Murder ~ Region, data = dat,
    col = hcl.colors(4, alpha = .45),
    ylab = "Murder arrests (per 100k)", xlab = "Region", main = "")
title("Murder rates by region", font.main = 1)

20.1 Equal Means (ANOVA)

In Comparing Groups, we tested whether two groups had the same mean. We now extend this to \(G\) groups. The Analysis of Variance (ANOVA) approach asks: are the observed differences in group means larger than we would expect from random variation alone?

Decomposition.

The key idea is to decompose total variation into between-group and within-group components. Label each observation \(\hat{Y}_{ig}\) for individual \(i\) in group \(g\). Each group \(g=1,\dots,G\) has \(n_g\) observations with group mean \(\hat{M}_{g}\), and the grand mean across all \(n=\sum_g n_g\) observations is \(\hat{M}_{Y}\). Then the total sum of squares decomposes as \[\begin{eqnarray} \underbrace{\sum_{g=1}^{G}\sum_{i=1}^{n_g}(\hat{Y}_{ig}-\hat{M}_{Y})^2}_{\hat{TSS}} &=& \underbrace{\sum_{g=1}^{G} n_g (\hat{M}_{g}-\hat{M}_{Y})^2}_{\hat{BSS}} \;+\; \underbrace{\sum_{g=1}^{G}\sum_{i=1}^{n_g}(\hat{Y}_{ig}-\hat{M}_{g})^2}_{\hat{WSS}}, \end{eqnarray}\] where \(\hat{BSS}\) is the between-group sum of squares (variation in group means around the grand mean) and \(\hat{WSS}\) is the within-group sum of squares (variation of observations around their own group mean).

If the group means are all equal, then \(\hat{BSS}\) should be small relative to \(\hat{WSS}\).

The F-statistic.

We test \(H_0: \mu_1=\mu_2=\dots=\mu_G\) versus \(H_A:\) at least one \(\mu_g\) differs. The ANOVA F-statistic is \[\begin{eqnarray} \hat{F} &=& \frac{\hat{BSS}/(G-1)}{\hat{WSS}/(n-G)}. \end{eqnarray}\] The numerator is the average between-group variation (per degree of freedom), and the denominator is the average within-group variation. A large \(\hat{F}\) means group means are more spread out than individual-level noise would predict.

Under additional parametric assumptions (independent observations, equal variances, normal errors), \(\hat{F}\) follows an \(F_{G-1,\;n-G}\) distribution. In practice, we can also use permutation or bootstrap methods to build a null distribution, just as in previous chapters.
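As a sketch of the permutation approach (using the region data from above; 2000 permutations is an arbitrary choice): shuffle the group labels to break any true group effect, recompute \(\hat{F}\) each time, and see where the observed statistic falls.

```r
# Permutation null distribution for the ANOVA F-statistic
dat <- USArrests
dat$Region <- state.region
F_obs <- summary(aov(Murder ~ Region, data = dat))[[1]][['F value']][1]

set.seed(1)
F_null <- replicate(2000, {
    g_perm <- sample(dat$Region)  # shuffled labels carry no real group effect
    summary(aov(dat$Murder ~ g_perm))[[1]][['F value']][1]
})
mean(F_null >= F_obs)  # permutation p-value
```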

Suppose we have \(G=3\) groups with the following data:

Group A   Group B   Group C
      4         8         5
      5         7         6
      6         9         7

Group means: \(\hat{M}_A = 5\), \(\hat{M}_B = 8\), \(\hat{M}_C = 6\). Grand mean: \(\hat{M}_Y = (5+8+6)/3 = 19/3 \approx 6.33\). Each group has \(n_g = 3\).

Between-group sum of squares: \[\begin{eqnarray*} \hat{BSS} &=& 3(5-19/3)^2 + 3(8-19/3)^2 + 3(6-19/3)^2 \\ &=& 3(16/9) + 3(25/9) + 3(1/9) = 48/9 + 75/9 + 3/9 = 14 \end{eqnarray*}\]

Within-group sum of squares: \[\begin{eqnarray*} \hat{WSS} &=& [(4-5)^2+(5-5)^2+(6-5)^2] + [(8-8)^2+(7-8)^2+(9-8)^2] \\ && + [(5-6)^2+(6-6)^2+(7-6)^2] = 2+2+2 = 6 \end{eqnarray*}\]

Check: \(\hat{TSS} = 14+6 = 20\).

F-statistic: \(\hat{F} = \frac{14/2}{6/6} = 7\). With \(G-1=2\) and \(n-G=6\) degrees of freedom, we compare to an \(F_{2,6}\) distribution: \(p = 1 - F_{2,6}(7) \approx 0.027\).

Code
# Verify by hand
Y <- c(4,5,6, 8,7,9, 5,6,7)
g <- factor(c('A','A','A','B','B','B','C','C','C'))
n <- length(Y)
G <- nlevels(g)

M_Y <- mean(Y)
M_g <- tapply(Y, g, mean)

BSS <- sum(tapply(Y, g, length) * (M_g - M_Y)^2)
WSS <- sum(tapply(Y, g, function(y) sum((y - mean(y))^2)))

Fstat <- (BSS/(G-1)) / (WSS/(n-G))
pval  <- 1 - pf(Fstat, G-1, n-G)
cbind(BSS, WSS, Fstat, pval)
##      BSS WSS Fstat  pval
## [1,]  14   6     7 0.027

ANOVA in R.

In R, aov() fits the ANOVA model and summary() returns the standard ANOVA table. Equivalently, anova() applied to a regression with a single factor gives the same result.

Code
# USArrests grouped by US Census region
dat <- USArrests
dat$Region <- state.region

# Fit ANOVA
fit_aov <- aov(Murder ~ Region, data = dat)
summary(fit_aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Region       3  391.2   130.4   11.14 1.28e-05 ***
## Residuals   46  538.3    11.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table reports \(\hat{BSS}\) (“Sum Sq” for Region), \(\hat{WSS}\) (“Sum Sq” for Residuals), and the \(F\)-statistic with its \(p\)-value.

Code
# Equivalent via lm
fit_lm <- lm(Murder ~ Region, data = dat)
anova(fit_lm)
## Analysis of Variance Table
## 
## Response: Murder
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Region     3 391.24 130.412  11.144 1.282e-05 ***
## Residuals 46 538.32  11.703                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Connection to regression.

ANOVA is a special case of linear regression where the only explanatory variables are group indicators. The model \(\hat{Y}_{ig} = b_0 + \sum_{g=2}^{G} b_g \hat{D}_{ig} + e_{ig}\) produces the same F-statistic as above. Here \(b_0\) estimates the mean of the reference group and each \(b_g\) estimates the difference between group \(g\) and the reference.

Code
# Regression coefficients = group mean differences
coef(fit_lm)
##         (Intercept)         RegionSouth RegionNorth Central          RegionWest 
##            4.700000            7.006250            1.000000            2.330769

# Compare to group means
M_g <- tapply(dat$Murder, dat$Region, mean)
M_g
##     Northeast         South North Central          West 
##      4.700000     11.706250      5.700000      7.030769

# b0 = mean of reference group; b_g = mean(g) - mean(reference)
M_g - M_g[1]
##     Northeast         South North Central          West 
##      0.000000      7.006250      1.000000      2.330769

Assumptions and alternatives.

The parametric F-test assumes (i) independent observations, (ii) normal errors within each group, and (iii) equal variances across groups (homoscedasticity). When these assumptions are doubtful — especially equal variances — the nonparametric Kruskal-Wallis test (below) is a safer alternative. As a quick diagnostic, compare side-by-side boxplots: if the spread differs visibly across groups, be cautious with the parametric test.
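As a quick numeric companion to that diagnostic: compare the group standard deviations directly, and note that `oneway.test()` implements Welch's version of the F-test, which drops the equal-variance assumption.

```r
# Do the groups have similar spread?
dat <- USArrests
dat$Region <- state.region
tapply(dat$Murder, dat$Region, sd)

# Welch's ANOVA: an F-test that does not assume equal variances
oneway.test(Murder ~ Region, data = dat)
```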

20.2 Equal Distributions

The Kruskal-Wallis test examines \(H_0:\; F_1 = F_2 = \dots = F_G\) versus \(H_A:\) at least one \(F_g\) differs, where \(F_g\) is the continuous distribution of group \(g=1,\dots,G\). This test does not tell us which group is different.

To conduct the test, first denote individuals \(i=1,\dots,n\) with overall ranks \(\hat{r}_1,\dots,\hat{r}_{n}\). Each individual belongs to group \(g=1,\dots,G\), and each group \(g\) has \(n_{g}\) individuals with average rank \(\bar{r}_{g} = \sum_{i \in g} \hat{r}_{i} /n_{g}\). The Kruskal-Wallis statistic is \[\begin{eqnarray} \hat{KW} &=& (n-1) \frac{\sum_{g=1}^{G} n_{g}( \bar{r}_{g} - \bar{r} )^2 }{\sum_{i=1}^{n} ( \hat{r}_{i} - \bar{r} )^2}, \end{eqnarray}\] where \(\bar{r} = \frac{n+1}{2}\) is the grand mean rank.
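The formula can be verified by hand against `kruskal.test()` on the region data from above (with tied observations, R uses average ranks and a tie correction, which the rank-variance denominator here reproduces):

```r
# Compute the Kruskal-Wallis statistic by hand and compare to kruskal.test()
dat <- USArrests
dat$Region <- state.region

r    <- rank(dat$Murder)               # overall ranks (ties get average ranks)
n    <- length(r)
rg   <- tapply(r, dat$Region, mean)    # average rank per group
ng   <- tapply(r, dat$Region, length)  # group sizes
rbar <- mean(r)                        # grand mean rank, (n+1)/2

KW <- (n - 1) * sum(ng * (rg - rbar)^2) / sum((r - rbar)^2)
c(KW = KW, KW_R = unname(kruskal.test(Murder ~ Region, data = dat)$statistic))
```

Both numbers should match the chi-squared statistic that `kruskal.test()` reports.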

In the special case with only two groups, the Kruskal-Wallis test reduces to the Mann–Whitney U test (also known as the Wilcoxon rank-sum test). In this case, we can write the hypotheses in terms of individual outcomes in each group, \(Y_i\) in one group and \(Y_j\) in the other; \(H_0: Prob(Y_i > Y_j)=Prob(Y_j > Y_i)\) versus \(H_A: Prob(Y_i > Y_j) \neq Prob(Y_j > Y_i)\). The corresponding test statistic is \[\begin{eqnarray} \hat{U} &=& \min(\hat{U}_1, \hat{U}_2) \\ \hat{U}_g &=& \sum_{i\in g}\sum_{j\in -g} \Bigl[\mathbf 1( \hat{Y}_{i} > \hat{Y}_{j}) + \tfrac12\mathbf 1(\hat{Y}_{i} = \hat{Y}_{j})\Bigr]. \end{eqnarray}\]
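As a concrete sketch with two of the regions (South vs West, an arbitrary choice), \(\hat{U}_1\) and \(\hat{U}_2\) can be computed directly; `wilcox.test()` reports `W`, which corresponds to the U statistic for its first sample:

```r
# Two-group special case: compute the U statistics by hand
y1 <- USArrests$Murder[state.region == "South"]
y2 <- USArrests$Murder[state.region == "West"]

U1 <- sum(outer(y1, y2, ">")) + 0.5 * sum(outer(y1, y2, "=="))
U2 <- sum(outer(y2, y1, ">")) + 0.5 * sum(outer(y2, y1, "=="))
U  <- min(U1, U2)

# wilcox.test's W equals U1 (ties trigger a warning about exact p-values)
c(U1 = U1, U2 = U2, U = U, W = unname(wilcox.test(y1, y2)$statistic))
```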

Interpretation template:

  • If the Kruskal-Wallis p-value is large, keep the global null and stop.
  • If the p-value is small, report that at least one group differs and then inspect adjusted pairwise p-values.
  • Report both statistical and practical importance (distribution plots + effect size summaries), not only p-values.
Code
# Step 2: Global null test
kw <- kruskal.test(Murder ~ Region, data = dat)
kw
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Murder by Region
## Kruskal-Wallis chi-squared = 19.345, df = 3, p-value = 0.000232

# Step 3: Post-hoc pairwise tests, adjusted p-values
pw <- pairwise.wilcox.test(
    x = dat$Murder, g = dat$Region,
    p.adjust.method = "bonferroni")
pw$p.value
##                 Northeast       South North Central
## South         0.003660173          NA            NA
## North Central 1.000000000 0.007496818            NA
## West          0.427842562 0.018410457             1

20.3 Regression

Factor Variables.

So far, we have discussed cardinal data, where the difference between units always means the same thing: e.g., \(4-3=2-1\). There are also factor variables:

  • Ordered: refers to Ordinal data. The difference between units means something, but not always the same thing. For example, \(4th - 3rd \neq 2nd - 1st\).
  • Unordered: refers to Categorical data. The difference between units is meaningless. For example, \(D-C=?\)

To analyze either type of factor, we often convert it into indicator variables, or dummies: \(\hat{D}_{c}=\mathbf{1}( \text{Factor} = c)\). One common case arises when you observe individuals over time periods, giving you two factor variables: an unordered factor that indicates who an individual is, e.g. \(\hat{D}_{i}=\mathbf{1}( \text{Individual} = i)\), and an ordered factor that indicates the time period, e.g. \(\hat{D}_{t}=\mathbf{1}( \text{Time} \in [\text{month}~ t, \text{month}~ t+1) )\). Factor variables appear in many other settings too, including spatial IDs in purely cross-sectional data.
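`model.matrix()` makes the conversion to dummies explicit; a minimal sketch:

```r
# How R expands a factor into indicator (dummy) variables
f <- factor(c('A', 'B', 'C', 'B'))
model.matrix(~ f)        # reference coding: intercept plus dummies for B and C
model.matrix(~ -1 + f)   # one dummy per level, no intercept
```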

Be careful not to handle categorical data as if they were cardinal. E.g., do not generate city data with Leipzig=1, Lausanne=2, LosAngeles=3, … and then include city as if it were a cardinal number (that's a big no-no). The same applies to ordinal data: coding population ranks as PopulationLeipzig=2, PopulationLausanne=3, PopulationLosAngeles=1 and treating them as cardinal is equally problematic.
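A small simulated sketch of the mistake (the city codes and group means are made up for illustration):

```r
# Treating a categorical ID as a number forces a meaningless linear "city effect"
set.seed(2)
city <- rep(1:3, each = 20)           # 1=Leipzig, 2=Lausanne, 3=LosAngeles (codes arbitrary)
y    <- c(5, 9, 6)[city] + rnorm(60)  # true group means 5, 9, 6

coef(lm(y ~ city))           # wrong: one slope on the arbitrary code
coef(lm(y ~ factor(city)))   # right: one indicator per city
```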

Code
N <- 1000
x <- runif(N,3,8)
e <- rnorm(N,0,0.4)
fo <- factor(rbinom(N,4,.5), ordered=T)
fu <- factor(rep(c('A','B'),N/2), ordered=F)
dA <- 1*(fu=='A')
y <- (2^as.integer(fo)*dA )*sqrt(x)+ 2*as.integer(fo)*e
dat_f <- data.frame(y,x,fo,fu)

With factors, you can still include them in the design matrix of an OLS regression. For example, \[\begin{eqnarray} \hat{Y}_{i} = \sum_{k=0}^{K} b_k \hat{X}_{ik} + \sum_{t}\hat{D}_{t}b_{t} \end{eqnarray}\] When, as is commonly done, the factors are modeled as additively separable, they are called “fixed effects”.1 Simply including the factors in the OLS regression yields a “dummy variable” fixed effects estimator. The fixed effects and dummy variable approaches are algebraically equivalent: same coefficients and same residuals. (See Hansen Econometrics, Theorem 17.1)

Code
library(fixest)
fe_reg1 <- feols(y~x|fo+fu, dat_f)
coef(fe_reg1)
##        x 
## 1.096426
fixef(fe_reg1)[1:2]
## $fo
##         0         1         2         3         4 
##  6.939604 10.735851 15.447870 24.641854 42.928758 
## 
## $fu
##         A         B 
##   0.00000 -24.40623

# Compare Coefficients
fe_reg0 <- lm(y~-1+x+fo+fu, dat_f)
coef( fe_reg0 )
##          x        fo0        fo1        fo2        fo3        fo4        fuB 
##   1.096426   6.939604  10.735851  15.447870  24.641854  42.928758 -24.406230

With fixed effects, we can also compute averages for each group: \(\hat{M}_{Yt}=\sum_{i}^{n_{t}} \hat{Y}_{it}/n_{t}\), where each period \(t\) has \(n_{t}\) observations denoted \(\hat{Y}_{it}\). We can then construct a between estimator: \(\hat{M}_{Yt} = b_{0} + \hat{M}_{Xt} b_{1}\). Or we can subtract the average from each group to construct a within estimator: \((\hat{Y}_{it} - \hat{M}_{Yt}) = (\hat{X}_{it}-\hat{M}_{Xt})b_{1}\).
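A sketch of the within estimator with a single factor, using the simulated `dat_f` from above: demeaning \(y\) and \(x\) by the `fu` groups reproduces the slope on \(x\) from the dummy-variable regression (the Frisch–Waugh–Lovell theorem guarantees this).

```r
# Within estimator: demean y and x by group, then regress through the origin
yw <- dat_f$y - ave(dat_f$y, dat_f$fu)  # ave() subtracts each group's mean
xw <- dat_f$x - ave(dat_f$x, dat_f$fu)

coef(lm(yw ~ -1 + xw))                 # within (demeaned) estimator
coef(lm(y ~ x + fu, data = dat_f))[2]  # dummy-variable estimator: same slope
```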

Comparing Models.

Recall that we can compare regression models using an F-Test.

Under some additional parametric assumptions about the data generating process, the F-statistic follows an \(F\) distribution. This case is well-studied historically, often under the title Analysis of Variance (ANOVA). An important case corresponds to the restricted model having no explanatory variables (i.e., our model is \(\hat{Y}_{i}=b_{0}\) and our predictions are \(\hat{y}_{i}=\hat{M}_{Y}\)).

Code
reg_full <- lm(Murder ~ Assault + UrbanPop + Rape, data = USArrests)
reg_none <- lm(Murder ~ 1, data=USArrests)
anova(reg_none, reg_full)
## Analysis of Variance Table
## 
## Model 1: Murder ~ 1
## Model 2: Murder ~ Assault + UrbanPop + Rape
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     49 929.55                                  
## 2     46 304.83  3    624.72 31.424 3.322e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Manual F-test
rss0 <- sum(resid(reg_none)^2) # restricted
rss1 <- sum(resid(reg_full)^2) # unrestricted
df0  <- df.residual(reg_none)
df1  <- df.residual(reg_full)
F    <- ((rss0 - rss1)/(df0-df1)) / (rss1/df1) # observed F stat
p    <- 1-pf(F, df0-df1, df1) # where F falls in the F-distribution
cbind(F, p)
##             F            p
## [1,] 31.42399 3.322431e-11

Whether you take a parametric or nonparametric approach to hypothesis testing, you can easily test whether variables are additively separable with an F test.

Code
# Empirical Example
reg1 <- lm(Murder~Assault+UrbanPop, USArrests)
reg2 <- lm(Murder~Assault*UrbanPop, USArrests)
anova(reg1, reg2)
## Analysis of Variance Table
## 
## Model 1: Murder ~ Assault + UrbanPop
## Model 2: Murder ~ Assault * UrbanPop
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     47 312.87                           
## 2     46 300.93  1    11.944 1.8258 0.1832
Code
## # Simulation Example
## N <- 1000
## x <- runif(N,3,8)
## e <- rnorm(N,0,0.4)
## fo <- factor(rbinom(N,4,.5), ordered=T)
## fu <- factor(rep(c('A','B'),N/2), ordered=F)
## dA <- 1*(fu=='A')
## y <- (2^as.integer(fo)*dA )*sqrt(x)+ 2*as.integer(fo)*e
## dat_f <- data.frame(y,x,fo,fu)
## 
## reg0 <- lm(y~-1+x+fo+fu, dat_f)
## reg1 <- lm(y~-1+x+fo*fu, dat_f)
## reg2 <- lm(y~-1+x*fo*fu, dat_f)
## 
## anova(reg0, reg2)
## anova(reg0, reg1, reg2)

  1. There are also random effects, where the factor variable is treated as drawn from a distribution that is uncorrelated with the regressors. This approach is rarely used in economics today, however, and is mostly included for historical reasons and for special cases where fixed effects cannot be estimated due to data limitations.↩︎