11 Comparing Two Groups

11.1 Discrete Data

Suppose our data is given in the table shown below.

	\(x=0\)	\(x=1\)
\(y=1\)	\(12\)	\(8\)
\(y=2\)	\(18\)	\(6\)
\(y=3\)	\(10\)	\(16\)

First, we can compute conditional probabilities to examine how the distribution changes.

	\(x=0\)	\(x=1\)
\(y=1\)	\(12/40\)	\(8/30\)
\(y=2\)	\(18/40\)	\(6/30\)
\(y=3\)	\(10/40\)	\(16/30\)
	\(40/40\)	\(30/30\)

Second, we can compute the conditional mean \(\hat{M}_{Y|X}\) by looking within each row. Specifically \[\begin{eqnarray*} \hat{M}_{y|x=0} = 1 \frac{12}{40} + 2 \frac{18}{40} + 3 \frac{10}{40} = \frac{12+36+30}{40} = \frac{78}{40} \\ \hat{M}_{y|x=1} = 1 \frac{8}{30} + 2 \frac{6}{30} + 3 \frac{16}{30} = \frac{8+12+48}{30} = \frac{68}{30} \end{eqnarray*}\] Notice that \(\hat{M}_{y|x=0} < 2 < \hat{M}_{y|x=1}\), which shows that the mean value of \(Y\) is higher when \(X\) is higher. I.e., there is a positive association.

Test Yourself

We can compute the mean \(\hat{M}_{Y}\) using the marginal distributions, as already shown in the last chapter. To compute the mean, \(\hat{M}_{Y}\), we first compute the joint distribution by dividing each cell with \(12+8+18+6+10+16 = 70\) so that all cells sum to 1. We then sum up proportions along each row and column to compute the marginals. \[\begin{eqnarray*} \hat{p}_{x=0} = (12+18+10)/70 = 40/70,\quad \hat{p}_{x=1} = (8+6+6)/70 = 30/70.\\ \hat{p}_{y=1} = (12+8)/70 = 20/70 ,\quad \hat{p}_{y=2} = (18+6)/70 = 24/70 ,\quad \hat{p}_{y=3} = (10+16)/70 = 26/70. \end{eqnarray*}\]

	\(x=0\)	\(x=1\)	Marginal y
\(y=1\)	\(12/70\)	\(8/70\)	\(20/70\)
\(y=2\)	\(18/70\)	\(6/70\)	\(24/70\)
\(y=3\)	\(10/70\)	\(16/70\)	\(26/70\)
Marginal x	\(40/70\)	\(30/70\)	\(70/70\)

Second, we compute the overall mean using the marginals \[\begin{eqnarray*} \hat{M}_{Y} = 1 \frac{20}{70} + 2 \frac{24}{70} + 3 \frac{26}{70} = \frac{20 + 48 + 78}{70} = {146/70} \end{eqnarray*}\]

We can also explicitly compute the conditional mean using joint and marginal probabilities in two additional steps:

Third, we compute conditional probabilities; for \(\hat{p}_{y \mid x}\), each column sums to 1. \[\begin{eqnarray*} \hat{p}_{y=1 \mid x=0} = \frac{12/70}{40/70} = \frac{12}{40}, \qquad \hat{p}_{y=1 \mid x=1} = \frac{8/70}{30/70} = \frac{8}{30}\\ \hat{p}_{y=2 \mid x=0} = \frac{18/70}{40/70} = \frac{18}{40}, \qquad \hat{p}_{y=2 \mid x=1} = \frac{6/70}{30/70} = \frac{6}{30}\\ \hat{p}_{y=3 \mid x=0} = \frac{10/70}{40/70} = \frac{10}{40}, \qquad \hat{p}_{y=3 \mid x=1} = \frac{16/70}{30/70} = \frac{16}{30} \end{eqnarray*}\]

Finally, we can compute conditional means: \[\begin{eqnarray*} \hat{M}_{y|x=0} = 1 \frac{12}{40} + 2 \frac{18}{40} + 3 \frac{10}{40} = \frac{12+36+30}{40} = \frac{78}{40} \\ \hat{M}_{y|x=1} = 1 \frac{8}{30} + 2 \frac{6}{30} + 3 \frac{16}{30} = \frac{8+12+48}{30} = \frac{68}{30} \end{eqnarray*}\]

Notice that \(\hat{M}_{y|x=0} < 2 < \hat{M}_{y|x=1}\), which shows that higher values of \(X\) are associated with a higher mean of \(Y\).

11.2 Mixed Data

For mixed data, \(\hat{Y}_{i}\) is a cardinal variable and \(\hat{X}_{i}\) is a factor variable (typically unordered). For such “grouped data”, we analyze associations via group comparisons. The basic idea is best seen in a comparison of two samples, which corresponds to an \(\hat{X}_{i}\) with two categories. For example, the heights of men and women in Canada or the homicide rates in two different American states.

We will consider the wages for people with and without completing a degree. To have this data on our computer, we must first install the “Ecdat” package once. Then we can access the data at any time by loading the package.

Code

# Install R Data Package and Load in
install.packages('Ecdat') # only once
library('Ecdat') # 'load' anytime you want to use the data
?Wages1 # details about the dataset we want
head(Wages1)

To compare wages for people with and without completing a degree, we start by making a histogram for both groups.

Code

library('Ecdat')
## Wages
Y <- Wages1[, 'wage']
## Group 1: Not Graduates
X1_id <- Wages1[, 'school'] == 12
Y1 <- Y[X1_id]
## Group 2: Graduates
X2_id <- Wages1[, 'school'] == 16
Y2 <- Y[X2_id]


# Initial Summary Figure
bks <- seq(0, 34, by=1)
dlim <- c(0, .2)
cols <- c(rgb(1, 0, 0, .5), rgb(0, 0, 1, .5))

hist(Y1, breaks=bks, ylim=dlim,
     col=cols[1], xlab='Wages',
     freq=F, border=NA, main=NA)
hist(Y2, breaks=bks, ylim=dlim,
     col=cols[2],
     freq=F, border=NA, add=T)
legend('topright',
       col=cols, pch=15,
       legend=c('12 Years (Highschool)', '16 Years (University)'),
       title='School Completed')

The histogram may show several differences between groups or none at all. Often, the first statistic we investigate for hypothesis testing is the mean.

Mean Differences.

We often want to know if the means of different samples are the same or different. To test this hypothesis, we compute the means separately for each sample and then examine the differences term \[\begin{eqnarray} \hat{D} = \hat{M}_{Y1} - \hat{M}_{Y2}, \end{eqnarray}\] with a null hypothesis that there is no difference in the population means. We can do this either by “inverting a Confidence Interval” or “imposing the null”.

Must Know

Here is an example of inverting a CI

Code

# Differences between means
m1 <- mean(Y1)
m2 <- mean(Y2)
d <- m1-m2
    
# Bootstrap Distribution
bootstrap_diff <- rep(NA, 9999)
for(b in seq(bootstrap_diff) ){
    Y1_b <- sample(Y1, replace=T)
    Y2_b <- sample(Y2, replace=T)
    m1_b <- mean(Y1_b)
    m2_b <- mean(Y2_b)
    d_b <- m1_b - m2_b
    bootstrap_diff[b] <- d_b
}
hist(bootstrap_diff,
    border=NA, freq=F,
    main=NA, xlab='Difference in Means')
title('Bootstrapped Difference in Means', font.main=1)

# 2-Sided Test via Confidence Interval
boot_ci <- quantile(bootstrap_diff, probs=c(.025, .975))
abline(v=boot_ci, lwd=2)
abline(v=0, lwd=2, col=rgb(1, 0, 0, .8))

For “imposing the null”, it is more common to use \(p\)-values instead of confidence intervals. We can compute \(p\)-values using a null-bootstrap or permutation distribution. If using a hard decision rule, it is most common to use

reject the null at the \(5\%\) level if \(p<0.05\)
fail to reject the null at the \(5\%\) level if \(p>0.05\)

Must Know

Code

# Null Bootstrap Distribution
permute_diff <- rep(NA, 9999)
for(b in seq(permute_diff) ){
    #randomize data
    Y_b  <- sample(Y, replace=F)
    Y1_b <- Y_b[X1_id]
    Y2_b <- Y_b[X2_id]
    # recompute statistic
    m1_b <- mean(Y1_b)
    m2_b <- mean(Y2_b)
    d_b <- m1_b - m2_b
    permute_diff[b] <- d_b
}
hist(permute_diff,
    border=NA, freq=F,
    main=NA, xlab='Difference in Means')
title('Permuted Difference in Means', font.main=1)
abline(v=d, col=rgb(1, 0, 0, .8))

Code


Fhat_abs0 <- ecdf(abs(permute_diff))
p <- 1 - Fhat_abs0( abs(d) )
p
## [1] 0.0240024

Those are purely statistical statements that only speak to how frequent differences are generated by random chance. They do not say why there are any differences or how large they are. Even on purely statistical grounds, however, we would want to be cautious about a making data-driven decisions when

the significance level is somewhat arbitrary (what if we reject at the \(5\%\) level but not \(10\%\) level?)
the test statistic is somewhat arbitrary (what if we reject for means but not medians?)
there are multiple differences

For such reasons, applied statisticians consider many statistics

Quantile Differences.

The above procedure generalized from differences in means to other quantiles statistics like medians. To start, we plot the ECDF’s for both groups.

Code

# Quantile Comparison

## Distribution 1
F1 <- ecdf(Y1)
plot(F1, col=cols[1],
     pch=16, xlab='Wages', xlim=c(0, 24),
     main=NA, bty='n')
title('Comparing Medians', font.main=1)
## Median 1
med1 <- quantile(F1, probs=0.5)
segments(med1, 0, med1, 0.5, col=cols[1], lty=2)
abline(h=0.5, lty=2)

## Distribution 2
F2 <- ecdf(Y2)
plot(F2, add=TRUE, col=cols[2], pch=16)
## Median 2
med2 <- quantile(F2, probs=0.5)
segments(med2, 0, med2, 0.5, col=cols[2], lty=2)

## Legend
legend('bottomright',
       col=cols, pch=15,
       legend=c('12 Years (Highschool)', '16 Years (University)'),
       title='School Completed')

Must Know

Code

# Bootstrap Distribution Function
boot_quant <- function(Y1, Y2, B=9999, probs=0.5, ...){
    bootstrap_diff <- rep(NA, B)
    for(b in seq(bootstrap_diff)){
        Y1_b <- sample(Y1, replace=T)
        Y2_b <- sample(Y2, replace=T)
        q1_b <- quantile(Y1_b, probs=probs, ...)
        q2_b <- quantile(Y2_b, probs=probs, ...)
        d_b <- q1_b - q2_b
        bootstrap_diff[b] <- d_b
    }
    return(bootstrap_diff)
}

# 2-Sided Test for Median Differences
# d <- quantile(Y2, probs=0.5) - quantile(Y1, probs=0.5)
boot_d <- boot_quant(Y1, Y2, B=999, probs=0.5)
hist(boot_d, border=NA, freq=F,
    main=NA, xlab='Difference in Medians')
title('Bootstrap Difference in Medians', font.main=1)
abline(v=quantile(boot_d, probs=c(.025, .975)), lwd=2)
abline(v=0, lwd=2, col=rgb(1, 0, 0, .8))

The above procedure is quite general and extends to other quantiles. Note that bootstrap tests can perform poorly with highly unequal variances or skewed data. To see this yourself, make a simulation with skewed data and unequal variances.

Other Statistics.

In principle, we can also examine whether there are differences in spread (sd or IQR) or shape (skew or kurtosis). We use the same hypothesis testing procedure as above

Test Yourself

Here is an example to look at differences in “spread”

Code

boot_fun <- function( fun, B=9999, ...){
    bootstrap_diff <- rep(NA, B)
    for(b in seq(bootstrap_diff)){
        Y1_b <- sample(Y1, replace=T)
        Y2_b <- sample(Y2, replace=T)
        f1_b <- fun(Y1_b, ...)
        f2_b <- fun(Y2_b, ...)
        d_b <- f1_b - f2_b
        bootstrap_diff[b] <- d_b
    }
    return(bootstrap_diff)
}

# 2-Sided Test for SD Differences
#d <- sd(Y2) - sd(Y1)
boot_d <- boot_fun(sd)
hist(boot_d, border=NA, font.main=1,
    main='Difference in Standard Deviations')
abline(v=quantile(boot_d, probs=c(.025, .975)), lwd=2)
abline(v=0, lwd=2, col=2)
1 - ecdf(boot_d)(0)


# Try any function!
# boot_fun( function(xs) { IQR(xs)/median(xs) } )

11.3 Distributional Comparisons

We can also examine whether there are any differences between the entire distributions. We typically start by plotting the data using ECDF’s or a boxplot, and then calculate a statistic for hypothesis testing. Which plot and test statistic depends on how many groups there are.

Two groups.

One useful visualization for two groups is to plot the quantiles against one another: a quantile-quantile plot. I.e., the first data point on the bottom left shows the first quantile for both distributions.

Code

# Wage Data (same as from before)
#library(Ecdat)
#Y1 <- sort( Wages1[Wages1[, 'school'] == 12,  'wage'])
#Y2 <- sort( Wages1[Wages1[, 'school'] == 16,  'wage'] )

# Compute Quantiles
quants <- seq(0, 1, length.out=101)
Q1 <- quantile(Y1, probs=quants)
Q2 <- quantile(Y2, probs=quants)

# Compare Distributions via Quantiles
#ry <- range(c(Y1, Y2))
#plot(ry, c(0, 1), type='n', font.main=1,
#    main='Distributional Comparison',
#    xlab='Quantile',
#    ylab='Probability')
#lines(Q1, quants, col=2)
#lines(Q2, quants, col=4)
#legend('bottomright', col=c(2, 4), lty=1,
#   legend=c(
#       expression(hat(F)[1]),
#       expression(hat(F)[2]) 
#))

# Compare Quantiles
ry <- range(c(Y1, Y2))
plot(Q1, Q2, xlim=ry, ylim=ry,
    xlab=expression(Q[1]),
    ylab=expression(Q[2]),
    main=NA,
    pch=16, col=grey(0, .25))
title('Quantile-Quantile Plot', font.main=1)
abline(a=0, b=1, lty=2)

The starting point for hypothesis testing is the Kolmogorov-Smirnov statistic: the maximum absolute difference between two CDF’s over all sample data \(y \in \{Y_1\} \cup \{Y_2\}\). \[\begin{eqnarray} \hat{KS} &=& \max_{y} |\hat{F}_{1}(y)- \hat{F}_{2}(y)|^{p}, \end{eqnarray}\] where \(p\) is an integer (typically 1). An intuitive alternative is the Cramer-von Mises statistic: the sum of absolute differences (raised to an integer, typically 2) between two CDF’s. \[\begin{eqnarray} \hat{CVM} &=& \sum_{y} | \hat{F}_{1}(y)- \hat{F}_{2}(y)|^{p}. \end{eqnarray}\]

Code

# Distributions
y <- sort(c(Y1, Y2))
F1 <- ecdf(Y1)(y)
F2 <- ecdf(Y2)(y)

library(twosamples)

# Kolmogorov-Smirnov
KSq <- which.max(abs(F2 - F1))
KSqv <- round(twosamples::ks_stat(Y1, Y2), 2)

# Cramer-von Mises Statistic (p=2)
CVMqv <- round(twosamples::cvm_stat(Y1, Y2, power=2), 2) 

# Visualize Differences
plot(range(y), c(0, 1), type='n', xlab='x', ylab='ECDF')
lines(y, F1, col=cols[1], lwd=2)
lines(y, F2, col=cols[2], lwd=2)

# KS
title( paste0('KS: ', KSqv), adj=0, font.main=1)
segments(y[KSq], F1[KSq], y[KSq], F2[KSq],
    lwd=3, col=grey(0, .75), lty=2)

# CVM
title( paste0('CVM: ', CVMqv), adj=1, font.main=1)
segments(y, F1, y, F2, lwd=.25, col=grey(0, .2))

Must Know

Suppose there are two groups of summer workers, with data on the earnings of each individual. In thousands of Canadian Dollars, specifically, Group \(A\) has earnings \(\{18,20,25,30\}\) and group \(B\) has earnings \(\{19,21,24,35\}\).

We can first compare the distributions by drawing a QQ plot. We sort both samples and plot Group \(A\) quantiles (x-axis) against Group \(B\) quantiles (y-axis): \((18,19),\ (20,21),\ (25,24),\ (30,35)\). The 45° reference line indicates identical distributions: point 1 lies above the line (18,18) and so do points 2 and 4.
We can then compute the CVM statistic for further statistical testing. Start by drawing the ECDF’s. Then compute \((\hat{F}_A(x) - \hat{F}_B(x))^2\) at each of the 8 pooled sorted values:

\(x\)	\(\hat{F}_A(x)\)	\(\hat{F}_B(x)\)	\((\hat{F}_A-\hat{F}_B)^2\)
18	1/4	0	1/16
19	1/4	1/4	0
20	2/4	1/4	1/16
21	2/4	2/4	0
24	2/4	3/4	1/16
25	3/4	3/4	0
30	4/4	3/4	1/16
35	4/4	4/4	0

\(CVM = 1/16 + 1/16 + 1/16 + 1/16 = 1/4\).

Just as before, you use bootstrapping for hypothesis testing.

Code

twosamples::ks_test(Y1, Y2)
## Test Stat   P-Value 
## 0.4181397 0.0055000

twosamples::cvm_test(Y1, Y2)
## Test Stat   P-Value 
##  76.16368   0.00250

Test Yourself

Compare the distribution of arrests for two different counties, each with data over time.

Code

library(wooldridge)
countymurders

These methods extend naturally from two groups to many. See Multiple Groups for the ANOVA approach to comparing \(G\) groups simultaneously.

11.4 Exercises

A permutation test shuffles the group labels to build a null distribution, while a bootstrap test resamples with replacement from each group separately. Explain why permutation is appropriate for testing “no difference between groups” and when you might prefer the bootstrap instead.
Using the Wages1 data from the Ecdat package, compute the difference in mean wages \(\hat{D} = \hat{M}_{Y1} - \hat{M}_{Y2}\) between workers with \(12\) years of schooling and workers with \(16\) years of schooling. Then compute the Kolmogorov-Smirnov statistic \(\hat{KS}\) between the two wage distributions by hand (i.e., find the maximum absolute difference between the two ECDF’s evaluated at the pooled sorted data).
Using the mtcars dataset, split cars into two groups based on transmission type (am: 0 = automatic, 1 = manual). Write a permutation test with \(B = 9999\) iterations to test whether the difference in mean mpg between automatic and manual cars is statistically different from zero. Report the \(p\)-value.

11.1 Discrete Data

11.2 Mixed Data

Mean Differences.

Quantile Differences.

Other Statistics.

11.3 Distributional Comparisons

Two groups.

11.4 Exercises

Further Reading.