11  Conditional Relationships


11.1 Conditional Distributions

Discrete Data.

In contrast to marginal and joint distributions, conditional distributions describe how the distribution of one variable changes when we restrict attention to a subgroup defined by another variable. Formally, if \(\hat{p}_{xy}\) denotes the empirical joint distribution, then the empirical conditional distribution of \(Y_{i}\) given \(X_{i}=x\) is \[\begin{eqnarray} \hat{p}_{y \mid x} = \frac{\hat{p}_{xy}}{\hat{p}_x}, \end{eqnarray}\] where \(\hat{p}_{x} = \sum_y \hat{p}_{xy}\).

For example, suppose we observe a sample of \(n=13\) students with two discrete variables:

  • \(X_{i}\) denotes years of education, taking values in \(\{12, 14\}\)
  • \(Y_{i}\) denotes sex, where \(y=1\) means female and \(y=0\) means male

Assume the data are summarized by the following table of relative frequencies (the empirical joint distribution):

\[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female}) & \text{Row total}\\ \hline x=12 & 4/13 & 3/13 & 7/13\\ x=14 & 1/13 & 5/13 & 6/13\\ \hline \text{Column total} & 5/13 & 8/13 & 1 \end{array}\]

We will compute the conditional distribution of \(Y_{i}\) given \(X_{i}=x\). For \(x=12\), we compute \[\begin{eqnarray} \hat{p}_{y=0 \mid x=12} &=& \frac{\hat{p}_{x=12, y=0}}{\hat{p}_{x=12}} = \frac{4/13}{7/13} = \frac{4}{7} \approx 0.57. \\ \hat{p}_{y=1\mid x=12} &=& \frac{\hat p_{x=12, y=1}}{\hat{p}_{x=12}} = \frac{3/13}{7/13} = \frac{3}{7} \approx 0.43. \end{eqnarray}\] Similarly, for \(x=14\), we compute \[\begin{eqnarray} \hat{p}_{y=0 \mid x=14} &=& \frac{\hat{p}_{x=14, y=0}}{\hat{p}_{x=14}} = \frac{1/13}{6/13} = \frac{1}{6} \approx 0.17. \\ \hat{p}_{y=1\mid x=14} &=& \frac{\hat p_{x=14, y=1}}{\hat{p}_{x=14}} = \frac{5/13}{6/13} = \frac{5}{6} \approx 0.83. \end{eqnarray}\]

In this example, we say that

  • conditional on students having \(12\) years of education, \(\approx 57\%\) are male and \(\approx 43\%\) are female.
  • conditional on students having \(14\) years of education, \(\approx 17\%\) are male and \(\approx 83\%\) are female.

We can also compute the conditional distribution of \(X_{i}\) given \(Y_{i}=y\), just as we did above.

Continuing the example above, show that

  • among male students, \(\approx 80\%\) have 12 years of education.
  • among female students, \(\approx 38\%\) have 12 years of education.

Try programming the results yourself, especially if you are stuck or uncertain.

Code
# Counts implied by the table:
# X=12: 4 male, 3 female
# X=14: 1 male, 5 female
X <- c(rep(12,7), rep(14,6))
Y <- c(rep(0,4), rep(1,3),
       rep(0,1), rep(1,5))

dat <- data.frame(educ=X, female=Y)

tab <- table(dat)
tab
##     female
## educ 0 1
##   12 4 3
##   14 1 5

# Joint distribution
round(prop.table(tab), 3)
##     female
## educ     0     1
##   12 0.308 0.231
##   14 0.077 0.385

# Conditional distribution of Y given X
round(prop.table(tab, 1), 3)
##     female
## educ     0     1
##   12 0.571 0.429
##   14 0.167 0.833

# Conditional distribution of X given Y
round(prop.table(tab, 2), 3)
##     female
## educ     0     1
##   12 0.800 0.375
##   14 0.200 0.625

Conditional distributions change the unit of analysis:

  • \(\hat{p}_{y=1 \mid x=14}\) answers “what fraction of students are female within the subgroup with 14 years of education?”
  • \(\hat{p}_{x=14 \mid y=1}\) answers “what fraction of students have 14 years of education within the subgroup that is female?”

Simpson’s Paradox.

Frequency tables can be tricky, especially when using percents, so take your time making and interpreting them. Remember that a ratio can change due to changes in either the numerator or the denominator.

Consider the following example, in which \(400\) men and \(400\) women applied to a university and only a share of applicants were accepted in each department (English or Engineering). If you compute percents for each sex within each cell, you will find that men have a higher overall acceptance rate, but women have higher acceptance rates within each department! This counterintuitive phenomenon is known as Simpson’s Paradox, and the example is part of a real debate about discrimination (http://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf).

Even though women have higher admission rates within both departments, women can still have a lower overall admission rate if women disproportionately apply to the more selective department (English) and men disproportionately apply to the less selective department (Engineering). The same issue is relevant for a variety of labor market questions, such as the gender pay gap, and a great many other social issues, such as why some countries are rich and others are poor.

School Applicants (Admitted, Acceptance Rate), by Sex and Department

\[\begin{array}{l|ccc} \text{Department} & \text{Men} & \text{Women} & \text{Total}\\ \hline \text{English} & 100\ (40,\ 40\%) & 350\ (150,\ 43\%) & 450\ (190,\ 42\%)\\ \text{Engineering} & 300\ (160,\ 53\%) & 50\ (30,\ 60\%) & 350\ (190,\ 54\%)\\ \hline \text{Total} & 400\ (200,\ 50\%) & 400\ (180,\ 45\%) & 800\ (380,\ 48\%) \end{array}\]
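
A minimal sketch to verify the paradox numerically: the counts below are copied from the table above, and the acceptance rates are recomputed within departments and overall.

Code
# Reconstruct the admissions table
adm <- data.frame(
    dept     = c('English','English','Engineering','Engineering'),
    sex      = c('Men','Women','Men','Women'),
    applied  = c(100, 350, 300, 50),
    admitted = c(40, 150, 160, 30))

# Acceptance rates within each department (women higher in both)
adm$rate <- adm$admitted / adm$applied
adm

# Overall acceptance rates by sex (men higher overall)
tot <- aggregate(cbind(applied, admitted) ~ sex, data=adm, FUN=sum)
tot$rate <- tot$admitted / tot$applied
tot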

Continuous Data.

Conditional distributions also describe relationships between continuous variables: we show how the distribution or density of \(\hat{Y}_{i}\) changes with \(\hat{X}_{i}\). When \(X\) is continuous, as it often is, we split it into distinct bins and convert it into a factor variable. E.g.,

Code
# Split Data by Urban Population above/below mean
pop_mean <- mean(USArrests[,'UrbanPop'])
pop_cut <- USArrests[,'UrbanPop']< pop_mean
murder_lowpop <- USArrests[pop_cut,'Murder']
murder_highpop <- USArrests[!pop_cut,'Murder']
cols <- c(low=rgb(0,0,1,.75), high=rgb(1,0,0,.75))

# Common Histogram 
ylim <- c(0,.25)
xbks <-  seq(min(USArrests[,'Murder'])-1, max(USArrests[,'Murder'])+1, by=1)

par(mfrow=c(1,2))
hist(murder_lowpop,
    breaks=xbks, col=cols[1],
    main='Urban Pop < Mean', font.main=1,
    xlab='Murder Arrests', freq=F,
    border=NA, ylim=ylim)

hist(murder_highpop,
    breaks=xbks, col=cols[2],
    main='Urban Pop >= Mean', font.main=1,
    xlab='Murder Arrests', freq=F,
    border=NA, ylim=ylim)

It is sometimes preferable to show the ECDF instead. You can also glue various combinations together to convey more information all at once.

Code
layout( t(c(1,2,2)))
# Full Sample Density
hist(USArrests[,'Murder'], 
    main='Full Sample Density', font.main=1,
    xlab='Murder Arrests',
    breaks=xbks, freq=F, border=NA)

# Split Sample Distribution Comparison
F_lowpop <- ecdf(murder_lowpop)
plot(F_lowpop, col=cols[1],
    pch=16, xlab='Murder Arrests',
    main='Split Sample Distributions',
    font.main=1, bty='n')
F_highpop <- ecdf(murder_highpop)
plot(F_highpop, add=T, col=cols[2], pch=16)

legend('bottomright', col=cols,
    pch=16, bty='n', inset=c(0,.1),
    title='% Urban Pop.',
    legend=c('Low (< Mean)','High (>= Mean)'))

Code
# Simple Scatter Plot (point size scaled by murder arrests)
# plot(Assault~UrbanPop, USArrests, col=grey(0,.5), pch=16,
#    cex=USArrests[,'Murder']/diff(range(USArrests[,'Murder']))*2,
#    main='US Murder arrests (per 100,000)')

You can also split data into more than two groups. For more than three groups, boxplots are often more effective than histograms or ECDF’s.

Code
# K Groups with even spacing (not even counts)
K <- 4
USArrests[,'UrbanPop_Kcut'] <- cut(USArrests[,'UrbanPop'],K)
table(USArrests[,'UrbanPop_Kcut'] )
## 
## (31.9,46.8] (46.8,61.5] (61.5,76.2] (76.2,91.1] 
##           6          13          17          14

# Full sample
#boxplot(USArrests[,'Murder'], main='',
#    xlab='All Data', ylab='Murder Arrests')

# Boxplots for each group
Kcols <- hcl.colors(K,alpha=.5)
boxplot(Murder~UrbanPop_Kcut, USArrests,
    main='', col=Kcols, 
    varwidth=T, # box widths scale with sqrt of group size
    xlab='Urban Population', ylab='')

Code

# 4 Groups with equal numbers of observations
#Qcuts <- c(
#    '0%'=min(USArrests[,'UrbanPop'])-10*.Machine[['double.eps']],
#    quantile(USArrests[,'UrbanPop'], probs=c(.25,.5,.75,1)))
#USArrests[,'UrbanPop_cut'] <- cut(USArrests[,'UrbanPop'], Qcuts)
#boxplot(Murder~UrbanPop_cut, USArrests, col=hcl.colors(4,alpha=.5))

11.2 Group Differences

For mixed data, \(\hat{Y}_{i}\) is a cardinal variable and \(\hat{X}_{i}\) is a factor variable (typically unordered). For such “grouped data”, we analyze associations via group comparisons. The basic idea is best seen in a comparison of two samples, which corresponds to an \(\hat{X}_{i}\) with two categories: for example, the heights of men and women in Canada, the homicide rates in two different American states, or the wages of people who have and have not completed a degree.

Code
library(wooldridge)
## Group 1
X1_id <- wage1[,'educ'] == 15
Y1 <- wage1[X1_id, 'wage']
## Group 2
X2_id <- wage1[,'educ'] == 16
Y2 <- wage1[X2_id, 'wage']

# Initial Summary Figure
bks <- seq(0, 24, by=1.5)
dlim <- c(0,.2)
cols <- c(rgb(1,0,0,.5), rgb(0,0,1,.5))

hist(Y1, breaks=bks, ylim=dlim,
     col=cols[1], xlab='Wages',
     freq=F, border=NA, main='')
hist(Y2, breaks=bks, ylim=dlim,
     col=cols[2],
     freq=F, border=NA, add=T)
legend('topright',
       col=cols, pch=15,
       legend=c('15 Years', '16 Years'),
       title='School Completed')

There may be several differences between groups. Often, the first statistic we investigate is the difference in means.

Mean Differences.

We often want to know whether the means of different samples are different. To test this hypothesis, we compute the mean separately for each sample and then examine the difference \[\begin{eqnarray} \hat{D} = \hat{M}_{Y1} - \hat{M}_{Y2}, \end{eqnarray}\] with a null hypothesis that there is no difference in the population means.

Code
# Differences between means
m1 <- mean(Y1)
m2 <- mean(Y2)
d <- m1-m2
    
# Bootstrap Distribution
bootstrap_diff <- vector(length=9999)
for(b in seq(bootstrap_diff) ){
    Y1_b <- sample(Y1, replace=T)
    Y2_b <- sample(Y2, replace=T)
    m1_b <- mean(Y1_b)
    m2_b <- mean(Y2_b)
    d_b <- m1_b - m2_b
    bootstrap_diff[b] <- d_b
}
hist(bootstrap_diff,
    border=NA, font.main=1,
    main='Difference in Means')

# 2-Sided Test via Confidence Interval
boot_ci <- quantile(bootstrap_diff, probs=c(.025, .975))
abline(v=boot_ci, lwd=2)
abline(v=0, lwd=2, col=2)

Just as with one-sample tests, we can compute a standardized difference, where \(\hat{D}\) is converted into a \(t\) statistic. Note, however, that we have to compute the standard error for the difference statistic, which is a bit more complicated. This then allows us to easily conduct one- or two-sided hypothesis tests using a standard normal approximation.

Code
# Standard error of the difference (unpooled variances)
n1 <- length(Y1)
n2 <- length(Y2)
se_hat <- sqrt(var(Y1)/n1 + var(Y2)/n2)
t_obs <- d/se_hat
t_obs

# Same idea as a reusable function, with a pooled variance estimate
t_2sample <- function(Y1, Y2){
    # Differences between means
    m1 <- mean(Y1)
    m2 <- mean(Y2)
    d <- (m1-m2)

    # SE estimate (pooled variance)
    n1  <- length(Y1)
    n2  <- length(Y2)
    s1  <- var(Y1)
    s2  <- var(Y2)
    s   <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
    d_se <- sqrt(s*(1/n1+1/n2))

    # t stat
    t_stat <- d/d_se
    return(t_stat)
}

tstat <- t_2sample(Y1, Y2)
tstat
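
For comparison, base R’s t.test implements the same kind of test: by default it uses the unpooled (Welch) standard error, and setting var.equal=TRUE uses the pooled variance as in the function above.

Code
# Built-in two-sample t tests (for comparison with the manual statistics)
t.test(Y1, Y2)                    # Welch (unpooled variances)
t.test(Y1, Y2, var.equal=TRUE)    # pooled variance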

Quantile Differences.

The above procedure generalizes from differences in means to other quantile statistics, like medians.

Code
# Quantile Comparison

## Distribution 1
F1 <- ecdf(Y1)
plot(F1, col=cols[1],
     pch=16, xlab='Wages',
     main='Comparing Medians',
     font.main=1, bty='n')
## Median 1
med1 <- quantile(F1, probs=0.5)
segments(med1, 0, med1, 0.5, col=cols[1], lty=2)
abline(h=0.5, lty=2)

## Distribution 2
F2 <- ecdf(Y2)
plot(F2, add=TRUE, col=cols[2], pch=16)
## Median 2
med2 <- quantile(F2, probs=0.5)
segments(med2, 0, med2, 0.5, col=cols[2], lty=2)

## Legend
legend('bottomright',
       col=cols, pch=15,
       legend=c('Grade 15', 'Grade 16'),
       title='School Completed')

Code
# Bootstrap Distribution Function
boot_quant <- function(Y1, Y2, B=9999, prob=0.5, ...){
    bootstrap_diff <- vector(length=B)
    for(b in seq(bootstrap_diff)){
        Y1_b <- sample(Y1, replace=T)
        Y2_b <- sample(Y2, replace=T)
        q1_b <- quantile(Y1_b, probs=prob, ...)
        q2_b <- quantile(Y2_b, probs=prob, ...)
        d_b <- q1_b - q2_b
        bootstrap_diff[b] <- d_b
    }
    return(bootstrap_diff)
}

# 2-Sided Test for Median Differences
# d <- median(Y1) - median(Y2)
boot_d <- boot_quant(Y1, Y2, B=999, prob=0.5)
hist(boot_d, border=NA, font.main=1,
    main='Difference in Medians')
abline(v=quantile(boot_d, probs=c(.025, .975)), lwd=2)
abline(v=0, lwd=2, col=2)

Code
1 - ecdf(boot_d)(0)

If we want to test for differences in medians across groups with independent observations, we can also use notches in a boxplot. If the notches of two boxes do not overlap, then there is rough evidence that the difference in medians is statistically significant. Each box’s width is also scaled in proportion to the square root of its sample size.1

Code
boxplot(Y1, Y2,
    col=cols,
    notch=T,
    varwidth=T)
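
You can also compute the notch interval directly from the footnote’s formula. Here is a minimal sketch for the two wage samples, using the approximation \(\tilde{M}_{g} \pm 1.7 \tilde{S}_{g}\):

Code
# Notch interval: M +/- 1.7 * (1.25*IQR)/(1.35*sqrt(n))
notch_interval <- function(Y){
    M <- median(Y)
    S <- 1.25*IQR(Y)/(1.35*sqrt(length(Y)))
    c(lower=M - 1.7*S, median=M, upper=M + 1.7*S)
}
rbind('15 Years'=notch_interval(Y1),
      '16 Years'=notch_interval(Y2))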

Note that bootstrap tests can perform poorly with highly unequal variances or skewed data. To see this yourself, make a simulation with skewed data and unequal variances.
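
Here is a minimal sketch of such a simulation, assuming lognormal samples with different sizes and spreads: under a true null of equal means, a well-calibrated two-sided test at the 5% level should reject about 5% of the time.

Code
# Simulated rejection rate of the bootstrap CI test under the null
set.seed(1)
reject <- replicate(200, {
    Y1_s <- rlnorm(20,  meanlog=0, sdlog=1.5) - exp(1.5^2/2)  # small, very skewed
    Y2_s <- rlnorm(200, meanlog=0, sdlog=0.5) - exp(0.5^2/2)  # larger, less skewed
    # both samples have population mean 0, so the null is true
    boot_d <- replicate(299, mean(sample(Y1_s, replace=TRUE)) -
                             mean(sample(Y2_s, replace=TRUE)))
    ci <- quantile(boot_d, probs=c(.025, .975))
    ci[1] > 0 | ci[2] < 0   # does the 95% interval exclude zero?
})
mean(reject)  # compare with the nominal 5% level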

In principle, we can also examine whether there are differences in spread or shape statistics such as sd and IQR, or skew and kurtosis. More often, however, we examine whether there are any differences in the distributions.

Here is an example that looks at differences in “spread”:

Code
boot_fun <- function(Y1, Y2, fun, B=9999, ...){
    bootstrap_diff <- vector(length=B)
    for(b in seq(bootstrap_diff)){
        Y1_b <- sample(Y1, replace=T)
        Y2_b <- sample(Y2, replace=T)
        f1_b <- fun(Y1_b, ...)
        f2_b <- fun(Y2_b, ...)
        d_b <- f1_b - f2_b
        bootstrap_diff[b] <- d_b
    }
    return(bootstrap_diff)
}

# 2-Sided Test for SD Differences
#d <- sd(Y1) - sd(Y2)
boot_d <- boot_fun(Y1, Y2, sd)
hist(boot_d, border=NA, font.main=1,
    main='Difference in Standard Deviations')
abline(v=quantile(boot_d, probs=c(.025, .975)), lwd=2)
abline(v=0, lwd=2, col=2)
1 - ecdf(boot_d)(0)


# Try any function!
# boot_fun(Y1, Y2, function(xs){ IQR(xs)/median(xs) })

11.3 Distributional Comparisons

We can also examine whether there are any differences between the entire distributions. We typically start by plotting the data using ECDF’s or a boxplot, and then calculate a statistic for hypothesis testing. Which plot and test statistic depends on how many groups there are.

Two groups.

One useful visualization for two groups is to plot the quantiles of one distribution against the quantiles of the other: a quantile-quantile plot. E.g., the point at the bottom left shows the lowest computed quantile for both distributions.

Code
# Wage Data (same as from before)
#library(wooldridge)
#Y1 <- sort( wage1[wage1[,'educ'] == 15,  'wage'])  
#Y2 <- sort( wage1[wage1[,'educ'] == 16,  'wage'] )

# Compute Quantiles
quants <- seq(0,1,length.out=101)
Q1 <- quantile(Y1, probs=quants)
Q2 <- quantile(Y2, probs=quants)

# Compare Distributions via Quantiles
#ry <- range(c(Y1, Y2))
#plot(ry, c(0,1), type='n', font.main=1,
#    main='Distributional Comparison',
#    xlab="Quantile",
#    ylab="Probability")
#lines(Q1, quants, col=2)
#lines(Q2, quants, col=4)
#legend('bottomright', col=c(2,4), lty=1,
#   legend=c(
#       expression(hat(F)[1]),
#       expression(hat(F)[2]) 
#))

# Compare Quantiles
ry <- range(c(Y1, Y2))
plot(Q1, Q2, xlim=ry, ylim=ry,
    xlab=expression(Q[1]),
    ylab=expression(Q[2]),
    main='Quantile-Quantile Plot',
    font.main=1,
    pch=16, col=grey(0,.25))
abline(a=0,b=1,lty=2)

The starting point for hypothesis testing is the Kolmogorov-Smirnov Statistic: the maximum absolute difference between two CDF’s over all sample data \(y \in \{Y_1\} \cup \{Y_2\}\). \[\begin{eqnarray} \hat{KS} &=& \max_{y} |\hat{F}_{1}(y)- \hat{F}_{2}(y)|^{p}, \end{eqnarray}\] where \(p\) is an integer (typically 1). An intuitive alternative is the Cramer-von Mises Statistic: the sum of absolute differences (raised to an integer, typically 2) between two CDF’s. \[\begin{eqnarray} \hat{CVM} &=& \sum_{y} | \hat{F}_{1}(y)- \hat{F}_{2}(y)|^{p}. \end{eqnarray}\]

Code
# Distributions
y <- sort(c(Y1, Y2))
F1 <- ecdf(Y1)(y)
F2 <- ecdf(Y2)(y)

library(twosamples)

# Kolmogorov-Smirnov
KSq <- which.max(abs(F2 - F1))
KSqv <- round(twosamples::ks_stat(Y1, Y2),2)

# Cramer-von Mises Statistic (p=2)
CVMqv <- round(twosamples::cvm_stat(Y1, Y2, power=2), 2) 

# Visualize Differences
plot(range(y), c(0,1), type="n", xlab='x', ylab='ECDF')
lines(y, F1, col=cols[1], lwd=2)
lines(y, F2, col=cols[2], lwd=2)

# KS
title( paste0('KS: ', KSqv), adj=0, font.main=1)
segments(y[KSq], F1[KSq], y[KSq], F2[KSq], lwd=1.5, col=grey(0,.75), lty=2)

# CVM
title( paste0('CVM: ', CVMqv), adj=1, font.main=1)
segments(y, F1, y, F2, lwd=.5, col=grey(0,.2))
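
For intuition, both statistics can also be computed by hand from the ECDF values constructed above; these should be close to the package values (the packaged versions may weight duplicated sample points slightly differently).

Code
# Hand-computed versions of the formulas above
max(abs(F1 - F2))      # Kolmogorov-Smirnov (p=1)
sum(abs(F1 - F2)^2)    # Cramer-von Mises type (p=2)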

Just as before, you can use bootstrapping for hypothesis testing.

Code
twosamples::ks_test(Y1, Y2)
## Test Stat   P-Value 
## 0.2892157 0.0970000

twosamples::cvm_test(Y1, Y2)
## Test Stat   P-Value 
##  2.084253  0.089000

Compare the distribution of arrests for two different counties, each with data over time.

Code
library(wooldridge)
countymurders

Multiple groups.

With multiple groups, you will want to begin with a summary figure (such as a boxplot). We can also test the equality of all distributions (i.e., whether at least one group is different). The Kruskal-Wallis test examines \(H_0:\; F_1 = F_2 = \dots = F_G\) versus \(H_A:\; \text{at least one } F_g \text{ differs}\), where \(F_g\) is the continuous distribution of group \(g=1,...,G\). This test does not tell us which group is different.

To conduct the test, first denote individuals \(i=1,...,n\) with overall ranks \(\hat{r}_{1},...,\hat{r}_{n}\). Each individual belongs to group \(g=1,...,G\), and each group \(g\) has \(n_{g}\) individuals with average rank \(\bar{r}_{g} = \sum_{i \in g} \hat{r}_{i} /n_{g}\). The Kruskal-Wallis statistic is \[\begin{eqnarray} \hat{KW} &=& (n-1) \frac{\sum_{g=1}^{G} n_{g}( \bar{r}_{g} - \bar{r} )^2 }{\sum_{i=1}^{n} ( \hat{r}_{i} - \bar{r} )^2}, \end{eqnarray}\] where \(\bar{r} = \frac{n+1}{2}\) is the grand mean rank.

In the special case with only two groups, \(G=2\), the Kruskal-Wallis test reduces to the Mann–Whitney U test (also known as the Wilcoxon rank-sum test). In this case, we can write the hypotheses in terms of individual outcomes drawn from each group, \(Y_i\) from one group and \(Y_j\) from the other: \(H_0: Prob(Y_i > Y_j)=Prob(Y_j > Y_i)\) versus \(H_A: Prob(Y_i > Y_j) \neq Prob(Y_j > Y_i)\). The corresponding test statistic is \[\begin{eqnarray} \hat{U} &=& \min(\hat{U}_1, \hat{U}_2) \\ \hat{U}_g &=& \sum_{i\in g}\sum_{j\in -g} \Bigl[\mathbf 1( \hat{Y}_{i} > \hat{Y}_{j}) + \tfrac12\mathbf 1(\hat{Y}_{i} = \hat{Y}_{j})\Bigr]. \end{eqnarray}\]
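
For the two wage samples from earlier, base R’s wilcox.test carries out this rank-sum test directly (a quick check; with ties it uses a normal approximation by default).

Code
# Mann-Whitney / Wilcoxon rank-sum test for the two wage groups
wilcox.test(Y1, Y2)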

Code
library(AER)
data(CASchools)
CASchools[,'stratio'] <- CASchools[,'students']/CASchools[,'teachers']

# Do student/teacher ratio differ for at least 1 county?
# Single test of multiple distributions
kruskal.test(CASchools[,'stratio'], CASchools[,'county'])
## 
##  Kruskal-Wallis rank sum test
## 
## data:  CASchools[, "stratio"] and CASchools[, "county"]
## Kruskal-Wallis chi-squared = 161.18, df = 44, p-value = 2.831e-15

# Multiple pairwise tests
# pairwise.wilcox.test(CASchools[,'stratio'], CASchools[,'county'])
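
To connect the rank formula to this output, here is a sketch that computes the Kruskal-Wallis statistic directly from the midranks; it should match the chi-squared statistic reported by kruskal.test above.

Code
# Manual Kruskal-Wallis statistic from the rank formula
y <- CASchools[,'stratio']
g <- CASchools[,'county']
r <- rank(y)                    # overall (mid)ranks
rbar   <- mean(r)               # grand mean rank, (n+1)/2
rbar_g <- tapply(r, g, mean)    # average rank within each group
n_g    <- tapply(r, g, length)  # group sizes
KW <- (length(y)-1) * sum(n_g*(rbar_g - rbar)^2) / sum((r - rbar)^2)
KW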

11.4 Further Reading

Other Statistics


  1. Let each group \(g\) have median \(\tilde{M}_{g}\), interquartile range \(\hat{IQR}_{g}\), and \(n_{g}\) observations. We can approximate the standard error of the median as \(\tilde{S}_{g}= \frac{1.25 \hat{IQR}_{g}}{1.35 \sqrt{n_{g}}}\). As a rough guide, the interval \(\tilde{M}_{g} \pm 1.7 \tilde{S}_{g}\) is the historical default and is displayed as a notch in the boxplot. See also https://www.tandfonline.com/doi/abs/10.1080/00031305.1978.10479236.↩︎