Code
# Bivariate Data from USArrests
xy <- USArrests[, c('Murder', 'UrbanPop')]
xy[1, ]
## Murder UrbanPop
## Alabama 13.2 58We will now study two variables. The data for each observation data can be grouped together as a vector \((\hat{X}_{i}, \hat{Y}_{i})\).
# Bivariate Data from USArrests
xy <- USArrests[, c('Murder', 'UrbanPop')]
xy[1, ]
## Murder UrbanPop
## Alabama 13.2 58We have multiple observations of \((\hat{X}_{i}, \hat{Y}_{i})\), each of which corresponds to a unique value \((x,y)\). A frequency table shows how often each combination of values appear. The joint distribution counts up the number of pairs with the same values and divides by the number of observations, \(n\); \[\begin{eqnarray} \hat{p}_{xy} = \sum_{i=1}^{n}\mathbf{1}\left( \hat{X}_{i}=x, \hat{Y}_{i}=y \right)/n, \end{eqnarray}\] The marginal distributions are just the univariate information, which can be computed independently (see here ) and also from the joint distribution \[\begin{eqnarray} \hat{p}_{x} = \sum_{y} \hat{p}_{xy} \end{eqnarray}\]
For example, suppose we observe a sample of \(n=13\) students with two discrete variables:
Assume the count data are summarized by the following table:
\[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female})\\ \hline x=12 & 4 & 3 \\ x=14 & 1 & 5 \\ \hline \end{array}\] The joint distributions divides each cell count by the total number of observations to obtain \[\begin{array}{c|cc} & y=0 & y=1\\ \hline x=12 & 4/13 & 3/13 \\ x=14 & 1/13 & 5/13 \end{array}\]The marginal distribution of \(X_{i}\) is \[\begin{eqnarray} \hat{p}_{x=12} &=& \frac{4}{13} + \frac{3}{13} = \frac{7}{13},\\ \hat{p}_{x=14} &=& \frac{1}{13} + \frac{5}{13} = \frac{6}{13}. \end{eqnarray}\] The marginal distribution of \(Y_{i}\) is \[\begin{eqnarray} \hat{p}_{y=0} &=& \frac{4}{13} + \frac{1}{13} = \frac{5}{13} \approx 0.38,\\ \hat{p}_{y=1} &=& \frac{3}{13} + \frac{5}{13} = \frac{8}{13} \approx 0.62. \end{eqnarray}\]
Together, this yields a frequency table with marginals \[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female}) & \text{Row total}\\ \hline x=12 & 4/13 & 3/13 & 7/13\\ x=14 & 1/13 & 5/13 & 6/13\\ \hline \text{Column total} & 5/13 & 8/13 & 1 \end{array}\]# Simple hand-built dataset: 13 students
dat <- rbind(
c(12, 0), c(12, 0), c(12, 0), c(12, 0), #Male
c(14, 0),
c(12, 1), c(12, 1), c(12, 1),
c(14, 1), c(14, 1), c(14, 1), c(14, 1), c(14, 1))
colnames(dat) <- c('educ', 'female')
dat <- as.data.frame(dat)
# Frequency table
tab <- table(dat) # counts
prop <- tab/sum(tab) # joint distribution p_xy
# joint distribution with marginals added
prop_m <- addmargins(prop)
round(prop_m, 2)
## female
## educ 0 1 Sum
## 12 0.31 0.23 0.54
## 14 0.08 0.38 0.46
## Sum 0.38 0.62 1.00Note that you can compute various univariate statistics using the marginal distributions. E.g., the mean of \(\hat{X}_{i}\) is \(\hat{M}_{X}=\sum_{x} x \hat{p}_{x}\). In the the above example, that equals \(12 \frac{7}{13} + 14\frac{6}{13} \approx 12.9\). Try computing the standard deviation \(\hat{S}_{X}\) yourself.
# marginals and univariate statistics
p_x <- margin.table(prop, 1)
sum(p_x* as.numeric(names(p_x)) )
## [1] 12.92308
mean(dat[, 1])
## [1] 12.92308
# p_y <- margin.table(prop, 2)Scatterplots are used frequently to summarizes the joint distribution of continuous data. They can be enhanced in several ways. As a default, use semi-transparent points so as not to hide any points (and perhaps see if your observations are concentrated anywhere). You can also add other features that help summarize the relationship, although I will defer this until later.
plot(Murder ~ UrbanPop, USArrests, pch=16, col=grey(0, .5),
main=NA, xlab='Urban Population', ylab='Murder Arrests')
You can also show the marginal distributions of each variable along each axis (see this margin and layout cheatsheet for how layout(), mar, and oma work together).
# Setup Plot
layout( matrix(c(2, 0, 1, 3), ncol=2, byrow=TRUE),
widths=c(9/10, 1/10), heights=c(1/10, 9/10))
# Scatterplot
par(mar=c(4, 4, 1, 1))
plot(Murder ~ UrbanPop, USArrests, pch=16, col=grey(0, .5),
main=NA, xlab='Urban Population', ylab='Murder Arrests')
# Add Marginals
par(mar=c(0, 4, 1, 1))
xhist <- hist(USArrests[, 'UrbanPop'], plot=FALSE)
barplot(xhist[['counts']], axes=FALSE, space=0, border=NA)
par(mar=c(4, 0, 1, 1))
yhist <- hist(USArrests[, 'Murder'], plot=FALSE)
barplot(yhist[['counts']], axes=FALSE, space=0, horiz=TRUE, border=NA)
In contrast to marginal and joint distributions, conditional distributions describe how the distribution of one variable changes when we restrict attention to a subgroup defined by another variable. Formally, if \(\hat{p}_{xy}\) denotes the empirical joint distribution, then the empirical conditional distribution of \(Y_{i}\) given \(X_{i}=x\) is \[\begin{eqnarray} \hat{p}_{y \mid x} = \frac{\hat{p}_{xy}}{\hat{p}_{x} }, \end{eqnarray}\] where \(\hat{p}_{x} = \sum_y \hat{p}_{xy}\).
For example, suppose we observe a sample of \(n=13\) students with two discrete variables:
Assume the data are summarized by the following joint distribution table:
\[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female}) & \text{Row total}\\ \hline x=12 & 4/13 & 3/13 & 7/13\\ x=14 & 1/13 & 5/13 & 6/13\\ \hline \text{Column total} & 5/13 & 8/13 & 1 \end{array}\] To compute the conditional distribution of \(Y_{i}\) given \(X_{i}=x\), we rescale the numbers in each row so that each row sums to 1. Specifically, for \(x=12\), we compute \[\begin{eqnarray} \hat{p}_{y=0 \mid x=12} &=& \frac{\hat{p}_{x=12, y=0}}{\hat{p}_{x=12}} = \frac{4/13}{7/13} = \frac{4}{7} \\ \hat{p}_{y=1\mid x=12} &=& \frac{\hat p_{x=12, y=1}}{\hat{p}_{x=12}} = \frac{3/13}{7/13} = \frac{3}{7} \end{eqnarray}\] Similarly, for \(x=14\), we compute \[\begin{eqnarray} \hat{p}_{y=0 \mid x=14} &=& \frac{\hat{p}_{x=14, y=0}}{\hat{p}_{x=14}} = \frac{1/13}{6/13} = \frac{1}{6}. \\ \hat{p}_{y=1\mid x=14} &=& \frac{\hat p_{x=14, y=1}}{\hat{p}_{x=14}} = \frac{5/13}{6/13} = \frac{5}{6}. \end{eqnarray}\] Together, we find \[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female}) & \text{Row total}\\ \hline x=12 & 4/7 & 3/7 & 7/7\\ x=14 & 1/6 & 5/6 & 6/6\\ \hline \end{array}\]In this example, we say that
We can also compute the conditional distribution of \(X_{i}\) given \(Y_{i}=y\), \(\hat{p}_{x \mid y} = \hat{p}_{xy} / \hat{p}_{y}\),just as we did above. In this case, we rescale the numbers in each column of the joint distribution table so that each column sums to 1.
Conditional distributions change the unit of analysis:
These describe the relationship between \(\hat{Y}_{i}\) and \(\hat{X}_{i}\). We show how \(Y\) changes according to \(X\) using a histogram or bar plot. When \(X\) is continuous, as it often is, we split it into distinct bins and convert it to a factor variable. E.g.,
# Split Data by Urban Population above/below mean
pop_mean <- mean(USArrests[, 'UrbanPop'])
pop_cut <- USArrests[, 'UrbanPop'] <= pop_mean
murder_lowpop <- USArrests[pop_cut, 'Murder']
murder_highpop <- USArrests[!pop_cut, 'Murder']
cols <- c(low=rgb(0, 0, 1, .75), high=rgb(1, 0, 0, .75))
# Common Histogram
ylim <- c(0, .25)
xbks <- seq(min(USArrests[, 'Murder'])-1, max(USArrests[, 'Murder'])+1, by=1)
par(mfrow=c(1, 2))
hist(murder_lowpop,
breaks=xbks, col=cols[1],
main=NA,
xlab='Murder Arrests', freq=F,
border=NA, ylim=ylim)
title('Urban Pop <= Mean', font.main=1)
hist(murder_highpop,
breaks=xbks, col=cols[2],
main=NA,
xlab='Murder Arrests', freq=F,
border=NA, ylim=ylim)
title('Urban Pop > Mean', font.main=1)
It is sometimes preferable to show the ECDF instead. And you can glue various combinations together to convey more information all at once
layout( t(c(1, 2, 2)))
# Full Sample Density
hist(USArrests[, 'Murder'],
main=NA,
xlab='Murder Arrests',
breaks=xbks, freq=F, border=NA)
title('Full Sample Density', font.main=1)
# Split Sample Distribution Comparison
F_lowpop <- ecdf(murder_lowpop)
plot(F_lowpop, col=cols[1],
pch=16, xlab='Murder Arrests',
main=NA, bty='n')
title('Split Sample Distributions', font.main=1)
F_highpop <- ecdf(murder_highpop)
plot(F_highpop, add=T, col=cols[2], pch=16)
legend('bottomright', col=cols,
pch=16, bty='n', inset=c(0, .1),
title='% Urban Pop.',
legend=c('Low (<= Mean)', 'High (> Mean)'))
# Simple Interactive Scatter Plot
# plot(Assault ~ UrbanPop, USArrests, col=grey(0, .5), pch=16,
# cex=USArrests[, 'Murder']/diff(range(USArrests[, 'Murder']))*2,
# main='US Murder arrests (per 100, 000)')You can also split data into more than two groups. For more than three groups, boxplots are often more effective than histograms or ECDF’s.
# K Groups with even spacing (not even counts)
K <- 4
USArrests[, 'UrbanPop_Kcut'] <- cut(USArrests[, 'UrbanPop'], K)
table(USArrests[, 'UrbanPop_Kcut'] )
##
## (31.9,46.8] (46.8,61.5] (61.5,76.2] (76.2,91.1]
## 6 13 17 14
# Full sample
#boxplot(USArrests[, 'Murder'], main='',
# xlab='All Data', ylab='Murder Arrests')
# Boxplots for each group
Kcols <- hcl.colors(K, alpha=.5)
boxplot(Murder ~ UrbanPop_Kcut, USArrests,
main=NA, col=Kcols, border=NA,
varwidth=T, #show number of obs. per group
xlab='Urban Population', ylab='Murder Arrests')
# 4 Groups with equal numbers of observations
#Qcuts <- c(
# '0%'=min(USArrests[, 'UrbanPop'])-10*.Machine[['double.eps']],
# quantile(USArrests[, 'UrbanPop'], probs=c(.25, .5, .75, 1)))
#USArrests[, 'UrbanPop_cut'] <- cut(USArrests[, 'UrbanPop'], Qcuts)
#boxplot(Murder ~ UrbanPop_cut, USArrests, col=hcl.colors(4, alpha=.5))Conditional distributions can be tricky, so take your time making and interpreting them.
Consider the following example, which shows that \(400\) men and \(400\) women applied to university and that only a share were accepted in each department (English or Engineering). If you computed percents for each sex within each cell, you would find men have a higher overall acceptance rate, but women have higher acceptance rates within each department! This counterintuitive phenomena is known as Simpson’s Paradox, and the example is part of a real debate about discrimination (http://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf).
Even if women have higher admission rates within both departments, women can still have a lower overall admission rate if women disproportionately apply to the more selective department (English) and men disproportionately apply to the less selective department (Engineering). The same issue is relevant for a variety of labor market issues, such as the gender pay gap, and a great many other social issues, such as why are some countries rich and others poor.1
| Department | Men | Women | Total |
|---|---|---|---|
| English | 100 (40, \(40\%\)) | 350 (150, \(43\%\)) | 450 (190, \(42\%\)) |
| Engineering | 300 (160, \(53\%\)) | 50 (30, \(60\%\)) | 350 (190, \(54\%\)) |
| Total | 400 (200, \(50\%\)) | 400 (180, \(45\%\)) | 800 (380, \(48\%\)) |
The same issue shows up in continuous data. Here is an example where the overall relationship differs from the relationship within each group.

Explain the difference between a joint distribution, a marginal distribution, and a conditional distribution. Using the education-and-sex example from this chapter, describe in words what \(\hat{p}_{y=1 \mid x=14}\) tells you and how it differs from \(\hat{p}_{x=14 \mid y=1}\).
Suppose you observe a sample of \(n=20\) people with two discrete variables: \(X_{i}\) takes values in \(\{A, B\}\) and \(Y_{i}\) takes values in \(\{1, 2\}\). The counts are: \((A,1)=6\), \((A,2)=4\), \((B,1)=3\), \((B,2)=7\). Compute the joint distribution \(\hat{p}_{xy}\), both marginal distributions, and the conditional distribution \(\hat{p}_{y \mid x}\).
Using the USArrests dataset, split the variable Assault into two groups based on whether UrbanPop is above or below its median. Plot overlapping histograms of Assault for both groups with semi-transparent colors. Then plot the ECDF for each group on the same axes with a legend.
In general, one should also remember that a ratio can change due changes in either the numerator or the denominator.↩︎