10 Bivariate Distributions

We will now study two variables. The data for each observation data can be grouped together as a vector \((\hat{X}_{i}, \hat{Y}_{i})\).

Code

# Bivariate Data from USArrests
xy <- USArrests[,c('Murder','UrbanPop')]
xy[1,]
##         Murder UrbanPop
## Alabama   13.2       58

10.1 Joint Distributions

Discrete Data.

We have multiple observations of \((\hat{X}_{i}, \hat{Y}_{i})\), each of which corresponds to a unique value \((x,y)\). A frequency table shows how often each combination of values appear. The joint distribution counts up the number of pairs with the same values and divides by the number of obersvations, \(n\); \[\begin{eqnarray} \hat{p}_{xy} = \sum_{i=1}^{n}\mathbf{1}\left( \hat{X}_{i}=x, \hat{Y}_{i}=y \right)/n, \end{eqnarray}\] The marginal distributions are just the univariate information, which can be computed independently (see here ) and also from the joint distribution \[\begin{eqnarray} \hat{p}_{x} = \sum_{y} \hat{p}_{xy} \end{eqnarray}\]

For example, suppose we observe a sample of \(n=13\) students with two discrete variables:

\(X_{i}\) depicts years of education, taking values in \(\{12, 14\}\)
\(Y_{i}\) depicts sex, where \(y=1\) means female and \(y=0\) means male

Assume the count data are summarized by the following table:

\[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female})\\ \hline x=12 & 4 & 3 \\ x=14 & 1 & 5 \\ \hline \end{array}\] The joint distributions divides each cell count by the total number of observations to obtain \[\begin{array}{c|cc} & y=0 & y=1\\ \hline x=12 & 4/13 & 3/13 \\ x=14 & 1/13 & 5/13 \end{array}\]

The marginal distribution of \(X_{i}\) is \[\begin{eqnarray} \hat{p}_{x=12} &=& \frac{4}{13} + \frac{3}{13} = \frac{7}{13},\\ \hat{p}_{x=14} &=& \frac{1}{13} + \frac{5}{13} = \frac{6}{13}. \end{eqnarray}\] The marginal distribution of \(Y_{i}\) is \[\begin{eqnarray} \hat{p}_{y=0} &=& \frac{4}{13} + \frac{1}{13} = \frac{5}{13} \approx 0.38,\\ \hat{p}_{y=1} &=& \frac{3}{13} + \frac{5}{13} = \frac{8}{13} \approx 0.62. \end{eqnarray}\]

Together, this yields a frequency table with marginals \[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female}) & \text{Row total}\\ \hline x=12 & 4/13 & 3/13 & 7/13\\ x=14 & 1/13 & 5/13 & 6/13\\ \hline \text{Column total} & 5/13 & 8/13 & 1 \end{array}\]

Code

# Simple hand-built dataset: 13 students
dat <- rbind(
    c(12, 0), c(12, 0), c(12, 0), c(12, 0), #Male
    c(14, 0), 
    c(12, 1), c(12, 1), c(12, 1),
    c(14, 1), c(14, 1), c(14, 1), c(14, 1), c(14, 1))
colnames(dat) <- c('educ', 'female')
dat <- as.data.frame(dat)

# Frequency table
tab  <- table(dat)          # counts
prop <- tab/sum(tab)        # joint distribution p_xy

# joint distribution with marginals added
prop_m <- addmargins(prop)
round(prop_m,2)
##      female
## educ     0    1  Sum
##   12  0.31 0.23 0.54
##   14  0.08 0.38 0.46
##   Sum 0.38 0.62 1.00

Note that you can compute various univariate statistics using the marginal distributions. E.g., the mean of \(\hat{X}_{i}\) is \(\hat{M}_{X}=\sum_{x} x \hat{p}_{x}\). In the the above example, that equals \(12 \frac{7}{13} + 14\frac{6}{13} \approx 12.9\). Try computing the standard deviation \(\hat{S}_{X}\) yourself.

Code

# marginals and univariate statistics
p_x <- margin.table(prop, 1)
sum(p_x* as.numeric(names(p_x)) )
## [1] 12.92308
mean(dat[,1])
## [1] 12.92308

# p_y <- margin.table(prop, 2)

Tip

Summarize the distribution of sex and education attainment for highschool students

Code

library(wooldridge)

# Highschool Students, Sex and Education Attainment
dat <- wage1[,c('educ','female')]
dat <- dat[ dat[,'educ'] >=10,]

Try it yourself before looking at this answer

Code

tab <- table(dat) ## counts
tab
##     female
## educ   0   1
##   10  13  17
##   11  17  12
##   12  85 113
##   13  14  25
##   14  31  22
##   15  12   9
##   16  45  23
##   17  10   2
##   18  13   6

prop <- tab/sum(tab) ## frequencies
round(prop,3)
##     female
## educ     0     1
##   10 0.028 0.036
##   11 0.036 0.026
##   12 0.181 0.241
##   13 0.030 0.053
##   14 0.066 0.047
##   15 0.026 0.019
##   16 0.096 0.049
##   17 0.021 0.004
##   18 0.028 0.013

prop_full <- addmargins(prop) ## column and row sums
round(prop_full,3)
##      female
## educ      0     1   Sum
##   10  0.028 0.036 0.064
##   11  0.036 0.026 0.062
##   12  0.181 0.241 0.422
##   13  0.030 0.053 0.083
##   14  0.066 0.047 0.113
##   15  0.026 0.019 0.045
##   16  0.096 0.049 0.145
##   17  0.021 0.004 0.026
##   18  0.028 0.013 0.041
##   Sum 0.512 0.488 1.000

Continuous Data.

Scatterplots are used frequently to summarizes the joint distribution of continuous data. They can be enhanced in several ways. As a default, use semi-transparent points so as not to hide any points (and perhaps see if your observations are concentrated anywhere). You can also add other features that help summarize the relationship, although I will defer this until later.

Code

plot(Murder~UrbanPop, USArrests, pch=16, col=grey(0.,.5))

You can also show the marginal distributions of each variable along each axis.

Code

# Setup Plot
layout( matrix(c(2,0,1,3), ncol=2, byrow=TRUE),
    widths=c(9/10,1/10), heights=c(1/10,9/10))

# Scatterplot
par(mar=c(4,4,1,1))
plot(Murder~UrbanPop, USArrests, pch=16, col=rgb(0,0,0,.5))

# Add Marginals
par(mar=c(0,4,1,1))
xhist <- hist(USArrests[,'UrbanPop'], plot=FALSE)
barplot(xhist[['counts']], axes=FALSE, space=0, border=NA)

par(mar=c(4,0,1,1))
yhist <- hist(USArrests[,'Murder'], plot=FALSE)
barplot(yhist[['counts']], axes=FALSE, space=0, horiz=TRUE, border=NA)

10.2 Conditional Distributions

Discrete Data.

In contrast to marginal and joint distributions, conditional distributions describe how the distribution of one variable changes when we restrict attention to a subgroup defined by another variable. Formally, if \(\hat{p}_{xy}\) denotes the empirical joint distribution, then the empirical conditional distribution of \(Y_{i}\) given \(X_{i}=x\) is \[\begin{eqnarray} \hat{p}_{y \mid x} = \frac{\hat{p}_{xy}}{\hat{p}_{x} }, \end{eqnarray}\] where \(\hat{p}_{x} = \sum_y \hat{p}_{xy}\).

For example, suppose we observe a sample of \(n=13\) students with two discrete variables:

\(X_{i}\) depicts years of education, taking values in \(\{12, 14\}\)
\(Y_{i}\) depicts sex, where \(y=1\) means female and \(y=0\) means male

Assume the data are summarized by the following joint distribution table:

\[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female}) & \text{Row total}\\ \hline x=12 & 4/13 & 3/13 & 7/13\\ x=14 & 1/13 & 5/13 & 6/13\\ \hline \text{Column total} & 5/13 & 7/13 & 1 \end{array}\] To compute the conditional distribution of \(Y_{i}\) given \(X_{i}=x\), we rescale the numbers in each row so that each row sums to 1. Specifically, for \(x=12\), we compute \[\begin{eqnarray} \hat{p}_{y=0 \mid x=12} &=& \frac{\hat{p}_{x=12, y=0}}{\hat{p}_{x=12}} = \frac{4/13}{7/13} = \frac{4}{7} \\ \hat{p}_{y=1\mid x=12} &=& \frac{\hat p_{x=12, y=1}}{\hat{p}_{x=12}} = \frac{3/13}{7/13} = \frac{3}{7} \end{eqnarray}\] Similarly, for \(x=14\), we compute \[\begin{eqnarray} \hat{p}_{y=0 \mid x=14} &=& \frac{\hat{p}_{x=14, y=0}}{\hat{p}_{x=14}} = \frac{1/13}{6/13} = \frac{1}{6}. \\ \hat{p}_{y=1\mid x=14} &=& \frac{\hat p_{x=14, y=1}}{\hat{p}_{x=14}} = \frac{5/13}{6/13} = \frac{5}{6}. \end{eqnarray}\] Together, we find \[\begin{array}{c|cc|c} & y=0\ (\text{male}) & y=1\ (\text{female}) & \text{Row total}\\ \hline x=12 & 4/7 & 3/7 & 7/7\\ x=14 & 1/6 & 5/6 & 6/6\\ \hline \end{array}\]

In this example, we say that

conditional on students having \(12\) years of education, \(\frac{4}{7}\approx 57\%\) are male and \(\frac{3}{7}\approx 43\%\) are female.
conditional on students having \(14\) years of education, \(\frac{1}{6} \approx 17\%\) are male and \(\frac{5}{6} \approx 83\%\) are female.

We can also compute the conditional distribution of \(X_{i}\) given \(Y_{i}=y\), \(\hat{p}_{x \mid y} = \hat{p}_{xy} / \hat{p}_{y}\),just as we did above. In this case, we rescale the numbers in each column of the joint distribution table so that each column sums to 1.

Tip

Continuing the example above, show that

among male students, \(\approx 80\%\) have 12 years of education.
among female students, \(\approx 38\%\) have 12 years of education.

Try programming the results, especially if you are stuck or uncertain

Code

dat <- rbind(
    c(12, 0), c(12, 0), c(12, 0), c(12, 0), #Male
    c(14, 0), 
    c(12, 1), c(12, 1), c(12, 1),
    c(14, 1), c(14, 1), c(14, 1), c(14, 1), c(14, 1))
colnames(dat) <- c('educ', 'female')
dat <- as.data.frame(dat)

tab <- table(dat)
tab
##     female
## educ 0 1
##   12 4 3
##   14 1 5

# Joint distribution
round(prop.table(tab), 3)
##     female
## educ     0     1
##   12 0.308 0.231
##   14 0.077 0.385

# Conditional distribution of Y given X
round(prop.table(tab, 1), 3)
##     female
## educ     0     1
##   12 0.571 0.429
##   14 0.167 0.833

# Conditional distribution of X given Y
round(prop.table(tab, 2), 3)
##     female
## educ     0     1
##   12 0.800 0.375
##   14 0.200 0.625

Conditional distributions change the unit of analysis:

\(\hat{p}_{y=1 \mid x=14}\) answers “what fraction of student are female within the subgroup with 14 years of education?”
\(\hat{p}_{x=14 \mid y=1}\) answers “what fraction of students have 14 years of education within the subgroup that is female?”

Continuous Data.

These describe the relationship between \(\hat{Y}_{i}\) and \(\hat{X}_{i}\). We show how \(Y\) changes according to \(X\) using a histogram or bar plot. When \(X\) is continuous, as it often is, we split it into distinct bins and convert it to a factor variable. E.g.,

Code

# Split Data by Urban Population above/below mean
pop_mean <- mean(USArrests[,'UrbanPop'])
pop_cut <- USArrests[,'UrbanPop'] <= pop_mean
murder_lowpop <- USArrests[pop_cut,'Murder']
murder_highpop <- USArrests[!pop_cut,'Murder']
cols <- c(low=rgb(0,0,1,.75), high=rgb(1,0,0,.75))

# Common Histogram 
ylim <- c(0,.25)
xbks <-  seq(min(USArrests[,'Murder'])-1, max(USArrests[,'Murder'])+1, by=1)

par(mfrow=c(1,2))
hist(murder_lowpop,
    breaks=xbks, col=cols[1],
    main='Urban Pop <= Mean', font.main=1,
    xlab='Murder Arrests', freq=F,
    border=NA, ylim=ylim)

hist(murder_highpop,
    breaks=xbks, col=cols[2],
    main='Urban Pop > Mean', font.main=1,
    xlab='Murder Arrests', freq=F,
    border=NA, ylim=ylim)

It is sometimes it is preferable to show the ECDF instead. And you can glue various combinations together to convey more information all at once

Code

layout( t(c(1,2,2)))
# Full Sample Density
hist(USArrests[,'Murder'], 
    main='Full Sample Density', font.main=1,
    xlab='Murder Arrests',
    breaks=xbks, freq=F, border=NA)

# Split Sample Distribution Comparison
F_lowpop <- ecdf(murder_lowpop)
plot(F_lowpop, col=cols[1],
    pch=16, xlab='Murder Arrests',
    main='Split Sample Distributions',
    font.main=1, bty='n')
F_highpop <- ecdf(murder_highpop)
plot(F_highpop, add=T, col=cols[2], pch=16)

legend('bottomright', col=cols,
    pch=16, bty='n', inset=c(0,.1),
    title='% Urban Pop.',
    legend=c('Low (<= Mean)','High (> Mean)'))

Code

# Simple Interactive Scatter Plot
# plot(Assault~UrbanPop, USArrests, col=grey(0,.5), pch=16,
#    cex=USArrests[,'Murder']/diff(range(USArrests[,'Murder']))*2,
#    main='US Murder arrests (per 100,000)')

You can also split data into more than two groups. For more than three groups, boxplots are often more effective than histograms or ECDF’s.

Code

# K Groups with even spacing (not even counts)
K <- 4
USArrests[,'UrbanPop_Kcut'] <- cut(USArrests[,'UrbanPop'],K)
table(USArrests[,'UrbanPop_Kcut'] )
## 
## (31.9,46.8] (46.8,61.5] (61.5,76.2] (76.2,91.1] 
##           6          13          17          14

# Full sample
#boxplot(USArrests[,'Murder'], main='',
#    xlab='All Data', ylab='Murder Arrests')

# Boxplots for each group
Kcols <- hcl.colors(K,alpha=.5)
boxplot(Murder~UrbanPop_Kcut, USArrests,
    main='', col=Kcols, 
    varwidth=T, #show number of obs. per group
    xlab='Urban Population', ylab='')

Code


# 4 Groups with equal numbers of observations
#Qcuts <- c(
#    '0%'=min(USArrests[,'UrbanPop'])-10*.Machine[['double.eps']],
#    quantile(USArrests[,'UrbanPop'], probs=c(.25,.5,.75,1)))
#USArrests[,'UrbanPop']_cut <- cut(USArrests[,'UrbanPop'], Qcuts)
#boxplot(Murder~UrbanPop_cut, USArrests, col=hcl.colors(4,alpha=.5))

10.3 Simpson’s Paradox

Conditional distributions can be tricky, so take your time making and interpreting them.

Consider the following example, which shows that \(400\) men and \(400\) women applied to university and that only a share were accepted in each department (English or Engineering). If you computed percents for each sex within each cell, you would find men have a higher overall acceptance rate, but women have higher acceptance rates within each department! This counterintuitive phenomena is known as Simpson’s Paradox, and the example is part of a real debate about discrimination (http://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf).

Even if women have higher admission rates within both departments, women can still have a lower overall admission rate if women disproportionately apply to the more selective department (English) and men disproportionately apply to the less selective department (Engineering). The same issue is relevant for a variety of labor market issues, such as the gender pay gap, and a great many other social issues, such as why are some countries rich and others poor.¹

School Applicants (Admitted), by Sex and Department
Department	Men	Women	Total
English	100 (40, \(40\%\))	350 (150, \(43\%\))	450 (190, \(42\%\))
Engineering	300 (160, \(53\%\))	50 (30, \(60\%\))	350 (190, \(54\%\))
Total	400 (200, \(50\%\))	400 (180, \(45\%\))	800 (380, \(48\%\))

The same issue shows up in continuous data. Here is an example where the overall relationship differs from the relationship within each group.

10.4 Further Reading

For plotting histograms and marginal distributions, see

In general, one should also remember that a ratio can change due changes in either the numerator or the denominator.↩︎