For mixed data, \(\hat{Y}_{i}\) is a cardinal variable and \(\hat{X}_{i}\) is a factor variable (typically unordered). For such data, we analyze associations via group comparisons. The basic idea is easiest to see in a comparison of two samples, which corresponds to an \(\hat{X}_{i}\) with two categories: for example, the heights of men and women in Canada, the homicide rates in two different American states, or the wages of people with and without a completed degree.
There may be several differences between these samples. Often, the first summary statistic we investigate is the difference in means.
Mean Differences.
We often want to know if the means of different samples are different. To test this hypothesis, we compute the means separately for each sample and then examine the difference \[\begin{eqnarray}
\hat{D} = \hat{M}_{Y1} - \hat{M}_{Y2},
\end{eqnarray}\] with a null hypothesis that there is no difference in the population means.
Just as with one-sample tests, we can compute a standardized difference, where \(\hat{D}\) is converted into a \(t\) statistic. Note, however, that we have to compute the standard error for the difference statistic, which is a bit more complicated. Once we have it, we can easily conduct one- or two-sided hypothesis tests using a standard normal approximation.
Tip
Code
t_2sample <- function(Y1, Y2){
    # Difference between means
    m1 <- mean(Y1)
    m2 <- mean(Y2)
    d  <- (m1 - m2)
    # SE estimate (pooled variance)
    n1 <- length(Y1)
    n2 <- length(Y2)
    s1 <- var(Y1)
    s2 <- var(Y2)
    s  <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
    d_se <- sqrt(s*(1/n1 + 1/n2))
    # Unpooled alternative: d_se <- sqrt(s1/n1 + s2/n2)
    # t stat
    t_stat <- d/d_se
    return(t_stat)
}
tstat <- t_2sample(data[,'male'], data[,'female'])
tstat
Quantile Differences.
The above procedure generalizes from differences in means to other quantile statistics, like medians. Since the sampling distribution of a quantile difference is harder to derive analytically, we can bootstrap it.
# Bootstrap Distribution Function
boot_quant <- function(Y1, Y2, B=9999, prob=0.5, ...){
    bootstrap_diff <- vector(length=B)
    for(b in seq_along(bootstrap_diff)){
        Y1_b <- sample(Y1, replace=T)
        Y2_b <- sample(Y2, replace=T)
        q1_b <- quantile(Y1_b, probs=prob, ...)
        q2_b <- quantile(Y2_b, probs=prob, ...)
        d_b <- q1_b - q2_b
        bootstrap_diff[b] <- d_b
    }
    return(bootstrap_diff)
}

# 2-Sided Test for Median Differences
# d <- median(Y2) - median(Y1)
boot_d <- boot_quant(Y1, Y2, B=999, prob=0.5)
hist(boot_d, border=NA, font.main=1, main='Difference in Medians')
abline(v=quantile(boot_d, probs=c(.025, .975)), lwd=2)
abline(v=0, lwd=2, col=2)
Code
1 - ecdf(boot_d)(0)
## [1] 0.4154154
If we want to test for differences in medians across groups with independent observations, we can also use notches in a boxplot. If the notches of two boxes do not overlap, then there is rough evidence that the difference in medians is statistically significant. The square root of each sample size is also reflected in the width of each box.1
Code
boxplot(Y1, Y2, col=cols, notch=T, varwidth=T)
Note that bootstrap tests can perform poorly with highly unequal variances or skewed data. To see this yourself, make a simulation with skewed data and unequal variances.
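For example, here is one minimal simulation sketch, reusing the boot_quant function above with log-normal samples that share a median but differ greatly in spread and size (the specific parameter values are arbitrary):

Code

# Rough check of the false-rejection rate under the null (equal medians)
set.seed(1)
rejections <- replicate(200, {
    Y1_sim <- rlnorm(30, meanlog=0, sdlog=0.5)  # median = 1, small spread
    Y2_sim <- rlnorm(300, meanlog=0, sdlog=2)   # median = 1, large spread
    boot_sim <- boot_quant(Y1_sim, Y2_sim, B=299, prob=0.5)
    ci <- quantile(boot_sim, probs=c(.025, .975))
    (ci[1] > 0) | (ci[2] < 0)  # TRUE if 0 falls outside the bootstrap interval
})
mean(rejections)  # compare to the nominal 5% level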
In principle, we can also examine whether there are differences in spread or shape statistics such as sd and IQR, or skew and kurtosis. More often, however, we examine whether there are any differences in the distributions.
Tip
Here is an example of looking at differences in “spread”.
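One minimal sketch, assuming the Y1 and Y2 samples from above and bootstrapping the difference in IQRs (analogous to the median bootstrap above):

Code

# Bootstrap the difference in IQRs between the two samples
boot_iqr <- function(Y1, Y2, B=999){
    diffs <- vector(length=B)
    for(b in seq_along(diffs)){
        Y1_b <- sample(Y1, replace=T)
        Y2_b <- sample(Y2, replace=T)
        diffs[b] <- IQR(Y1_b) - IQR(Y2_b)
    }
    return(diffs)
}
boot_s <- boot_iqr(Y1, Y2)
hist(boot_s, border=NA, font.main=1, main='Difference in IQRs')
abline(v=quantile(boot_s, probs=c(.025, .975)), lwd=2)
abline(v=0, lwd=2, col=2)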
We can also examine whether there are any differences between the entire distributions. Often, we simply plot the two empirical CDFs. Another useful visualization is to plot the quantiles of one sample against those of the other: a quantile-quantile plot. For example, the first data point on the bottom left shows the lowest quantile for both distributions.
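Here is a minimal sketch of both plots, assuming the Y1 and Y2 samples from above:

Code

# Overlay the two empirical CDFs
plot(ecdf(Y1), col=1, main='Empirical CDFs', font.main=1)
plot(ecdf(Y2), col=2, add=T)

# Quantile-quantile plot of one sample against the other
qqplot(Y1, Y2)
abline(a=0, b=1, lty=2)  # 45-degree line: identical distributions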
The starting point for hypothesis testing is the Kolmogorov-Smirnov Statistic: the maximum absolute difference between the two CDFs over all sample data \(y \in \{Y_1\} \cup \{Y_2\}\). \[\begin{eqnarray}
\hat{KS} &=& \max_{y} |\hat{F}_{1}(y)- \hat{F}_{2}(y)|^{p},
\end{eqnarray}\] where \(p\) is an integer (typically 1). An intuitive alternative is the Cramer-von Mises Statistic: the sum of the absolute differences between the two CDFs, each raised to an integer power \(p\) (typically 2). \[\begin{eqnarray}
\hat{CVM} &=& \sum_{y} | \hat{F}_{1}(y)- \hat{F}_{2}(y)|^{p}.
\end{eqnarray}\]
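To see what these formulas compute, here is a minimal sketch that evaluates both statistics directly from the empirical CDFs, assuming the Y1 and Y2 samples from above (the twosamples package used below wraps this computation and its null distribution for us; its exact weighting may differ slightly):

Code

# Evaluate both empirical CDFs at the pooled sample points
y_all <- sort(c(Y1, Y2))
F1 <- ecdf(Y1)(y_all)
F2 <- ecdf(Y2)(y_all)

# KS: maximum absolute gap (p=1)
max(abs(F1 - F2))

# CvM-style: sum of squared gaps (p=2)
sum(abs(F1 - F2)^2)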
Just as before, you use bootstrapping for hypothesis testing.
Code
twosamples::ks_test(Y1, Y2)
## Test Stat   P-Value 
## 0.2892157 0.1015000 
twosamples::cvm_test(Y1, Y2)
## Test Stat  P-Value 
##  2.084253  0.081000 
Tip
Compare the distribution of arrests for two different counties, each with data over time.
Code
library(wooldridge)
countymurders
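One possible starting point, as a rough sketch (this assumes the countyid and arrests columns of countymurders; the choice of the first two counties is arbitrary):

Code

# Pick two counties and compare their distributions of arrests over time
ids <- unique(countymurders[,'countyid'])[1:2]
A1  <- countymurders[countymurders[,'countyid']==ids[1], 'arrests']
A2  <- countymurders[countymurders[,'countyid']==ids[2], 'arrests']

plot(ecdf(A1), col=1, main='Arrests: Two Counties', font.main=1)
plot(ecdf(A2), col=2, add=T)
twosamples::ks_test(A1, A2)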
Comparing Multiple Groups.
For multiple groups, we can test the equality of all distributions (i.e., whether at least one group is different). The Kruskal-Wallis test examines \(H_0:\; F_1 = F_2 = \dots = F_G\) versus \(H_A:\; \text{at least one } F_g \text{ differs}\), where \(F_g\) is the continuous distribution of group \(g=1,...G\). This test does not tell us which group is different.
To conduct the test, first denote individuals \(i=1,...n\) with overall ranks \(\hat{r}_1,...,\hat{r}_{n}\). Each individual belongs to group \(g=1,...G\), and each group \(g\) has \(n_{g}\) individuals with average rank \(\bar{r}_{g} = \sum_{i \in g} \hat{r}_{i} /n_{g}\). The Kruskal-Wallis statistic is \[\begin{eqnarray}
\hat{KW} &=& (n-1) \frac{\sum_{g=1}^{G} n_{g}( \bar{r}_{g} - \bar{r} )^2 }{\sum_{i=1}^{n} ( \hat{r}_{i} - \bar{r} )^2},
\end{eqnarray}\] where \(\bar{r} = \frac{n+1}{2}\) is the grand mean rank.
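As an illustration of this formula, here is a minimal sketch of the statistic computed by hand (kruskal.test, shown below, is the standard tool and also provides a p-value):

Code

# Manual Kruskal-Wallis statistic, following the formula above
kw_stat <- function(y, g){
    r <- rank(y)      # overall ranks
    r_bar <- mean(r)  # grand mean rank, (n+1)/2
    num <- sum(tapply(r, g, function(rg){ length(rg)*(mean(rg) - r_bar)^2 }))
    den <- sum((r - r_bar)^2)
    (length(y) - 1) * num / den
}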
In the special case with only two groups, \(G=2\), the Kruskal-Wallis test reduces to the Mann–Whitney U test (also known as the Wilcoxon rank-sum test). In this case, we can write the hypotheses in terms of individual outcomes, \(Y_i\) in one group and \(Y_j\) in the other: \(H_0: Prob(Y_i > Y_j)=Prob(Y_j > Y_i)\) versus \(H_A: Prob(Y_i > Y_j) \neq Prob(Y_j > Y_i)\). The corresponding test statistic is \[\begin{eqnarray}
\hat{U} &=& \min(\hat{U}_1, \hat{U}_2) \\
\hat{U}_g &=& \sum_{i\in g}\sum_{j\in -g}
\Bigl[\mathbf 1( \hat{Y}_{i} > \hat{Y}_{j}) + \tfrac12\mathbf 1(\hat{Y}_{i} = \hat{Y}_{j})\Bigr].
\end{eqnarray}\]
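Here is a minimal sketch of this statistic, assuming the Y1 and Y2 samples from above; the built-in wilcox.test reports the U value for its first sample as W:

Code

# Manual Mann-Whitney U, following the formula above
U_g <- function(Ya, Yb){
    sum(outer(Ya, Yb, '>')) + 0.5*sum(outer(Ya, Yb, '=='))
}
min(U_g(Y1, Y2), U_g(Y2, Y1))

# Built-in rank-sum test
wilcox.test(Y1, Y2)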
Code
library(AER)
data(CASchools)
CASchools[,'stratio'] <- CASchools[,'students']/CASchools[,'teachers']

# Does the student/teacher ratio differ for at least 1 county?
# Single test of multiple distributions
kruskal.test(CASchools[,'stratio'], CASchools[,'county'])
## 
##  Kruskal-Wallis rank sum test
## 
## data:  CASchools[, "stratio"] and CASchools[, "county"]
## Kruskal-Wallis chi-squared = 161.18, df = 44, p-value = 2.831e-15

# Multiple pairwise tests
# pairwise.wilcox.test(CASchools[,'stratio'], CASchools[,'county'])
12.3 Theory
Under the assumption that both populations are normally distributed, we can analytically derive the sampling distribution for the difference-in-means. In particular, the \(t\)-statistic \[\begin{eqnarray}
\hat{t} &=& \frac{ \hat{M}_{Y1} - \hat{M}_{Y2} }{ \sqrt{\hat{S}_{Y1}+\hat{S}_{Y2}}/\sqrt{n} }
\end{eqnarray}\] follows Student’s t-distribution. Welch’s \(t\)-statistic is an adjustment for two normally distributed populations with potentially unequal variances or sample sizes.
t.test(Y1, Y2, var.equal=F)
## 
##  Welch Two Sample t-test
## 
## data:  Y1 and Y2
## t = -4.1201, df = 163.58, p-value = 6.001e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2702451 -0.4471747
## sample estimates:
## mean of x mean of y 
## 0.1194699 0.9781798
Let each group \(g\) have median \(\tilde{M}_{g}\), interquartile range \(\hat{IQR}_{g}\), and \(n_{g}\) observations. We can estimate the standard error of the median as \(\tilde{S}_{g}= \frac{1.25 \hat{IQR}_{g}}{1.35 \sqrt{n_{g}}}\). As a rough guide, the interval \(\tilde{M}_{g} \pm 1.7 \tilde{S}_{g}\) is the historical default and is displayed as a notch in the boxplot. See also https://www.tandfonline.com/doi/abs/10.1080/00031305.1978.10479236.↩︎