4 Data


4.1 Types

The two basic types of data are cardinal and factor data. We can further distinguish cardinal data as discrete or continuous, and factor data as ordered or unordered.

  • cardinal: the difference between elements always means the same thing.
    • discrete: E.g., 2-1=3-2.
    • continuous: E.g., 2.11-1.4444=3.11-2.4444.
  • factor: the difference between elements does not always mean the same thing.
    • ordered: E.g., First place - Second place \(\neq\) Second place - Third place.
    • unordered (categorical): E.g., A - B is undefined.
d1d <- 1:3 # Cardinal data (Discrete)
d1d
## [1] 1 2 3
class(d1d)
## [1] "integer"
d1c <- c(1.1, 2/3, 3) # Cardinal data (Continuous)
d1c
## [1] 1.1000000 0.6666667 3.0000000
class(d1c)
## [1] "numeric"
d2o <- factor(c('A','B','C'), ordered=T) # Factor data (Ordinal)
d2o
## [1] A B C
## Levels: A < B < C
class(d2o)
## [1] "ordered" "factor"
d2c <- factor(c('Leipzig','Los Angeles','Logan'), ordered=F) # Factor data (Categorical)
d2c
## [1] Leipzig     Los Angeles Logan      
## Levels: Leipzig Logan Los Angeles
class(d2c)
## [1] "factor"

R also allows for more unstructured data types.

c('hello world', 'hi mom')  # character strings
## [1] "hello world" "hi mom"
list(d1c, d2c)  # lists
## [[1]]
## [1] 1.1000000 0.6666667 3.0000000
## 
## [[2]]
## [1] Leipzig     Los Angeles Logan      
## Levels: Leipzig Logan Los Angeles
list(d1c, c('hello world'),
    list(d1d, list('...inception...'))) # lists
## [[1]]
## [1] 1.1000000 0.6666667 3.0000000
## 
## [[2]]
## [1] "hello world"
## 
## [[3]]
## [[3]][[1]]
## [1] 1 2 3
## 
## [[3]][[2]]
## [[3]][[2]][[1]]
## [1] "...inception..."
# data.frames: your most common data type
    # matrix of different data-types
    # well-ordered lists
d0 <- data.frame(y=d1c, x=d2c)
d0
##           y           x
## 1 1.1000000     Leipzig
## 2 0.6666667 Los Angeles
## 3 3.0000000       Logan

Strings.

paste( 'hi', 'mom')
## [1] "hi mom"
paste( c('hi', 'mom'), collapse='--')
## [1] "hi--mom"
kingText <- "The king infringes the law on playing curling."
gsub(pattern="ing", replacement="", kingText)
## [1] "The k infres the law on play curl."
# advanced usage
#gsub("[aeiouy]", "_", kingText)
#gsub("([[:alpha:]]{3,})ing\\b", "\\1", kingText) 


Initial Inspection.

You typically begin by examining the first few observations of your data.

Consider, for example, historical data on crime in the US.

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
# Check for missing (NA) values
sum(is.na(USArrests))
## [1] 0

To further examine a particular variable, we look at its distribution.

4.2 Empirical Distributions

In what follows, we will denote the data for a single variable as \(\{X_{i}\}_{i=1}^{N}\), where there are \(N\) observations and \(X_{i}\) is the value of the \(i\)th one.

Histogram. The histogram divides the range of \(\{X_{i}\}_{i=1}^{N}\) into \(L\) exclusive bins of equal width \(h=[\text{max}(X_{i}) - \text{min}(X_{i})]/L\), and counts the number of observations within each bin. We often scale the counts so that the numbers can be interpreted as a density. Mathematically, for an exclusive bin with midpoint \(x\), we compute \[\begin{eqnarray} \widehat{f}_{HIST}(x) &=& \frac{ \sum_{i}^{N} \mathbf{1}\left( X_{i} \in \left[x-\frac{h}{2}, x+\frac{h}{2} \right) \right) }{N h}. \end{eqnarray}\] We compute \(\widehat{f}_{HIST}(x)\) at each bin midpoint, \(x \in \left\{ \text{min}(X) + \left(\ell - \frac{1}{2}\right) h \right\}_{\ell=1}^{L}\).

hist(USArrests$Murder, freq=F,
    border=NA, main='', xlab='Murder Arrests')
# Raw Observations
rug(USArrests$Murder, col=grey(0,.5))
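
To connect the formula to the plot, here is a minimal sketch that computes \(\widehat{f}_{HIST}(x)\) by hand for the first bin: its count divided by \(Nh\). The number of bins L below is an illustrative choice, not what hist() selects by default.

X <- USArrests$Murder
N <- length(X)
L <- 10 # illustrative number of bins
h <- (max(X) - min(X))/L # bin width
x <- min(X) + h/2 # midpoint of the first bin
# count observations in [x-h/2, x+h/2) and scale by N*h
sum(X >= x - h/2 & X < x + h/2)/(N*h)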

Note that if your data are discrete, you can directly plot the counts. E.g.,

x <- floor(USArrests$Murder) #Discretized
plot(table(x), xlab='Murder Rate (Discrete)', ylab='Count')

Empirical Cumulative Distribution Function. The ECDF gives the proportion of observations whose values \(X_{i}\) are less than or equal to \(x\); \[\begin{eqnarray} \widehat{F}_{ECDF}(x) = \frac{1}{N} \sum_{i}^{N} \mathbf{1}(X_{i} \leq x) \end{eqnarray}\] for each unique value of \(x\) in the dataset.

F_murder <- ecdf(USArrests$Murder)
# proportion of murder arrest rates <= 10
F_murder(10)
## [1] 0.7
# proportion of murder arrest rates <= x, for all x
plot(F_murder, main='', xlab='Murder Arrests',
    pch=16, col=grey(0,.5))
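
The ECDF is simple enough to compute directly from its definition; this one-liner reproduces F_murder(10).

# proportion of observations with X_i <= 10, from the definition
mean(USArrests$Murder <= 10)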

Boxplots. Boxplots summarize the distribution of data using quantiles: the \(q\)th quantile is the value below which a proportion \(q\) of the data fall and above which the remaining \(1-q\) fall.

  • The “median” is the point where half of the data has lower values and the other half has higher values.
  • The “lower quartile” is the point where 25% of the data has lower values and the other 75% has higher values.
  • The “min” is the smallest value, where 0% of the data has lower values.
x <- USArrests$Murder

# quantiles
median(x)
## [1] 7.25
range(x)
## [1]  0.8 17.4
quantile(x, probs=c(0,.25,.5))
##    0%   25%   50% 
## 0.800 4.075 7.250
# deciles are quantiles
quantile(x, probs=seq(0,1, by=.1))
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##  0.80  2.56  3.38  4.75  6.00  7.25  8.62 10.12 12.12 13.32 17.40

To compute quantiles, we sort the observations from smallest to largest as \(X_{(1)}, X_{(2)},... X_{(N)}\), and then compute quantiles as \(X_{ (q*N) }\). Note that \((q*N)\) is rounded and there are different ways to break ties.

xo <- sort(x)
xo
##  [1]  0.8  2.1  2.1  2.2  2.2  2.6  2.6  2.7  3.2  3.3  3.4  3.8  4.0  4.3  4.4
## [16]  4.9  5.3  5.7  5.9  6.0  6.0  6.3  6.6  6.8  7.2  7.3  7.4  7.9  8.1  8.5
## [31]  8.8  9.0  9.0  9.7 10.0 10.4 11.1 11.3 11.4 12.1 12.2 12.7 13.0 13.2 13.2
## [46] 14.4 15.4 15.4 16.1 17.4
# median
xo[length(xo)*.5]
## [1] 7.2
quantile(x, probs=.5, type=4)
## 50% 
## 7.2
# min
xo[1]
## [1] 0.8
min(xo)
## [1] 0.8
quantile(xo,probs=0)
##  0% 
## 0.8

The boxplot shows the median (solid black line) and interquartile range (\(IQR=\) upper quartile \(-\) lower quartile; filled box),1 as well as extreme values flagged as outliers: points beyond the whiskers, which extend up to \(1.5\times IQR\) beyond the box.

boxplot(USArrests$Murder, main='', ylab='Murder Arrests')
# Raw Observations
stripchart(USArrests$Murder,
    pch='-', col=grey(0,.5), cex=2,
    vert=T, add=T)
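
For reference, here is a minimal sketch of the box and whisker fences computed from quantiles. (It uses quantile() rather than the hinge definitions that boxplot() uses, so the values can differ slightly; see the footnote.)

# Box: lower and upper quartiles
q <- unname(quantile(USArrests$Murder, probs=c(.25,.75)))
iqr <- q[2] - q[1] # interquartile range
# Whisker fences: 1.5*IQR beyond the box
c(lower=q[1] - 1.5*iqr, upper=q[2] + 1.5*iqr)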

4.3 Joint Distributions

Scatterplots are frequently used to summarize the joint relationship between two variables. They can be enhanced in several ways. As a default, use semi-transparent points so that overlapping observations do not hide one another (this also shows where your observations are concentrated).

You can also add regression lines (and confidence intervals), although I will defer this until later.

plot(Murder~UrbanPop, USArrests, pch=16, col=grey(0.,.5))

# Add the line of best fit for pooled data
#reg <- lm(Murder~UrbanPop, data=USArrests)
#abline(reg, lty=2)

Marginal Distributions. You can also show the distribution of each variable along its own axis.

# Setup Plot
layout( matrix(c(2,0,1,3), ncol=2, byrow=TRUE),
    widths=c(9/10,1/10), heights=c(1/10,9/10))

# Scatterplot
par(mar=c(4,4,1,1))
plot(Murder~UrbanPop, USArrests, pch=16, col=rgb(0,0,0,.5))

# Add Marginals
par(mar=c(0,4,1,1))
xhist <- hist(USArrests$UrbanPop, plot=FALSE)
barplot(xhist$counts, axes=FALSE, space=0, border=NA)

par(mar=c(4,0,1,1))
yhist <- hist(USArrests$Murder, plot=FALSE)
barplot(yhist$counts, axes=FALSE, space=0, horiz=TRUE, border=NA)

4.4 Conditional Distributions

It is easy to show how distributions change according to a third variable using data splits. E.g.,

# Tailored Histogram 
ylim <- c(0,8)
xbks <-  seq(min(USArrests$Murder)-1, max(USArrests$Murder)+1, by=1)

# Also show more information
# Split Data by Urban Population above/below mean
pop_mean <- mean(USArrests$UrbanPop)
murder_lowpop <- USArrests[USArrests$UrbanPop< pop_mean,'Murder']
murder_highpop <- USArrests[USArrests$UrbanPop>= pop_mean,'Murder']
cols <- c(low=rgb(0,0,1,.75), high=rgb(1,0,0,.75))

par(mfrow=c(1,2))
hist(murder_lowpop,
    breaks=xbks, col=cols[1],
    main='Urban Pop < Mean', font.main=1,
    xlab='Murder Arrests',
    border=NA, ylim=ylim)

hist(murder_highpop,
    breaks=xbks, col=cols[2],
    main='Urban Pop >= Mean', font.main=1,
    xlab='Murder Arrests',
    border=NA, ylim=ylim)

It is sometimes preferable to show the ECDF instead. You can also glue various combinations together to convey more information all at once.

par(mfrow=c(1,2))
# Full Sample Density
hist(USArrests$Murder, 
    main='Density Function Estimate', font.main=1,
    xlab='Murder Arrests',
    breaks=xbks, freq=F, border=NA)

# Split Sample Distribution Comparison
F_lowpop <- ecdf(murder_lowpop)
plot(F_lowpop, col=cols[1],
    pch=16, xlab='Murder Arrests',
    main='Distribution Function Estimates',
    font.main=1, bty='n')
F_highpop <- ecdf(murder_highpop)
plot(F_highpop, add=T, col=cols[2], pch=16)

legend('bottomright', col=cols,
    pch=16, bty='n', inset=c(0,.1),
    title='% Urban Pop.',
    legend=c('Low (< Mean)','High (>= Mean)'))

You can also split data into grouped boxplots in the same way.

layout( t(c(1,2,2)))
boxplot(USArrests$Murder, main='',
    xlab='All Data', ylab='Murder Arrests')

# K Groups with even spacing
K <- 3
USArrests$UrbanPop_Kcut <- cut(USArrests$UrbanPop,K)
Kcols <- hcl.colors(K,alpha=.5)
boxplot(Murder~UrbanPop_Kcut, USArrests,
    main='', col=Kcols,
    xlab='Urban Population', ylab='')

# 4 Groups with equal numbers of observations
#Qcuts <- c(
#    '0%'=min(USArrests$UrbanPop)-10*.Machine$double.eps,
#    quantile(USArrests$UrbanPop, probs=c(.25,.5,.75,1)))
#USArrests$UrbanPop_cut <- cut(USArrests$UrbanPop, Qcuts)
#boxplot(Murder~UrbanPop_cut, USArrests, col=hcl.colors(4,alpha=.5))

Conditional Relationships. You can also use size, color, and shape to further distinguish different conditional relationships.

# High Assault Areas
assault_high <- USArrests$Assault > median(USArrests$Assault)
cols <- ifelse(assault_high, rgb(1,0,0,.5), rgb(0,0,1,.5))

# Scatterplot
# Show High Assault Areas via 'cex=' or 'pch='
plot(Murder~UrbanPop, USArrests, pch=16, col=cols)

# Could also add regression lines y for each data split
#reg_high <- lm(Murder~UrbanPop, data=USArrests[assault_high,])
#abline(reg_high, lty=2, col=rgb(1,0,0,1))
#reg_low <- lm(Murder~UrbanPop, data=USArrests[!assault_high,])
#abline(reg_low, lty=2, col= rgb(0,0,1,1))

4.5 Random Variables

Random variables are variables whose values are generated by a probabilistic process.

  • The sample space of a random variable refers to the set of all possible outcomes.
  • The probability of a particular set of outcomes is the proportion of times those outcomes occur in the long run.

There are two basic types of sample spaces:

Discrete. The random variable can take one of several discrete values. E.g., any number in \(\{1,2,3,...\}\).

# Bernoulli (Coin Flip: Heads=1 Tails=0)
rbinom(1, 1, 0.5) # 1 draw
## [1] 0
rbinom(4, 1, 0.5) # 4 draws
## [1] 0 0 1 1
x0 <- rbinom(600, 1, 0.5)

# Cumulative Averages
x0_t <- seq_len(length(x0))
x0_mt <- cumsum(x0)/x0_t
plot(x0_t, x0_mt, type='l',
    ylab='Cumulative Average',
    xlab='Flip #')

# Long run proportions
x0 <- rbinom(2000, 1, 0.5)
hist(x0, breaks=50, border=NA, main=NA, freq=T)

# Bernoulli (Unfair Coin Flip)
x0 <- rbinom(2000, 1, 0.2)
hist(x0, breaks=50, border=NA, main=NA, freq=T)

# Discrete Uniform (numbers 1,...4)
# sample(1:4, 1, replace=T, prob=rep(1/4,4) ) # 1 draw, equal probabilities
x1 <- sample(1:4, 2000, replace=T, prob=rep(1/4,4))
hist(x1, breaks=50, border=NA, main=NA, freq=T)

# Multinoulli (aka Categorical)
x1 <- sample(1:4, 2000, replace=T, prob=c(3,4,1,2)/10) # unequal probabilities
hist(x1, breaks=50, border=NA, main=NA, freq=T)

Continuous. The random variable can take any one of an uncountably infinite number of values. E.g., any number in the interval \([0,1]\), allowing for any number of decimal places.

# Continuous Uniform
runif(3) # 3 draws
## [1] 0.02080043 0.56223900 0.57047505
x2 <- runif(2000)
hist(x2, breaks=20, border=NA, main=NA, freq=F)

# Normal (Gaussian)
rnorm(3) # 3 draws
## [1] -0.7031871 -2.4176092 -0.4144168
x3 <- rnorm(2000)
hist(x3, breaks=20, border=NA, main=NA, freq=F)

We might further distinguish types of random variables based on whether their maximum value is theoretically finite or infinite.
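
For example, draws from runif() can never exceed 1, while the maximum of rnorm() draws has no theoretical cap. A quick illustrative check:

# Bounded support: the sample maximum never exceeds 1
max(runif(1e5))
# Unbounded support: the sample maximum tends to grow with n
sapply(c(1e2, 1e3, 1e4, 1e5), function(n) max(rnorm(n)))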

Probability distributions.

Random variables are drawn from probability distributions. The most common ones are easily accessible.

All random variables have an associated theoretical Cumulative Distribution Function: \(F_{X}(x) =\) Probability\((X_{i} \leq x)\). Continuous random variables also have an associated density function, \(f_{X}\), as well as a quantile function, \(Q_{X}(p)\), which is the inverse of the CDF: the value \(x\) below which a proportion \(p\) of the probability mass falls.
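
In R, these functions follow the d/p/q/r naming convention (density, probability/CDF, quantile, random draws). A minimal sketch with the standard normal:

dnorm(0) # f_X(0): density at 0
pnorm(1.96) # F_X(1.96): probability of a draw <= 1.96
qnorm(0.975) # Q_X(0.975): value below which 97.5% of the mass lies
qnorm(pnorm(1.96)) # the quantile function inverts the CDF: returns 1.96
rnorm(3) # three random draws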

Here is an example of the Beta distribution.

pars <- expand.grid( c(.5,1,2), c(.5,1,2) )
par(mfrow=c(3,3))
apply(pars, 1, function(p){
    x <- seq(0,1,by=.01)
    fx <- dbeta( x,p[1], p[2])
    plot(x, fx, type='l', xlim=c(0,1), ylim=c(0,4), lwd=2)
    #hist(rbeta(2000, p[1], p[2]), breaks=50, border=NA, main=NA, freq=F)
})
title('Beta densities', outer=T, line=-1)

Here is a more in-depth example using the Dagum distribution.

# Quantile Function (VGAM::qdagum)
# (In addition to the main equation, there are many checks and options for consistency with other functions)
qdagum <- function(p, scale = 1, shape1.a, shape2.p, lower.tail = TRUE, log.p = FALSE) {
  LLL <- max(length(p), length(shape1.a), length(scale), length(shape2.p))
  
  if (length(p) < LLL) p <- rep_len(p, LLL)
  if (length(shape1.a) < LLL) shape1.a <- rep_len(shape1.a, LLL)
  if (length(scale) < LLL) scale <- rep_len(scale, LLL)
  if (length(shape2.p) < LLL) shape2.p <- rep_len(shape2.p, LLL)

  if (lower.tail) {
    if (log.p) {
      ln.p <- p
      ans <- scale * (expm1(-ln.p / shape2.p))^(-1 / shape1.a)
      ans[ln.p > 0] <- NaN
    } else {
      ans <- scale * (expm1(-log(p) / shape2.p))^(-1 / shape1.a)
      ans[p < 0] <- NaN
      ans[p == 0] <- 0
      ans[p == 1] <- Inf
      ans[p > 1] <- NaN
    }
  } else {
    if (log.p) {
      ln.p <- p
      ans <- scale * (expm1(-log(-expm1(ln.p)) / shape2.p))^(-1 / shape1.a)
      ans[ln.p > 0] <- NaN
    } else {
      # Main equation (theoretically derived from the CDF)
      ans <- scale * (expm1(-log1p(-p) / shape2.p))^(-1 / shape1.a)
      ans[p < 0] <- NaN
      ans[p == 0] <- Inf
      ans[p == 1] <- 0
      ans[p > 1] <- NaN
    }
  }
  ans[scale <= 0 | shape1.a <= 0 | shape2.p <= 0] <- NaN
  return(ans)
}

# Generate Random Variables (VGAM::rdagum)
rdagum <-function(n, scale=1, shape1.a, shape2.p){
    p <- runif(n) # generate a random quantile
    qdagum(p, scale=scale, shape1.a=shape1.a, shape2.p=shape2.p)
}

# Example
set.seed(123)
x <- rdagum(3000,1,3,1)

# Empirical Distribution
Fx_hat <- ecdf(x)
plot(Fx_hat, lwd=2, xlim=c(0,5))

# Two Quantiles
p <- c(.25, .9)
cols <- c(2,4)
Qx_hat <- quantile(x, p)
segments(Qx_hat,p,-10,p, col=cols)
segments(Qx_hat,p,Qx_hat,0, col=cols)
mtext( round(Qx_hat,2), 1, at=Qx_hat, col=cols)
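
As a sanity check, we can overlay the theoretical Dagum CDF, \(F(x) = \left(1 + (x/b)^{-a}\right)^{-p}\), which is exactly what the quantile function above inverts.

# Hand-rolled theoretical CDF (the inverse of qdagum above)
pdagum_check <- function(x, scale=1, shape1.a, shape2.p){
    (1 + (x/scale)^(-shape1.a))^(-shape2.p)
}
curve(pdagum_check(x, scale=1, shape1.a=3, shape2.p=1),
    add=TRUE, col='orange', lwd=2)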

We will return to the theory behind probability distributions in a later chapter.


  1. Technically, the upper and lower “hinges” use two different versions of the first and third quartile. See https://stackoverflow.com/questions/40634693/lower-and-upper-quartiles-in-boxplot-in-r↩︎