2 Data & Visualization

2.1 Data Types

Basic Types.

The two basic types of data are cardinal (aka numeric) data and factor data. We can further distinguish between whether cardinal data are discrete or continuous. We can also further distinguish between whether factor data are ordered or not

Cardinal (Numeric): the difference between elements always means the same thing.
- Discrete: E.g. $\{ 1,2,3\}$ and notice that $2-1=3-2$.
- Continuous: E.g., $\{1.4348, 2.4348, 2.9, 3.9 \}$ and notice that $2.9-1.4348=3.9-2.4348$
Factor: the difference between elements does not always mean the same thing.
- Ordered: E.g., $\{1^{st}, 2^{nd}, 3^{rd}\}$ place in a race and notice that $1^{st}$ - $2^{nd}$ place does not equal $2^{nd}$ - $3^{rd}$ place for a very competitive person who cares only about winning.
- Unordered (categorical): E.g., $\{Amanda, Bert, Charlie\}$ and notice that $Amanda - Bert$ never makes sense.

Note that for theoretical analysis, the types are sometimes grouped differently as

Discrete (discrete cardinal, ordered factor, and unordered factor data). You can count the potential values. E.g., the set $\{A,B,C\}$ has three potential values.
Continuous (continuous cardinal data). There are uncountably infinite potential values. E.g., Try counting the numbers between $0$ and $1$ including decimal points, and notice that any two numbers have another potential number between them.

Test Yourself

Here are some examples

Code

dat_card1 <- c(1, 2, 3) # Cardinal data (Discrete)
dat_card1
## [1] 1 2 3

dat_card2 <- c(1.1, 2/3, 3) # Cardinal data (Continuous)
dat_card2
## [1] 1.1000000 0.6666667 3.0000000

dat_fact1 <- factor( c('A', 'B', 'C'), ordered=T) # Factor data (Ordinal)
dat_fact1
## [1] A B C
## Levels: A < B < C

dat_fact2 <- factor( c('Leipzig', 'Los Angeles', 'Logan'), ordered=F) # Factor data (Categorical)
dat_fact2
## [1] Leipzig     Los Angeles Logan      
## Levels: Leipzig Logan Los Angeles

dat_fact3 <- factor( c(T, F), ordered=F) # Factor data (Categorical)
dat_fact3
## [1] TRUE  FALSE
## Levels: FALSE TRUE

# Explicitly check the data types:
#class(dat_card1)
#class(dat_card2)

Strings.

Note that R allows for unstructured plain text, called character strings, which we can then format as factors

Code

c('A', 'B', 'C')  # character strings
## [1] "A" "B" "C"
c('Leipzig', 'Los Angeles', 'Logan')  # character strings
## [1] "Leipzig"     "Los Angeles" "Logan"

Also note that strings are encounter in a variety of settings, and you often have to format them after reading them into R (the RStudio regex cheatsheet is a handy reference).¹

Code

# Strings
paste( 'hi', 'mom')
## [1] "hi mom"
paste( c('hi', 'mom'), collapse='--')
## [1] "hi--mom"

kingText <- 'The king infringes the law on playing curling.'
gsub(pattern='ing', replacement='', kingText)
## [1] "The k infres the law on play curl."
# advanced usage
#gsub('[aeiouy]', '_', kingText)
#gsub('([[:alpha:]]{3})ing\\b', '\\1', kingText)

2.2 Datasets

Datasets can be stored in a variety of formats on your computer. But they can be analyzed in R in three basic ways.

Lists.

Lists are probably the most basic type

Code

x <- seq(1, 10)
y <- 2*x
list(x, y)  # list of vectors
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##  [1]  2  4  6  8 10 12 14 16 18 20

x_mat1 <- matrix( seq(2, 7), 2, 3)
x_mat2 <- matrix( seq(4, -1), 2, 3)
list(x_mat1, x_mat2)  # list of matrices
## [[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

Lists are useful for storing unstructured data

Code

list(list(x_mat1), list(x_mat2))  # list of lists
## [[1]]
## [[1]][[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## 
## [[2]]
## [[2]][[1]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

list(x_mat1, list(x_mat1, x_mat2)) # list of different objects
## [[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]]
## [[2]][[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]][[2]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

# ...inception...
list(x_mat1,
    list(x_mat1, x_mat2), 
    list(x_mat1, list(x_mat2)
    )) 
## [[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]]
## [[2]][[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]][[2]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1
## 
## 
## [[3]]
## [[3]][[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[3]][[2]]
## [[3]][[2]][[1]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

Data.frames.

A data.frame looks like a matrix but each column can be a different type of vector (or even list). This allows you to combine different data types into a single object for analysis, which is why it might be your most common object.

Code

# data.frames: your most common data type
    # matrix of different data-types
    # well-ordered lists
data.frame(x, y)  # list of vectors
##     x  y
## 1   1  2
## 2   2  4
## 3   3  6
## 4   4  8
## 5   5 10
## 6   6 12
## 7   7 14
## 8   8 16
## 9   9 18
## 10 10 20

Test Yourself

Create a data.frame storing two different types of data. Then show print only the second column

Code

dat <- data.frame(x=dat_fact2, y=dat_card2)
dat
##             x         y
## 1     Leipzig 1.1000000
## 2 Los Angeles 0.6666667
## 3       Logan 3.0000000

dat[, 'y']
## [1] 1.1000000 0.6666667 3.0000000

Arrays.

Arrays are generalization of matrices to multiple dimensions. They are a very efficient way to store well-formatted numeric data, and are often used in spatial econometrics and time series (often in the form of “data cubes”).

Code

# data square (matrix)
array(data = seq(1, 24), dim=c(3, 8))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    4    7   10   13   16   19   22
## [2,]    2    5    8   11   14   17   20   23
## [3,]    3    6    9   12   15   18   21   24

# data cube
a <- array(data = seq(1, 24), dim=c(3, 2, 4))
a
## , , 1
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   11
## [3,]    9   12
## 
## , , 3
## 
##      [,1] [,2]
## [1,]   13   16
## [2,]   14   17
## [3,]   15   18
## 
## , , 4
## 
##      [,1] [,2]
## [1,]   19   22
## [2,]   20   23
## [3,]   21   24

Code

a[1, , , drop = FALSE]  # Row 1
#a[, 1, , drop = FALSE]  # Column 1
#a[, , 1, drop = FALSE]  # Layer 1

a[ 1, 1,  ]  # Row 1, column 1
#a[ 1,  , 1]  # Row 1, 'layer' 1
#a[  , 1, 1]  # Column 1, 'layer' 1
a[1 , 1, 1]  # Row 1, column 1, 'layer' 1

Apply extends to arrays

Code

apply(a, 1, mean)    # Row means
## [1] 11.5 12.5 13.5
apply(a, 2, mean)    # Column means
## [1] 11 14
apply(a, 3, mean)    # 'Layer' means
## [1]  3.5  9.5 15.5 21.5
apply(a, c(1, 2), mean)  # Row/Column combination 
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
## [3,]   12   15

Outer products yield arrays

Code

x <- c(1, 2, 3)
x_mat1 <- outer(x, x) # x %o% x
x_mat1
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    4    6
## [3,]    3    6    9
is.array(x_mat1) # Matrices are arrays
## [1] TRUE

x_mat2 <- matrix( seq(6, 1), 2, 3)
outer(x_mat2, x)
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    6    4    2
## [2,]    5    3    1
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   12    8    4
## [2,]   10    6    2
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   18   12    6
## [2,]   15    9    3
# outer(x_mat2, matrix(x))
# outer(x_mat2, t(x))
# outer(x_mat1, x_mat2)

Initial Data Inspection.

Regardless of the data types you have, you typically begin by inspecting your data by examining the first few observations.

Consider, for example, historical data on crime in the US.

Code

head(USArrests) # Actual Data
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

# Check NA values
X <- c(3, 3.1, NA, 0.02) #Small dataset we will use in numerical examples
sum( is.na(X) )
## [1] 1

2.3 Proportions

In what follows, we will often work with data as vector, where there are $n$ observations and $\hat{X}_{i}$ is the value of the $i$th one. We often analyze observations in comparison to some value $x$.

Code

X <- c(3, 3.1, 0.02) # Data: 'big X'
x <- 2 # particular value: 'little x'
sum(X <= x)
## [1] 1
sum(X == x)
## [1] 0

Empirical Mass Function

If you have factor data, or discrete cardinal data, you can directly plot the proportions. For each unique outcome $x$ we compute \[\begin{eqnarray} \hat{p}_{x}=\sum_{i=1}^{n}\mathbf{1}\left(\hat{X}_{i}=x\right)/n. \end{eqnarray}\] where $n$ is the number of observations and $\sum_{i=1}^{n}\mathbf{1}\left(\hat{X}_{i}=x\right)$ counts the number of observations equal to $x$.

Code

# Discretized data
X <- USArrests[, 'Murder']
Xr <- floor(X) #rounded down, to be discrete

#table(Xr)
proportions <- table(Xr)/length(Xr)
plot(proportions, col=grey(0, .5),
    xlab='Murder Arrests (Discretized)',
    ylab='Proportion of States with each value')

There are several stylistic variations (a bar plot uses thick bars for each outcome, a lollipop plot uses skinny lines with a circle on top). The most important point is that the height of each line equals the proportion of data with a specific value.

Histogram Density.

If you have continuous data, you can use the histogram to measure the proportion of the data in different bins. It does so by dividing the range of the data into exclusive bins of equal “half-width”, $h$, and count the number of observations within each bin. We often rescale the counts, dividing by the number of observations $n$ multiplied by bin half-width $h$ (equivalently bin width $2h$), so that the total area of the histogram sums to one, which allows us to interpret the numbers as a density.

Mathematically, for an exclusive bin $\left(x-h, x+h\right]$ defined by their midpoint $x$ and half-width $h$, we compute \[\begin{eqnarray} \hat{f}(x) &=& \frac{ \sum_{i=1}^{n} \mathbf{1}\left( \hat{X}_{i} \in \left(x-h, x+h \right] \right) }{n 2h}, \end{eqnarray}\] where $\mathbf{1}()$ is an indicator function that equals $1$ if the expression inside is TRUE and $0$ otherwise.

Test Yourself

E.g., suppose $\hat{X}_{i}=3.8$ and $h=1/2$.

For $x=1$ we have $\mathbf{1}\left( \hat{X}_{i} \in \left(1-h, 1+h \right] \right)=\mathbf{1}\left( 3.8 \in \left(0.5, 1.5\right] \right)=0$.

For $x=4$ $\mathbf{1}\left( \hat{X}_{i} \in \left(4-h, 4+h \right] \right)=\mathbf{1}\left( 3.8 \in \left(3.5, 4.5\right] \right)=1$.

Note that rectangle area equals the proportion of data in the bin. Recalling that the area of the rectangle is “base x height”: \[\begin{eqnarray} 2h \times \hat{f}(x) &=& \frac{\sum_{i=1}^{n} \mathbf{1}\left( \hat{X}_{i} \in \left(x-h, x+h \right] \right)}{n}. \end{eqnarray}\] We compute $\hat{f}(x)$ for each bin midpoint $x$, and the area of all rectangles sums to one. ²

Must Know

For example, consider the dataset $\{3,3.1,0.02\}$ and use bins $(0,1], (1,2], (2,3], (3,4]$. In this case, the midpoints are $x=(0.5,1.5,2.5,3.5)$ and $h=1/2$. Then the counts at each midpoints are $(1,0,1,1)$. Since $\frac{1}{n 2h}=\frac{1}{3\times1}=\frac{1}{3}$, we can rescale the counts to compute the density as $\hat{f}(x)=(1,0,1,1) \frac{1}{3}=(1/3,0,1/3,1/3)$.

Code

# Intuitive Examples
X <- c(3, 3.1, 0.02)
Xhist <- hist(X, breaks=c(0, 1, 2, 3, 4), plot=F, freq=F)
Xhist
## $breaks
## [1] 0 1 2 3 4
## 
## $counts
## [1] 1 0 1 1
## 
## $density
## [1] 0.3333333 0.0000000 0.3333333 0.3333333
## 
## $mids
## [1] 0.5 1.5 2.5 3.5
## 
## $xname
## [1] "X"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

# base x height
base <- 1
height <- Xhist[['density']]
sum(base*height)
## [1] 1

For another example, use the bins $(0,2]$ and $(2,4]$. So the midpoints are $1$ and $3$, and the bin half-width is $1$. Only one observation, $0.02$, falls in the bin $(0,2]$. The other two observations, $3$ and $3.1$, fall into the bin $(2,4]$. The scaling factor is $\frac{1}{n 2h}=\frac{1}{3\times 2 \times 1}=\frac{1}{6}$. So the first bin has density $\hat{f}(1)=1\frac{1}{6}=1/6$ and the second bin has density $\hat{f}(3)=2\frac{1}{6}=2/6$. The area of the first bin’s rectangle is $2 \times \hat{f}(1)=2/6=1/3$ and the area of the second rectangle is $2 \times \hat{f}(3)=4/6=2/3$.

Now intuitively work through an example with three bins instead of four. Compute the areas

Code

Xhist <- hist(X, breaks=c(0, 4/3, 8/3, 4), plot=F, freq=F)
base <- 4/3
height <- Xhist[['density']]
sum(base*height)
## [1] 1

# as a default, R uses bins (, ] instead of [, )
# but you can change that with 'right=F'
# hist(X, breaks=c(0, 4/3, 8/3, 4), plot=F, right=F)

Code

# Practical Example
X <- USArrests[, 'Murder']
hist(X,
    breaks=seq(0, 20, by=1), #bin width=1
    freq=F,
    border=NA,
    main=NA,
    xlab='Murder Arrests')
title('Murder Arrests', font.main=1)
# Raw Observations
rug(USArrests[, 'Murder'], col=grey(0, .5))

Code


# Since h=1, the density equals the proportion of states in each bin
# Redo this example with h=2

Note that histograms qualitatively depict proportions over multiple bins (since each bar represents the proportion of data in its bins), but you can also directly compute proportions.

Test Yourself

E.g., the proportion of murder rates in $(2,4)$ is

Code

mean( X > 2 & X <4)
## [1] 0.22

Kernel Density.

A kernel density is a “smooth” version of a histogram.

The general idea behind kernel density is to use small bins around each $x$ that potentially overlap, rather than partitioning the range of $\hat{X}_{i}$ into exclusive bins. The “uniform/rectangular” kernel density places a weight of $1/2$ for all points in the interval $\left[ x-h, x + h\right]$ \[\begin{eqnarray} \label{eqn:uniform} k_{U}\left( \hat{X}_{i}, x, h \right) &=& \frac{\mathbf{1}\left(\frac{|\hat{X}_{i}-x|}{h} \leq 1\right)}{2} = \frac{\mathbf{1}\left( \hat{X}_{i} \in \left[ x-h, x + h\right] \right) }{2} \\ \hat{f}_{U}(x) &=& \frac{1}{nh} \sum_{i}^{n} k_{U}(\hat{X}_{i}, x, h) = \frac{ \sum_{i}^{n} \mathbf{1}\left( \hat{X}_{i} \in \left[ x-h, x + h\right] \right) }{n 2 h} \end{eqnarray}\]

Notice that the uniform kernel is essentially the histogram but without the restriction that $x$ must be a midpoint of exclusive bins. Typically, the points $x$ are chosen to be either the unique observations or some equidistant set of “design points” (e.g., at $512$ evenly spaced values of $x$ the dataset, not just the midpoints of exclusive bins).

Code

# Practical Example
X <- USArrests[, 'Murder']
hist(X,
    breaks=seq(0, 20, by=2), #bin width=1
    freq=F,
    border=NA,
    main=NA,
    xlab='Murder Arrests')
title('Murder Arrests', font.main=1)
# Density Estimate
lines( density(X, bw=3, kernel='rectangular') )
# Raw Observations
rug(X, col=grey(0, .5))

We can also replace the uniform kernel with other kernel functions to create even smoother lines.

Test Yourself

There are many kernels, but these are the most intuitive and commonly used.

Code

# Kernel Density Functions

x <- 0 # Design point
X <- seq(-2, 2, length.out=1001) # Where to compute weight


plot.new()
plot.window(xlim=c(-1.2, 1.2), ylim=c(0, 1))

## Uniform/rectangular
d_unif <- function(X, x, h){
    u <- abs(X-x)/h
    fu <- 1/2*(u <= 1)
    return(fu)
}
h <- 1
kU <- d_unif(X, x, h)
kern_cols <- hcl.colors(3, alpha=.45)
lines( kU ~ X, col=kern_cols[1], lty=1)

## Epanachnikov
d_epan <- function(X, x, h){
    u <- abs(X-x)/h
    fu <- 3/4*(1-u^2)*(u <= 1)
    return(fu)
}
h <- 1
kE <- d_epan(X, x, h)
lines( kE ~ X, col=kern_cols[2], lty=1)
## Try others using the 'density' function
# Note a slightly different definition for bandwith 
#kE <-  density(x=0, bw=h, kernel='epanechnikov')

## Tricubic
d_tricub <- function(X, x, h){
    u <- abs(X-x)/h
    fu <- 70/81*(1-u^3)^3*(u <= 1)/h
    return(fu)
}
h <- 1
kT <- d_tricub(X, x, h)
lines( kT ~ X, col=kern_cols[3], lty=1)


rug(0, lwd=2)
axis(1)
axis(2)

legend('topright', lty=1, col=kern_cols,
    legend=c('Uniform(1)', 'Epanachnikov(1)', 'Tricubic(1)'))

Once we have picked a kernel (which particular one is not particularly important) we can use it to compute density estimates at each design point.

Code

# Practical Example
X <- USArrests[, 'Murder']
hist(X,
    breaks=seq(0, 20, by=2), #bin width=1
    freq=F,
    border=NA,
    main=NA,
    xlab='Murder Arrests')
title('Murder Arrests', font.main=1)
# Density Estimate
lines( density(X, bw=2, kernel='epanechnikov') )
# Raw Observations
rug(X, col=grey(0, .5))

2.4 Distributions

**Empirical Cumulative Distribution Function**.

The ECDF counts the proportion of observations whose values are less than or equal to $x$; \[\begin{eqnarray} \hat{F}(x) = \frac{1}{n} \sum_{i}^{n} \mathbf{1}( \hat{X}_{i} \leq x). \end{eqnarray}\] Typically, we compute this for each unique value of $x$ in the dataset. The ECDF jumps up by $1/n$ at each of the $n$ data points (see this ECDF walkthrough for visual examples).

Must Know

For example, consider the dataset $\{3,3.1,0.02 \}$. We reorder the observations as $\{0.02, 3, 3.1 \}$, so that there are discrete jumps of $1/n=1/3$ at each value. Consider the points $x \in \{0.5,2.5,3.5\}$. At $x=0.5$, $F(0.5)$ measures the proportion of the data $\leq 0.5$. Since only one observations, $0.02$, of three is $\leq 0.5$, we can compute $F(0.5)=1/3$. Similarly, since only one observations, $0.02$, of three is $\leq 2.5$, we can compute $F(2.5)=1/3$. Since all observations are $\leq 3.5$, we can compute $F(3.5)=1$.

Code

X <- c(3, 3.1, 0.02)
Fhat <- ecdf(X)

# Visualized
plot(Fhat)

Code


# Evaluated at the data
x <- X
Fhat(X)
## [1] 0.6666667 1.0000000 0.3333333
#sum(X<=3)/length(X)
#sum(X<=3.1)/length(X)
#sum(X<=0.02)/length(X)

# Evaluated at other points
x <- c(0.5, 2.5, 3.5)
Fhat(x)
## [1] 0.3333333 0.3333333 1.0000000
#sum(X<=0.5)/length(X)
#sum(X<=2.5)/length(X)
#sum(X<=3.5)/length(X)

Code

F_murder <- ecdf(USArrests[, 'Murder'])
# proportion of murders <= 10
F_murder(10)
## [1] 0.7
# proportion of murders <= x, for all x
plot(F_murder, main=NA,
    xlab='Murder Arrests (x)',
    ylab='Proportion of States with Murder Arrests <= x',
    pch=16, col=grey(0, .5))
title('Murder Arrests ECDF', font.main=1)
rug(USArrests[, 'Murder'])

Quantiles.

You can summarize the distribution of data using quantiles: the $p$th quantile is a value of $x$ where $p$ percent of the data are below $x$ and ($1-p$) percent are above $x$.

The min is the smallest value (or the most negative value if there are any), where $0%$ of the data has lower values.
The median is the middle value, where one half of the data has lower values and the other half has higher values.
The max is the largest value (or the most negative value if there are any), where $100%$ of the data has lower values.

Must Know

For example, if $X=(0,0,0.02,3,5)$ then the median is $0.02$, the lower quartile is $0$, and the upper quartile is $3$. (The number $0$ is also special: the most frequent observation is called the mode.)

Code

X <-  c(3.1, 3, 0.02)
quantile(X, probs=c(0, .5, 1))
##   0%  50% 100% 
## 0.02 3.00 3.10

Now work through an intuitive example with observations $\{1,2,...,13\}$. Hint: split the ordered observations into four groups.

We are often also interested in quartiles: where $25\%$ of the data fall below $x$ (lower quartile) and where $75\%$ of the data fall below $x$ (upper quartile). Sometimes we are also interested in deciles: where $10\%$ of the data fall below $x$ (lower decile) and where $90\%$ of the data fall below $x$ (upper decile). In general, we can use the empirical quantile function \[\begin{eqnarray} \hat{Q}(p) = \hat{F}^{-1}(p), \end{eqnarray}\] to compute a quantile for any probability $p\in [0,1]$. The median corresponds to $p=0.5$, upper quartile to $p=0.75$, and upper decile to $p=0.9$.

Code

# common quantiles
X <- USArrests[, 'Murder']
quantile(X)
##     0%    25%    50%    75%   100% 
##  0.800  4.075  7.250 11.250 17.400

# All deciles are quantiles
quantile(X, probs=seq(0, 1, by=.1))
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##  0.80  2.56  3.38  4.75  6.00  7.25  8.62 10.12 12.12 13.32 17.40

# Visualized: Inverting the Empirical Distribution
FX_hat <- ecdf(X)
plot(FX_hat, lwd=2, xlim=c(0, 20),
    pch=16, col=grey(0, .5), main=NA)
title('Inverting the ECDF', font.main=1)
# Two Examples of Quantiles
p <- c(.25, .9) # Lower Quartile, Upper Decile
cols <- c(rgb(1, 0, 0, .8), rgb(0, 0, 1, .8))
QX_hat <- quantile(X, p, type=1)
QX_hat
##  25%  90% 
##  4.0 13.2
segments(QX_hat, p, -10, p, col=cols)
segments(QX_hat, p, QX_hat, 0, col=cols)
mtext( round(QX_hat, 2), 1, at=QX_hat, col=cols)

There are some issues with quantiles with smaller datasets. E.g., to compute the median of $\{3,3.1,0,1\}$, we need some ways to break ties (of which there are many options). The most common way is perhaps a simple average of the tied values: $(3+1)/2=2$.

Test Yourself

To calculate quantiles, the computer defaults to sorting the observations from smallest to largest as $\hat{X}_{(1)}, \hat{X}_{(2)},... \hat{X}_{(n)}$, and then computes quantiles as $\hat{X}_{ (p \times n) }$, where $p \in [0,1]$ refers to probability and $(p\times n)$ is rounded. There are different ways to break ties and the min ($p=0$) is a special case.

Code

X <- USArrests[, 'Murder']
Xo <- sort(X)
Xo
##  [1]  0.8  2.1  2.1  2.2  2.2  2.6  2.6  2.7  3.2  3.3  3.4  3.8  4.0  4.3  4.4
## [16]  4.9  5.3  5.7  5.9  6.0  6.0  6.3  6.6  6.8  7.2  7.3  7.4  7.9  8.1  8.5
## [31]  8.8  9.0  9.0  9.7 10.0 10.4 11.1 11.3 11.4 12.1 12.2 12.7 13.0 13.2 13.2
## [46] 14.4 15.4 15.4 16.1 17.4

# median
Xo[ceiling(length(Xo)*.5)]
## [1] 7.2
quantile(X, probs=.5, type=1) #tie break rule #1
## 50% 
## 7.2

## arbitrary quantiles
p <- 0.11
Xo[ceiling(length(Xo)*p)]
## [1] 2.6
quantile(X, probs=p, type=1)
## 11% 
## 2.6

# min is a special case
min(Xo)
## [1] 0.8
Xo[ceiling(length(Xo)*0)]
## numeric(0)
quantile(X, probs=0, type=1)
##  0% 
## 0.8
Xo[1]
## [1] 0.8

Boxplots.

Boxplots also summarize the distribution. The boxplot shows the median (solid black line) and interquartile range ($IQR=$ upper quartile $-$ lower quartile; filled box).³ As a default, whiskers are shown as $1.5\times IQR$ and values beyond that are highlighted as outliers (so whiskers do not typically show the data range). You can alternatively show all the raw data points instead of whisker+outliers.

Code

boxplot(USArrests[, 'Murder'],
    main=NA, ylab='Murder Arrests',
    whisklty=0, staplelty=0, outline=F)
title('Murder Arrests', font.main=1)
# Raw Observations
stripchart(USArrests[, 'Murder'],
    pch='-', col=grey(0, .5), cex=2,
    vert=T, add=T)

These univariate visualization tools extend to examining relationships between two variables. See Bivariate Distributions for how histograms, ECDFs, and boxplots adapt to bivariate data.

2.5 Exercises

Classify each of the following as cardinal (discrete or continuous) or factor (ordered or unordered): (a) temperature in Celsius, (b) letter grade on a transcript, (c) number of siblings, (d) province of residence. Briefly explain each choice.
The built-in dataset mtcars has a column mpg. Compute the histogram density by hand for the bins $(10,15]$, $(15,20]$, $(20,25]$, $(25,30]$, $(30,35]$. That is, for each bin count the number of observations, then divide by $n \times 2h$ where $2h=5$. Verify your answer matches hist(mtcars$mpg, breaks=seq(10,35,by=5), freq=F).
Load USArrests in R. Compute the ECDF of the Assault variable and use it to find the proportion of states with 200 or fewer assault arrests. Then find the upper quartile of Assault using quantile.

2.1 Data Types

Basic Types.

Strings.

2.2 Datasets

Lists.

Data.frames.

Arrays.

Initial Data Inspection.

2.3 Proportions

Empirical Mass Function

Histogram Density.

Kernel Density.

2.4 Distributions

Empirical Cumulative Distribution Function.

Quantiles.

Boxplots.

2.5 Exercises

Further Reading.

**Empirical Cumulative Distribution Function**.