3  Data


3.1 Types

Basic Types.

The two basic types of data are cardinal (aka numeric) data and factor data. Cardinal data can be further distinguished as discrete or continuous, and factor data as ordered or unordered.

  • Cardinal (Numeric): the difference between elements always means the same thing.
    • Discrete: E.g., \(2-1=3-2\).
    • Continuous: E.g., \(2.9-1.4348=3.9-2.4348\)
  • Factor: the difference between elements does not always mean the same thing.
    • Ordered: E.g., First place \(-\) Second place \(\neq\) Second place \(-\) Third place.
    • Unordered (categorical): E.g., \(A - B\) is undefined.

Here are some examples:

Code
dat_card1 <- 1:3 # Cardinal data (Discrete)
dat_card1
## [1] 1 2 3

dat_card2 <- c(1.1, 2/3, 3) # Cardinal data (Continuous)
dat_card2
## [1] 1.1000000 0.6666667 3.0000000

dat_fact1 <- factor( c('A','B','C'), ordered=T) # Factor data (Ordinal)
dat_fact1
## [1] A B C
## Levels: A < B < C

dat_fact2 <- factor( c('Leipzig','Los Angeles','Logan'), ordered=F) # Factor data (Categorical)
dat_fact2
## [1] Leipzig     Los Angeles Logan      
## Levels: Leipzig Logan Los Angeles

dat_fact3 <- factor( c(T,F), ordered=F) # Factor data (Categorical)
dat_fact3
## [1] TRUE  FALSE
## Levels: FALSE TRUE

# Explicitly check the data types:
#class(dat_card1)
#class(dat_card2)

Note that for theoretical analysis, the types are sometimes grouped differently as

  • continuous (continuous cardinal data)
  • discrete (discrete cardinal, ordered factor, and unordered factor data)

In any case, data are often computationally analyzed as data.frame objects, discussed below.

Strings.

Note that R allows for unstructured plain text, called character strings, which we can then format as factors

Code
c('A','B','C')  # character strings
## [1] "A" "B" "C"
c('Leipzig','Los Angeles','Logan')  # character strings
## [1] "Leipzig"     "Los Angeles" "Logan"
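For example, we can convert character strings into an ordered factor by specifying the level order explicitly (the Low/Medium/High labels here are just an illustration):

```r
# Convert character strings to an ordered factor
# (the 'Low'/'Medium'/'High' labels are hypothetical survey responses)
s <- c('Low', 'High', 'Medium')
f <- factor(s, levels=c('Low','Medium','High'), ordered=TRUE)
f
## [1] Low    High   Medium
## Levels: Low < Medium < High
```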

Also note that strings are encountered in a variety of settings, and you often have to reformat them after reading them into R.1

Code
# Strings
paste( 'hi', 'mom')
## [1] "hi mom"
paste( c('hi', 'mom'), collapse='--')
## [1] "hi--mom"

kingText <- "The king infringes the law on playing curling."
gsub(pattern="ing", replacement="", kingText)
## [1] "The k infres the law on play curl."
# advanced usage
#gsub("[aeiouy]", "_", kingText)
#gsub("([[:alpha:]]{3})ing\\b", "\\1", kingText) 


3.2 Datasets

Datasets can be stored in a variety of formats on your computer. But they can be analyzed in R in three basic ways.

Lists.

Lists are probably the most basic type.

Code
x <- 1:10
y <- 2*x
list(x, y)  # list of vectors
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##  [1]  2  4  6  8 10 12 14 16 18 20

x_mat1 <- matrix(2:7,2,3)
x_mat2 <- matrix(4:-1,2,3)
list(x_mat1, x_mat2)  # list of matrices
## [[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

Lists are useful for storing unstructured data.

Code
list(list(x_mat1), list(x_mat2))  # list of lists
## [[1]]
## [[1]][[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## 
## [[2]]
## [[2]][[1]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

list(x_mat1, list(x_mat1, x_mat2)) # list of different objects
## [[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]]
## [[2]][[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]][[2]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

# ...inception...
list(x_mat1,
    list(x_mat1, x_mat2), 
    list(x_mat1, list(x_mat2)
    )) 
## [[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]]
## [[2]][[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[2]][[2]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1
## 
## 
## [[3]]
## [[3]][[1]]
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
## 
## [[3]][[2]]
## [[3]][[2]][[1]]
##      [,1] [,2] [,3]
## [1,]    4    2    0
## [2,]    3    1   -1

Data.frames.

A data.frame looks like a matrix, but it is actually a list of equal-length column vectors rather than a matrix of a single data type. This allows you to combine different data types into one object for analysis, which is why it might be your most common object.

Code
# data.frames: your most common data type
    # matrix of different data-types
    # well-ordered lists
data.frame(x, y)  # list of vectors
##     x  y
## 1   1  2
## 2   2  4
## 3   3  6
## 4   4  8
## 5   5 10
## 6   6 12
## 7   7 14
## 8   8 16
## 9   9 18
## 10 10 20

# Storing different types of data
d0 <- data.frame(x=dat_fact2, y=dat_card2)
d0
##             x         y
## 1     Leipzig 1.1000000
## 2 Los Angeles 0.6666667
## 3       Logan 3.0000000

d0[,'y'] #d0$y
## [1] 1.1000000 0.6666667 3.0000000

Arrays.

Arrays are a generalization of matrices to multiple dimensions. They are a very efficient way to store well-formatted numeric data, and are often used in spatial econometrics and time series (often in the form of “data cubes”).

Code
# data square (matrix)
array(data = 1:24, dim = c(3,8))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    4    7   10   13   16   19   22
## [2,]    2    5    8   11   14   17   20   23
## [3,]    3    6    9   12   15   18   21   24

# data cube
a <- array(data = 1:24, dim = c(3, 2, 4))
a
## , , 1
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   11
## [3,]    9   12
## 
## , , 3
## 
##      [,1] [,2]
## [1,]   13   16
## [2,]   14   17
## [3,]   15   18
## 
## , , 4
## 
##      [,1] [,2]
## [1,]   19   22
## [2,]   20   23
## [3,]   21   24
Code
a[1, , , drop = FALSE]  # Row 1
#a[, 1, , drop = FALSE]  # Column 1
#a[, , 1, drop = FALSE]  # Layer 1

a[ 1, 1,  ]  # Row 1, column 1
#a[ 1,  , 1]  # Row 1, "layer" 1
#a[  , 1, 1]  # Column 1, "layer" 1
a[1 , 1, 1]  # Row 1, column 1, "layer" 1

The apply function extends to arrays.

Code
apply(a, 1, mean)    # Row means
## [1] 11.5 12.5 13.5
apply(a, 2, mean)    # Column means
## [1] 11 14
apply(a, 3, mean)    # "Layer" means
## [1]  3.5  9.5 15.5 21.5
apply(a, 1:2, mean)  # Row/Column combination 
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
## [3,]   12   15

Outer products yield arrays.

Code
x <- c(1,2,3)
x_mat1 <- outer(x, x) # x %o% x
x_mat1
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    4    6
## [3,]    3    6    9
is.array(x_mat1) # Matrices are arrays
## [1] TRUE

x_mat2 <- matrix(6:1,2,3)
outer(x_mat2, x)
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    6    4    2
## [2,]    5    3    1
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   12    8    4
## [2,]   10    6    2
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   18   12    6
## [2,]   15    9    3
# outer(x_mat2, matrix(x))
# outer(x_mat2, t(x))
# outer(x_mat1, x_mat2)

3.3 Densities and Distributions

Initial Data Inspection.

Regardless of the data types you have, you typically begin by inspecting the first few observations.

Consider, for example, historical data on crime in the US.

Code
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

# Check NA values
X <- c(1,NA,2,3)
sum(is.na(X))
## [1] 1

To further examine a particular variable, we look at its distribution. In what follows, we will often work with data as a vector \(X=(X_{1}, X_{2}, ..., X_{N})\), where there are \(N\) observations and \(X_{i}\) is the value of the \(i\)th one.

Histogram Density Estimate.

The histogram divides the range of the data into \(L\) exclusive bins of equal width \(h\) and counts the number of observations within each bin. We often rescale the counts so that the total area of all bins sums to one, which allows us to interpret the numbers as a density measuring the proportion of the data in each bin. Mathematically, for an exclusive bin \(\left(x-\frac{h}{2}, x+\frac{h}{2} \right]\) defined by its midpoint \(x\) and width \(h\), we compute \[\begin{eqnarray} \widehat{f}_{HIST}(x) &=& \frac{ \sum_{i=1}^{N} \mathbf{1}\left( X_{i} \in \left(x-\frac{h}{2}, x+\frac{h}{2} \right] \right) }{N h}. \end{eqnarray}\] Note that \(\mathbf{1}()\) is an indicator function, which equals \(1\) if the expression inside is TRUE and \(0\) otherwise. I.e., if \(x-\frac{h}{2} < X_{i} \leq x+\frac{h}{2}\), then \(\mathbf{1}\left( X_{i} \in \left(x-\frac{h}{2}, x+\frac{h}{2} \right] \right) =1\). Also note that we compute \(\widehat{f}_{HIST}(x)\) for each bin midpoint \(x\).2

For example, let \(X=(3,3.1,0.02)\) and use bins \((0,1], (1,2], (2,3], (3,4]\). In this case, the midpoints are \(x=(0.5,1.5,2.5,3.5)\) and \(h=1\). The counts at each midpoint are \((1,0,1,1)\), since \(0.02 \in (0,1]\), \(3 \in (2,3]\), and \(3.1 \in (3,4]\). Since \(\frac{1}{Nh}=1/3\), we have \(\widehat{f}(x)=(1/3,0,1/3,1/3)\). Now intuitively work through an example with three bins instead of four.

Code
# Intuitive Examples
X <- c(3,3.1,0.02)
hist(X, breaks=c(0,1,2,3,4), plot=F)

hist(X, breaks=c(0,4/3,8/3,4), plot=F)

# as a default, R uses bins (,] instead of [,)
# but you can change that 
hist(X, breaks=c(0,4/3,8/3,4), plot=F, right=F)
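We can verify the worked example above by extracting the midpoints, counts, and densities directly from the object that hist returns:

```r
# Verify the worked example: midpoints, counts, and densities
h <- hist(c(3, 3.1, 0.02), breaks=c(0,1,2,3,4), plot=FALSE)
h$mids     # bin midpoints
## [1] 0.5 1.5 2.5 3.5
h$counts   # observations per bin
## [1] 1 0 1 1
h$density  # counts/(N*h)
## [1] 0.3333333 0.0000000 0.3333333 0.3333333
```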
Code
# Practical Example
hist(USArrests[,'Murder'], freq=F, breaks=20,
    border=NA, 
    main='',
    xlab='Murder Arrests',
    ylab='Proportion of States in each bin')
# Raw Observations
rug(USArrests[,'Murder'], col=grey(0,.5))

Note that if your data are factor data, or discrete cardinal data, you can directly plot the counts or proportions: for each unique outcome \(x\) we compute \(\widehat{p}_{x}=\sum_{i=1}^{N}\mathbf{1}\left(X_{i}=x\right)/N\).

Code
# Discretized data
xr <- floor(USArrests[,'Murder']) #rounded down
#table(xr)
proportions <- table(xr)/length(xr)
plot(proportions, col=grey(0,.5),
    xlab='Murder Rate (Discretized)',
    ylab='Proportion of States with each value')

Empirical Cumulative Distribution Function.

The ECDF counts the proportion of observations whose values are less than or equal to \(x\); \[\begin{eqnarray} \widehat{F}_{ECDF}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(X_{i} \leq x). \end{eqnarray}\] Typically, we compute this for each unique value of \(x\) in the dataset, but sometimes other values of \(x\) too.

For example, let \(X=(3,3.1,0.02)\) and consider the points \(x=(0.5,1.5,2.5,3.5)\). Then the counts are \((1,1,1,3)\). Since \(N=3\), \(\widehat{F}(x)=(1/3,1/3,1/3,1)\).
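The worked example can be checked directly with the ecdf function:

```r
# Verify the worked ECDF example
F_X <- ecdf(c(3, 3.1, 0.02))
F_X(c(0.5, 1.5, 2.5, 3.5))
## [1] 0.3333333 0.3333333 0.3333333 1.0000000
```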

Code
F_murder <- ecdf(USArrests[,'Murder'])
# proportion of murders <= 10
F_murder(10)
## [1] 0.7
# proportion of murders <= x, for all x
plot(F_murder, main='', xlab='Murder Arrests',
    pch=16, col=grey(0,.5))
rug(USArrests[,'Murder'])

Boxplots.

Boxplots summarize the distribution of data using quantiles: the \(q\)th quantile is the value below which a proportion \(q\) of the data fall and above which a proportion \(1-q\) fall.

  • The median is the point where half of the data has lower values and the other half has higher values.
  • The lower quartile is the point where \(25\%\) of the data has lower values and the other \(75\%\) has higher values.
  • The min is the smallest value (the most negative value if any are negative), where \(0\%\) of the data has lower values.

For example, if \(X=(0,0,0.02,3,5)\) then the median is \(0.02\), the lower quartile is \(0\), and the upper quartile is \(3\). (The number \(0\) is also special: the most frequent observation is called the mode.) Now work through an intuitive example with \(N=24\) data points (hint: split the ordered observations into groups of six).
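The worked example can be verified directly, since quantile reports the min, quartiles, and max by default:

```r
# Verify the worked quantile example
X <- c(0, 0, 0.02, 3, 5)
quantile(X)
##   0%  25%  50%  75% 100% 
## 0.00 0.00 0.02 3.00 5.00
```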

Code
X <-  c(3.1, 3, 0.02)
quantile(X, probs=c(0,.5,1))
##   0%  50% 100% 
## 0.02 3.00 3.10

# quantiles
X <- USArrests[,'Murder']
quantile(X)
##     0%    25%    50%    75%   100% 
##  0.800  4.075  7.250 11.250 17.400

# deciles are quantiles
quantile(X, probs=seq(0,1, by=.1))
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##  0.80  2.56  3.38  4.75  6.00  7.25  8.62 10.12 12.12 13.32 17.40

To actually calculate quantiles, we sort the observations from smallest to largest as \(X_{(1)}, X_{(2)}, ..., X_{(N)}\), and then compute the \(q\)th quantile as \(X_{(qN)}\). Note that \(qN\) is rounded to an integer, and there are different ways to break ties.

Code
X <- USArrests[,'Murder']
Xo <- sort(X)
Xo
##  [1]  0.8  2.1  2.1  2.2  2.2  2.6  2.6  2.7  3.2  3.3  3.4  3.8  4.0  4.3  4.4
## [16]  4.9  5.3  5.7  5.9  6.0  6.0  6.3  6.6  6.8  7.2  7.3  7.4  7.9  8.1  8.5
## [31]  8.8  9.0  9.0  9.7 10.0 10.4 11.1 11.3 11.4 12.1 12.2 12.7 13.0 13.2 13.2
## [46] 14.4 15.4 15.4 16.1 17.4

# median
Xo[length(Xo)*.5]
## [1] 7.2
quantile(X, probs=.5, type=4)
## 50% 
## 7.2

# min
Xo[1]
## [1] 0.8
min(Xo)
## [1] 0.8
quantile(Xo, probs=0)
##  0% 
## 0.8

The boxplot shows the median (solid black line) and interquartile range (\(IQR=\) upper quartile \(-\) lower quartile; filled box).3 By default, the whiskers extend to the most extreme data points within \(1.5\times IQR\) of the box, and values beyond that are highlighted as outliers, so the whiskers do not typically show the full range of the data. You can alternatively show all the raw data points instead of whiskers and outliers.

Code
boxplot(USArrests[,'Murder'],
    main='', ylab='Murder Arrests',
    whisklty=0, staplelty=0, outline=F)
# Raw Observations
stripchart(USArrests[,'Murder'],
    pch='-', col=grey(0,.5), cex=2,
    vert=T, add=T)
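The \(1.5\times IQR\) rule can also be computed by hand. Here we use quantile, which differs slightly from the hinge calculation that boxplot uses (see footnote 3), but the conclusion is the same: no murder-arrest values fall beyond the fences, so the boxplot flags no outliers.

```r
# Compute the outlier fences implied by the 1.5*IQR rule
X <- USArrests[,'Murder']
q <- unname(quantile(X, probs=c(.25, .75)))
iqr <- q[2] - q[1]
fences <- c(lower = q[1] - 1.5*iqr, upper = q[2] + 1.5*iqr)
fences
##   lower   upper 
## -6.6875 22.0125
X[X < fences['lower'] | X > fences['upper']]  # values flagged as outliers
## numeric(0)
```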


  1. We will not cover the statistical analysis of text in this course, but strings are amenable to statistical analysis.↩︎

  2. If the bins exactly span the range, then \(h=[\text{max}(X_{i}) - \text{min}(X_{i})]/L\) and \(x\in \left\{ \left(\ell - \frac{1}{2}\right) h + \text{min}(X_{i}) \right\}_{\ell=1}^{L}\).↩︎

  3. Technically, the upper and lower hinges use two different versions of the first and third quartile. See https://stackoverflow.com/questions/40634693/lower-and-upper-quartiles-in-boxplot-in-r.↩︎