The two basic types of data are cardinal and factor data. We can further distinguish whether cardinal data are discrete or continuous, and whether factor data are ordered or unordered.
Cardinal: the difference between elements always means the same thing.
Discrete: E.g., 2-1=3-2.
Continuous: E.g., 2.11-1.4444=3.11-2.4444
Factor: the difference between elements does not always mean the same thing.
Ordered: E.g., First place - Second place ?? Second place - Third place.
Unordered (categorical): E.g., A - B ????
Here are some examples.
```r
d1d <- 1:3 # Cardinal data (Discrete)
d1d
## [1] 1 2 3
#class(d1d)

d1c <- c(1.1, 2/3, 3) # Cardinal data (Continuous)
d1c
## [1] 1.1000000 0.6666667 3.0000000
#class(d1c)

d2o <- factor(c('A','B','C'), ordered=TRUE) # Factor data (Ordinal)
d2o
## [1] A B C
## Levels: A < B < C
#class(d2o)

d2c <- factor(c('Leipzig','Los Angeles','Logan'), ordered=FALSE) # Factor data (Categorical)
d2c
## [1] Leipzig     Los Angeles Logan
## Levels: Leipzig Logan Los Angeles
#class(d2c)
```
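To check how R itself treats each type, here is a minimal sketch. It recreates the four example objects so that it runs on its own, then queries `class()` and tries an order comparison, which is only meaningful for ordered data.

```r
# Recreate the four example objects so this block is self-contained
d1d <- 1:3
d1c <- c(1.1, 2/3, 3)
d2o <- factor(c('A','B','C'), ordered=TRUE)
d2c <- factor(c('Leipzig','Los Angeles','Logan'), ordered=FALSE)

# Storage classes reported by R
class(d1d) # "integer"
class(d1c) # "numeric"
class(d2o) # "ordered" "factor"
class(d2c) # "factor"

# Differences are meaningful for cardinal data
diff(d1d) # 1 1

# Order comparisons work for ordered factors...
d2o[1] < d2o[2] # TRUE
# ...but are not meaningful for unordered factors (R returns NA with a warning)
suppressWarnings(d2c[1] < d2c[2])
```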
Note that for theoretical analysis, the types are sometimes grouped differently as
continuous (continuous cardinal data)
discrete (discrete cardinal, ordered factor, and unordered factor data)
Other Types.
R also allows for more unstructured data types, such as strings and lists. You often combine all of the different data types into a single dataset called a data.frame.
```r
c('hello world', 'hi mom') # character strings
## [1] "hello world" "hi mom"

list(d1c, d2c) # lists
## [[1]]
## [1] 1.1000000 0.6666667 3.0000000
## 
## [[2]]
## [1] Leipzig     Los Angeles Logan
## Levels: Leipzig Logan Los Angeles

# data.frames: your most common data type
# matrix of different data-types
# well-ordered lists
d0 <- data.frame(y=d1c, x=d2c)
d0
##           y           x
## 1 1.1000000     Leipzig
## 2 0.6666667 Los Angeles
## 3 3.0000000       Logan
```
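Once the columns are combined, you typically inspect and subset the data.frame. The sketch below rebuilds the same small dataset from scratch (so it does not depend on earlier objects) and shows a few common base R operations.

```r
# Rebuild the example data.frame so this block is self-contained
d0 <- data.frame(
    y=c(1.1, 2/3, 3),
    x=factor(c('Leipzig','Los Angeles','Logan')))

str(d0)           # compact summary of each column's type
sapply(d0, class) # class of every column
d0$y              # extract one column as a vector
d0[d0$y > 1, ]    # keep only the rows where y exceeds 1
```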
Note that strings are encountered in a variety of settings, and you often have to format them after reading them into R.¹
```r
# Strings
paste('hi', 'mom')
## [1] "hi mom"
paste(c('hi', 'mom'), collapse='--')
## [1] "hi--mom"

list(d1c, c('hello world'),
    list(d1d, list('...inception...'))) # lists
## [[1]]
## [1] 1.1000000 0.6666667 3.0000000
## 
## [[2]]
## [1] "hello world"
## 
## [[3]]
## [[3]][[1]]
## [1] 1 2 3
## 
## [[3]][[2]]
## [[3]][[2]][[1]]
## [1] "...inception..."

kingText <- "The king infringes the law on playing curling."
gsub(pattern="ing", replacement="", kingText)
## [1] "The k infres the law on play curl."
# advanced usage
#gsub("[aeiouy]", "_", kingText)
#gsub("([[:alpha:]]{3})ing\\b", "\\1", kingText)
```
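As a small illustration of formatting strings after import, the sketch below cleans a hypothetical vector of messy city names (the `raw` values are made up for this example) using only base R functions.

```r
# Hypothetical raw strings, as they might look after reading a file
raw <- c('  Los Angeles ', 'leipzig', 'LOGAN  ')

clean <- trimws(raw)                       # drop leading/trailing whitespace
clean <- tolower(clean)                    # use a consistent case
clean <- gsub(' ', '_', clean, fixed=TRUE) # replace inner spaces
clean
## [1] "los_angeles" "leipzig"     "logan"
```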
To further examine a particular variable, we look at its distribution. In what follows, we will denote the data for a single variable as \(\{X_{i}\}_{i=1}^{N}\), where there are \(N\) observations and \(X_{i}\) is the value of the \(i\)th one.
Histogram Density Estimate.
The histogram divides the range of \(\{X_{i}\}_{i=1}^{N}\) into \(L\) mutually exclusive bins of equal width \(h=[\text{max}(X_{i}) - \text{min}(X_{i})]/L\), and counts the number of observations within each bin. We often scale the counts to interpret the numbers as a density. Mathematically, for a bin with midpoint \(x\), we compute \[\begin{eqnarray}
\widehat{f}_{HIST}(x) &=& \frac{ \sum_{i=1}^{N} \mathbf{1}\left( X_{i} \in \left[x-\frac{h}{2}, x+\frac{h}{2} \right) \right) }{N h}.
\end{eqnarray}\] We compute \(\widehat{f}_{HIST}(x)\) for each midpoint \(x \in \left\{ \text{min}(X_{i}) + \left(\ell - \frac{1}{2}\right) h \right\}_{\ell=1}^{L}\).
```r
hist(USArrests$Murder, freq=FALSE,
    border=NA, main='', xlab='Murder Arrests')
# Raw Observations
rug(USArrests$Murder, col=grey(0,.5))
```
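To connect the histogram formula to the plot, the following sketch computes the density estimate by hand and compares it with `hist()` on the same bins. The choice of `L <- 5` bins is arbitrary, and the last bin is treated as closed on the right so that the maximum is counted.

```r
X <- USArrests$Murder
N <- length(X)
L <- 5 # number of bins (an arbitrary choice for illustration)

# Bin edges, the common bin width h, and the bin midpoints
breaks <- seq(min(X), max(X), length.out=L+1)
h <- (max(X) - min(X))/L
mids <- min(X) + ((1:L) - 0.5)*h

# Count observations per bin [lower, upper); the last bin also includes max(X)
bin <- findInterval(X, breaks, rightmost.closed=TRUE)
counts <- tabulate(bin, nbins=L)

# Scale the counts so that the bars integrate to one
f_hist <- counts/(N*h)
sum(f_hist*h)

# Compare with R's built-in computation on the same bins
H <- hist(X, breaks=breaks, right=FALSE, include.lowest=TRUE, plot=FALSE)
all.equal(f_hist, H$density)
```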
Note that if your data are factor data, or discrete cardinal data, you can directly plot the counts.
```r
x <- floor(USArrests$Murder) # Discretized
plot(table(x), xlab='Murder Rate (Discrete)', ylab='Count')
```
Empirical Cumulative Distribution Function.
The ECDF counts the proportion of observations whose values \(X_{i}\) are less than or equal to \(x\); \[\begin{eqnarray}
\widehat{F}_{ECDF}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(X_{i} \leq x)
\end{eqnarray}\] for each unique value of \(x\) in the dataset.
```r
F_murder <- ecdf(USArrests$Murder)
# proportion of murders <= 10
F_murder(10)
## [1] 0.7

# proportion of murders <= x, for all x
plot(F_murder, main='', xlab='Murder Arrests',
    pch=16, col=grey(0,.5))
```
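The ECDF formula is simple enough to compute directly. As a sketch, the helper `ecdf_at()` below (a name introduced here for illustration) takes the share of observations at or below a point, and the result is compared with R's built-in `ecdf()`.

```r
X <- USArrests$Murder

# ECDF at a single point: share of observations <= x
ecdf_at <- function(x, data) mean(data <= x)
ecdf_at(10, X)

# Same calculation at every unique observed value
xs <- sort(unique(X))
F_manual <- sapply(xs, ecdf_at, data=X)

# Compare with R's built-in step function
F_murder <- ecdf(X)
all.equal(F_manual, F_murder(xs))
```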
Boxplots.
Boxplots summarize the distribution of data using quantiles: the \(q\)th quantile is the value where a fraction \(q\) of the data are below and a fraction \(1-q\) are above.
The “median” is the point where half of the data has lower values and the other half has higher values.
The “lower quartile” is the point where 25% of the data has lower values and the other 75% has higher values.
The “min” is the smallest value (the most negative value, if any values are negative), where 0% of the data has lower values.
To compute quantiles, we sort the observations from smallest to largest as \(X_{(1)}, X_{(2)}, ..., X_{(N)}\), and then compute quantiles as \(X_{ (q*N) }\). Note that \((q*N)\) is rounded and there are different ways to break ties.
The boxplot shows the median (solid black line) and interquartile range (\(IQR=\) upper quartile \(-\) lower quartile; filled box),² as well as extreme values, which are drawn as outliers when they lie more than \(1.5\times IQR\) beyond the box (points beyond whiskers).
```r
boxplot(USArrests$Murder, main='', ylab='Murder Arrests')
# Raw Observations
stripchart(USArrests$Murder,
    pch='-', col=grey(0,.5), cex=2,
    vertical=TRUE, add=TRUE)
```
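The sorting rule for quantiles can be checked numerically. In the sketch below, rounding \(q*N\) up is one possible convention (an assumption for this example); for these data and probabilities it agrees with `quantile(..., type=1)`, whereas R's default `type=7` interpolates between neighbouring sorted values. `boxplot.stats()` returns the numbers that `boxplot()` draws.

```r
X <- USArrests$Murder
N <- length(X)
X_sorted <- sort(X)

# q-th quantile as the (q*N)-th sorted value, rounding q*N up
q <- c(.25, .5, .75)
X_sorted[ceiling(q*N)]

# quantile() offers several rounding/interpolation rules via 'type'
quantile(X, probs=q, type=1) # matches the rule above for these inputs
quantile(X, probs=q)         # R's default (type=7) interpolates

# Numbers drawn by boxplot()
B <- boxplot.stats(X)
B$stats # lower whisker, lower hinge, median, upper hinge, upper whisker
B$out   # points beyond 1.5*IQR from the box, drawn individually
```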
3.3 Joint Distributions
Scatterplots are used frequently to summarize the joint relationship between two variables. They can be enhanced in several ways. As a default, use semi-transparent points so as not to hide any points (and perhaps see if your observations are concentrated anywhere).
You can also add regression lines (and confidence intervals), although I will defer this until later.
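As a minimal sketch of the default suggested above, the following scatterplot uses semi-transparent points, so overlapping observations appear darker. The axis labels are my own wording, based on the variable descriptions in the `USArrests` help page.

```r
# Semi-transparent black: overlapping points appear darker
pt_col <- grey(0, alpha=.5)
plot(Murder~UrbanPop, USArrests,
    pch=16, col=pt_col,
    xlab='Urban Population (%)', ylab='Murder Arrests')
```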
It is easy to show how distributions change according to a third variable using data splits. E.g.,
```r
# Tailored Histogram
ylim <- c(0,8)
xbks <- seq(min(USArrests$Murder)-1, max(USArrests$Murder)+1, by=1)

# Also show more information
# Split Data by Urban Population above/below mean
pop_mean <- mean(USArrests$UrbanPop)
murder_lowpop <- USArrests[USArrests$UrbanPop < pop_mean,'Murder']
murder_highpop <- USArrests[USArrests$UrbanPop >= pop_mean,'Murder']
cols <- c(low=rgb(0,0,1,.75), high=rgb(1,0,0,.75))

par(mfrow=c(1,2))
# Panel titles match the subset shown in each panel
hist(murder_lowpop,
    breaks=xbks, col=cols[1],
    main='Urban Pop < Mean', font.main=1,
    xlab='Murder Arrests',
    border=NA, ylim=ylim)

hist(murder_highpop,
    breaks=xbks, col=cols[2],
    main='Urban Pop >= Mean', font.main=1,
    xlab='Murder Arrests',
    border=NA, ylim=ylim)
```
It is sometimes preferable to show the ECDF instead. You can also glue various combinations together to convey more information all at once.
```r
par(mfrow=c(1,2))

# Full Sample Density
hist(USArrests$Murder,
    main='Density Function Estimate', font.main=1,
    xlab='Murder Arrests',
    breaks=xbks, freq=FALSE, border=NA)

# Split Sample Distribution Comparison
F_lowpop <- ecdf(murder_lowpop)
plot(F_lowpop, col=cols[1],
    pch=16, xlab='Murder Arrests',
    main='Distribution Function Estimates',
    font.main=1, bty='n')
F_highpop <- ecdf(murder_highpop)
plot(F_highpop, add=TRUE, col=cols[2], pch=16)

# The low group is strictly below the mean
legend('bottomright', col=cols,
    pch=16, bty='n', inset=c(0,.1),
    title='% Urban Pop.',
    legend=c('Low (< Mean)','High (>= Mean)'))
```
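The two ECDFs can also be compared numerically rather than visually. The sketch below rebuilds the split so that it runs on its own, evaluates both functions at 10, and reports the largest vertical gap between them. The cutoff of 10 is just an illustrative value.

```r
# Rebuild the data split so this block is self-contained
pop_mean <- mean(USArrests$UrbanPop)
murder_lowpop <- USArrests[USArrests$UrbanPop < pop_mean, 'Murder']
murder_highpop <- USArrests[USArrests$UrbanPop >= pop_mean, 'Murder']

F_lowpop <- ecdf(murder_lowpop)
F_highpop <- ecdf(murder_highpop)

# Share of states in each group with a murder arrest rate of at most 10
c(low=F_lowpop(10), high=F_highpop(10))

# Largest vertical gap between the two step functions
xs <- sort(unique(USArrests$Murder))
gap <- max(abs(F_lowpop(xs) - F_highpop(xs)))
gap
```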
You can also split data into grouped boxplots in the same way.
```r
layout( t(c(1,2,2)))
boxplot(USArrests$Murder, main='',
    xlab='All Data', ylab='Murder Arrests')

# K Groups with even spacing
K <- 3
USArrests$UrbanPop_Kcut <- cut(USArrests$UrbanPop,K)
Kcols <- hcl.colors(K,alpha=.5)
boxplot(Murder~UrbanPop_Kcut, USArrests,
    main='', col=Kcols,
    xlab='Urban Population', ylab='')
```
```r
# 4 Groups with equal numbers of observations
#Qcuts <- c(
#    '0%'=min(USArrests$UrbanPop)-10*.Machine$double.eps,
#    quantile(USArrests$UrbanPop, probs=c(.25,.5,.75,1)))
#USArrests$UrbanPop_cut <- cut(USArrests$UrbanPop, Qcuts)
#boxplot(Murder~UrbanPop_cut, USArrests, col=hcl.colors(4,alpha=.5))
```
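A runnable variant of the quantile-based split is sketched below. Instead of nudging the lowest break point down by hand, it relies on `include.lowest=TRUE` in `cut()`, which is one possible way to keep the minimum in the first group. The object names are my own.

```r
# Break points at the quartiles of Urban Population
Qcuts <- quantile(USArrests$UrbanPop, probs=seq(0, 1, by=.25))

# include.lowest=TRUE keeps the minimum inside the first group
UrbanPop_Qcut <- cut(USArrests$UrbanPop, Qcuts, include.lowest=TRUE)
table(UrbanPop_Qcut) # group sizes (ties can make them slightly unequal)

boxplot(USArrests$Murder~UrbanPop_Qcut,
    col=hcl.colors(4, alpha=.5),
    xlab='Urban Population', ylab='Murder Arrests')
```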
You can also use size, color, and shape to further distinguish different conditional relationships.
```r
# High Assault Areas
assault_high <- USArrests$Assault > median(USArrests$Assault)
cols <- ifelse(assault_high, rgb(1,0,0,.5), rgb(0,0,1,.5))

# Scatterplot
# Show High Assault Areas via 'cex=' or 'pch='
# Could further add regression lines for each data split
plot(Murder~UrbanPop, USArrests, pch=16, col=cols)
```
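To illustrate the size and shape options as well, this sketch marks the high-assault states with larger triangles and adds a legend. The particular symbols, sizes, and legend labels are arbitrary choices for the example.

```r
assault_high <- USArrests$Assault > median(USArrests$Assault)
pt_pch <- ifelse(assault_high, 17, 16) # triangles vs circles
pt_cex <- ifelse(assault_high, 1.5, 1) # larger vs regular points
pt_col <- ifelse(assault_high, rgb(1,0,0,.5), rgb(0,0,1,.5))

plot(Murder~UrbanPop, USArrests,
    pch=pt_pch, cex=pt_cex, col=pt_col)
legend('topleft', bty='n', title='Assault Arrests',
    pch=c(17,16), pt.cex=c(1.5,1),
    col=c(rgb(1,0,0,.5), rgb(0,0,1,.5)),
    legend=c('Above median','At or below median'))
```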
¹ We will not cover the statistical analysis of text in this course, but strings are amenable to statistical analysis.
² Technically, the upper and lower “hinges” use two different versions of the first and third quartile. See https://stackoverflow.com/questions/40634693/lower-and-upper-quartiles-in-boxplot-in-r