8 Data Analysis – Introductory Economic Statistics: A Data-Driven Approach in R

8.1 Beyond Basics

Use expansion “packages” for less common procedures and more functionality

CRAN.

Most packages can be found on CRAN and can be easily installed

Code

# commonly used packages
install.packages("stargazer")
install.packages("data.table")
install.packages("plotly")
# other statistical packages
install.packages("extraDistr")
install.packages("twosamples")
# install.packages("purrr")
# install.packages("reshape2")
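If you rerun a script often, you may prefer to install only the packages that are missing. The following is a minimal sketch of that idea. The names `wanted` and `missing_pkgs` are made up for illustration, and the two base packages listed are placeholders so that the sketch runs offline; swap in names such as "stargazer" for real use.

```r
# Packages this script needs (placeholders: both ship with R)
wanted <- c("stats", "utils")

# Check which ones can be loaded from the current libraries
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- wanted[!is_installed]

# Install only what is missing (does nothing if everything is present)
if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
}
```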

The most common tasks also have cheatsheets you can use.

For example, to generate ‘exotic’ probability distributions

Code

library(extraDistr)

par(mfrow=c(1,2))
for(p in c(-.5,0)){
    x <- rgev(2000, mu=0, sigma=1, xi=p)
    hist(x, breaks=50, border=NA, main=NA, freq=F)
}
title('GEV densities', outer=T, line=-1)

Code

library(extraDistr)

par(mfrow=c(1,3))
for(p in c(-1, 0,2)){
    x <- rtlambda(2000, p)
    hist(x, breaks=100, border=NA, main=NA, freq=F)
}
title('Tukey-Lambda densities', outer=T, line=-1)
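To see what such a package does for you, here is a rough base R sketch that simulates from the GEV distribution by inverse transform sampling. It assumes the usual location-scale-shape parametrization (mu, sigma, xi); the function name `qgev_sketch` is hypothetical and is not part of extraDistr.

```r
# Hypothetical helper: GEV quantile function under the standard parametrization
qgev_sketch <- function(p, mu = 0, sigma = 1, xi = 0) {
    if (xi == 0) {
        # Gumbel case
        mu - sigma * log(-log(p))
    } else {
        mu + sigma * ((-log(p))^(-xi) - 1) / xi
    }
}

# Inverse transform: plug uniform draws into the quantile function
set.seed(1)
u <- runif(2000)
x_gumbel <- qgev_sketch(u, xi = 0)
x_bounded <- qgev_sketch(u, xi = -0.5) # bounded above by mu - sigma/xi = 2

par(mfrow = c(1, 2))
hist(x_bounded, breaks = 50, border = NA, main = NA, freq = FALSE)
hist(x_gumbel, breaks = 50, border = NA, main = NA, freq = FALSE)
```

In practice the packaged functions are the safer choice, since they also handle edge cases and argument checking.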

GitHub.

Sometimes you will want to install a package from GitHub. For this, you can use devtools or its lightweight version, remotes.

Code

install.packages("devtools")
install.packages("remotes")

Note that to install devtools, you also need to have developer tools installed on your computer.

Windows: Rtools
Mac: Xcode

To color terminal output on Linux systems, you can use the colorout package

Code

library(remotes)
# Install https://github.com/jalvesaq/colorout
# to .libPaths()[1]
install_github('jalvesaq/colorout')
library(colorout)

Base.

While additional packages can make your code faster, they also create dependencies that can lead to problems. So learn base R well before becoming dependent on other packages.

Updating.

Make sure R and your packages are up to date. The current version of R and any packages used can be found (and recorded) with

Code

sessionInfo()
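To record this information alongside your results, you can write it to a text file. A minimal sketch, where the file name and the use of a temporary folder are assumptions made for illustration:

```r
# Save the session details to a text file
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# Individual pieces can also be queried directly
R.version.string
packageVersion("stats")
```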

To update your R packages, use

Code

update.packages()

Rare Tricks.

Note that after updating R, you can update all packages stored in all .libPaths() with the following command

Code

update.packages(checkBuilt=T, ask=F)
# install.packages(old.packages(checkBuilt=T)[,"Package"])

Sometimes there is a problem. To find specific broken packages after an update

Code

library(purrr)

set_names(.libPaths()) %>%
  map(function(lib) {
    .packages(all.available = TRUE, lib.loc = lib) %>%
        keep(function(pkg) {
            f <- system.file('Meta', 'package.rds', package = pkg, lib.loc = lib)
            tryCatch({readRDS(f); FALSE}, error = function(e) TRUE)
        })
  })
# https://stackoverflow.com/questions/31935516/installing-r-packages-error-in-readrdsfile-error-reading-from-connection/55997765

To remove packages duplicated in multiple libraries

Code

# Libraries
i <- installed.packages()
libs <- .libPaths()
# Find Duplicated Packages
i1 <- i[ i[,'LibPath']==libs[1], , drop=FALSE]
i2 <- i[ i[,'LibPath']==libs[2], , drop=FALSE]
dups <- i2[,'Package'] %in% i1[,'Package']
all( dups )
# Remove only the duplicated copies from the second library
remove.packages(  i2[dups,'Package'], libs[2] )

8.2 Inputs

Reading Data.

The first step in data analysis is getting data into R. There are many ways to do this, depending on your data structure. Perhaps the most common case is reading in a csv file.

Code

# Read in csv (downloaded from online)
# download source 'http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF19-3.csv'
# download destination '~/TableF19-3.csv'
dat_csv <- read.csv('~/TableF19-3.csv')
 
# Can read in csv (directly from online)
# dat_csv <- read.csv('http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF19-3.csv')
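If you do not have that file at hand, you can practice the same step with a small made-up dataset. The sketch below writes a toy csv file to a temporary folder and reads it back; the file name and the columns are invented for illustration.

```r
# Create a toy csv file
csv_file <- file.path(tempdir(), "toy_data.csv")
toy <- data.frame(
    id = 1:3,
    wage = c(10.5, 12, 9.75),
    region = c("north", "south", "north"))
write.csv(toy, csv_file, row.names = FALSE)

# Read it back in and inspect the structure
dat_toy <- read.csv(csv_file)
str(dat_toy)
```

Inspecting the result with str() right after reading is a quick way to catch columns that were read as the wrong type.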

Reading in other types of data can require the use of “packages”. For example, the “wooldridge” package contains datasets on crime. To use this data, we must first install the package on our computer. Then, each time we want to access the data, we must load the package.

Code

# Install R Data Package and Load in
install.packages('wooldridge') # only once
library(wooldridge) # anytime you want to use the data

data('crime2') 
data('crime4')

We can use packages to access many different types of data. To read in a Stata data file, for example, we can use the “haven” package.

Code

# Read in stata data file from online
#library(haven)
#dat_stata <- read_dta('https://www.ssc.wisc.edu/~bhansen/econometrics/DS2004.dta')
#dat_stata <- as.data.frame(dat_stata)

# For More Introductory Econometrics Data, see 
# https://www.ssc.wisc.edu/~bhansen/econometrics/Econometrics%20Data.zip
# https://pages.stern.nyu.edu/~wgreene/Text/Edition7/tablelist8new.htm
# R packages: wooldridge, causaldata, Ecdat, AER, ....

Cleaning Data.

Data transformation is often necessary before analysis, so remember to be careful and check your code is doing what you want. (If you have large datasets, you can always test out the code on a sample.)
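As a rough sketch of testing on a sample: draw a few rows at random, run the transformation on them, and check the result before running it on everything. The dataset `big` and the function `clean` are hypothetical stand-ins for your own data and cleaning code.

```r
# A large made-up dataset
set.seed(1)
big <- data.frame(ID = seq(1e5), x = runif(1e5))

# Hypothetical cleaning step to be tested
clean <- function(d) {
    d$x_sq <- d$x^2
    return(d)
}

# Try the code on a random sample of 100 rows first
idx <- sample(nrow(big), 100)
small <- big[idx, ]
small_clean <- clean(small)

# Simple checks that the output looks as intended
stopifnot(nrow(small_clean) == nrow(small))
stopifnot(all(small_clean$x_sq >= 0))
```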

Code

# Function to Create Sample Datasets
make_noisy_data <- function(n, b=0){
    # Simple Data Generating Process
    x <- seq(1,10, length.out=n) 
    e <- rnorm(n, mean=0, sd=10)
    y <- b*x + e 
    # Observations
    xy_mat <- data.frame(ID=seq(x), x=x, y=y)
    return(xy_mat)
}

# Two simulated datasets
dat1 <- make_noisy_data(6)
dat2 <- make_noisy_data(6)

# Merging data in long format
dat_merged_long <- rbind(
    cbind(dat1,DF=1),
    cbind(dat2,DF=2))

Now suppose we want to transform into wide format

Code

# Merging data in wide format, First Attempt
dat_merged_wide <- cbind( dat1, dat2)
names(dat_merged_wide) <- c(paste0(names(dat1),'.1'), paste0(names(dat2),'.2'))

# Merging data in wide format, Second Attempt
# higher performance
dat_merged_wide2 <- merge(dat1, dat2,
    by='ID', suffixes=c('.1','.2'))
## CHECK they are the same.
identical(dat_merged_wide, dat_merged_wide2)
## [1] FALSE
# Inspect any differences

# Merging data in wide format, Third Attempt with dedicated package
# (highest performance but with new type of object)
library(data.table)
dat_merged_longDT <- as.data.table(dat_merged_long)
dat_melted <- melt(dat_merged_longDT, id.vars=c('ID', 'DF'))
dat_merged_wide3 <- dcast(dat_melted, ID~DF+variable)

## CHECK they are the same.
identical(dat_merged_wide, dat_merged_wide3)
## [1] FALSE
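When identical() returns FALSE, it helps to find out where the objects differ. The sketch below rebuilds the first two attempts on a small made-up dataset (the names `wide_a` and `wide_b` are for illustration only) and compares the column names and then the shared columns.

```r
# Small made-up datasets
set.seed(1)
dat_a <- data.frame(ID = 1:3, x = c(1, 5.5, 10), y = rnorm(3))
dat_b <- data.frame(ID = 1:3, x = c(1, 5.5, 10), y = rnorm(3))

# First attempt: bind columns and rename
wide_a <- cbind(dat_a, dat_b)
names(wide_a) <- c(paste0(names(dat_a), ".1"), paste0(names(dat_b), ".2"))

# Second attempt: merge on ID
wide_b <- merge(dat_a, dat_b, by = "ID", suffixes = c(".1", ".2"))

# Which columns appear in only one of the two?
setdiff(names(wide_a), names(wide_b)) # "ID.1" "ID.2"
setdiff(names(wide_b), names(wide_a)) # "ID"

# Are the shared columns the same?
shared <- intersect(names(wide_a), names(wide_b))
same_values <- mapply(identical, wide_a[shared], wide_b[shared])
same_values
```

Here the values agree and only the handling of the ID column differs, so the two results carry the same information.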

Often, however, we ultimately want data in long format

Code

# Merging data in long format, Second Attempt with dedicated package 
dat_melted2 <- melt(dat_merged_wide3, measure=c("1_x","1_y","2_x","2_y"))
melt_vars <- strsplit(as.character(dat_melted2[['variable']]),'_')
dat_melted2[,'DF'] <- sapply(melt_vars, `[[`,1)
dat_melted2[,'variable'] <- sapply(melt_vars, `[[`,2)
dat_merged_long2 <- dcast(dat_melted2, DF+ID~variable)
dat_merged_long2 <- as.data.frame(dat_merged_long2)

## CHECK they are the same.
identical( dat_merged_long2, dat_merged_long)
## [1] FALSE

# Further Inspect
dat_merged_long2 <- dat_merged_long2[,c('ID','x','y','DF')]
mapply( identical, dat_merged_long2, dat_merged_long)
##    ID     x     y    DF 
##  TRUE  TRUE  TRUE FALSE
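A likely reason for the remaining mismatch is the column type: a group label recovered from text is stored as a character, whereas the original label was a number. The following toy example (with made-up objects `long_a` and `long_b`) shows the idea and one way to reconcile the two.

```r
# DF stored as a number versus DF recovered from text
long_a <- data.frame(ID = 1:2, DF = 1)
long_b <- data.frame(ID = 1:2, DF = c("1", "1"))

# Compare the column types
sapply(long_a, class)
sapply(long_b, class)
before <- identical(long_a$DF, long_b$DF) # FALSE

# Convert the text back to numbers and check again
long_b$DF <- as.numeric(long_b$DF)
after <- identical(long_a$DF, long_b$DF) # TRUE
```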

For more tips, see https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf and https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html

8.3 Outputs

Polishing.

Your first figures are typically standard.

Code

# Random Data
x <- seq(1, 10, by=.0002)
e <- rnorm(length(x), mean=0, sd=1)
y <- .25*x + e 

# First Drafts
# qqplot(x, y)
# plot(x, y)

Edit your plot to focus on the most useful information. For others to easily comprehend your work, you must also polish the plot.

Code

# Second Draft: Focus
# (In this example: comparing shapes)
xs <- scale(x)
ys <- scale(y)
# qqplot(xs, ys)

# Third Draft: Polish
qqplot(xs, ys, 
    xlab=expression('['~X-bar(X)~'] /'~s[X]),
    ylab=expression('['~Y-bar(Y)~'] /'~s[Y]),
    pch=16, cex=.5, col=grey(0,.2))
abline(a=0, b=1, lty=2)
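The axis labels above describe what scale() computes by default: subtract the mean and divide by the standard deviation. A quick check on made-up data, where `x_demo` and `manual` are illustrative names:

```r
# Standardize by hand and compare to scale()
set.seed(1)
x_demo <- rnorm(50, mean = 3, sd = 2)
manual <- (x_demo - mean(x_demo)) / sd(x_demo)
auto <- as.numeric(scale(x_demo)) # scale() returns a matrix
is_same <- isTRUE(all.equal(manual, auto))
is_same
```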

When polishing, you must do two things

Add details that are necessary to understand the figure
Remove unnecessary details (see e.g., https://www.edwardtufte.com/notebook/chartjunk/ and https://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/)

Code

# Another Example
xy_dat <- data.frame(x=x, y=y)
par(fig=c(0,1,0,0.9), new=F)
plot(y~x, xy_dat, pch=16, col=rgb(0,0,0,.05), cex=.5,
    xlab='', ylab='') # Format Axis Labels Separately
mtext( 'y=0.25 x + e\n e ~ standard-normal', 2, line=2.2)
mtext( expression(x%in%~'[1,10]'), 1, line=2.2)
#abline( lm(y~x, data=xy_dat), lty=2)
title('Plot with good features, but too excessive in several ways',
    adj=0, font.main=1)

# Outer Legend (https://stackoverflow.com/questions/3932038/)
outer_legend <- function(...) {
  opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), 
    mar=c(0, 0, 0, 0), new=TRUE)
  on.exit(par(opar))
  plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n')
  legend(...)
}
outer_legend('topright', legend='single data point',
    title='do you see the normal distribution?',
    pch=16, col=rgb(0,0,0,.1), cex=1, bty='n')

For useful tips, see C. Wilke (2019) “Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures” https://clauswilke.com/dataviz/

Saving.

You can export figures with specific dimensions

Code

pdf( 'Figures/plot_example.pdf', height=5, width=5)
# plot goes here
dev.off()
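Note that the folder you save to must already exist. A minimal sketch of the full sequence, using a temporary folder so that it runs anywhere (the folder and the toy plot are assumptions for illustration):

```r
# Make sure the output folder exists
fig_dir <- file.path(tempdir(), "Figures")
dir.create(fig_dir, showWarnings = FALSE)

# Open the pdf device, draw the plot, then close the device
fig_file <- file.path(fig_dir, "plot_example.pdf")
pdf(fig_file, height = 5, width = 5)
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared", pch = 16)
dev.off()

# Confirm the file was written
file.exists(fig_file)
```

The file is only complete after dev.off() is called, so a missing or empty pdf usually means the device was left open.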

For plotting math, see https://astrostatistics.psu.edu/su07/R/html/grDevices/html/plotmath.html and https://library.virginia.edu/data/articles/mathematical-annotation-in-r

For exporting options, see ?pdf. For saving other types of files, see png("*.png"), tiff("*.tiff"), and jpeg("*.jpg")

Which features are most informative depends on what you want to show, and you can always mix and match. Learn to edit your figures, and be aware that each type of figure has benefits and costs.

For small datasets, you can plot individual data points with a strip chart. For datasets with spatial information, a map is also helpful. Sometimes tables are better than graphs (see https://www.edwardtufte.com/notebook/boxplots-data-test)
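As one possible illustration of a strip chart, the sketch below uses the built-in USArrests data and splits states by whether their urban population share is above the mean. The grouping and labels are choices made for this example.

```r
# Split murder arrest rates into two groups of states
high_urban <- USArrests$UrbanPop >= mean(USArrests$UrbanPop)
group_label <- ifelse(high_urban, "High urban pop.", "Low urban pop.")
groups <- split(USArrests$Murder, group_label)

# One point per state, jittered sideways to reduce overlap
stripchart(groups, method = "jitter", vertical = TRUE,
    pch = 16, col = grey(0, .5),
    ylab = "Murder arrests per 100,000 people")
```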

Interactive Figures.

Plots can also be made interactive, notably histograms and boxplots.

Code

pop_mean <- mean(USArrests[,'UrbanPop'])
pop_cut <- USArrests[,'UrbanPop'] < pop_mean
murder_lowpop <- USArrests[ pop_cut,'Murder']
murder_highpop <- USArrests[ !pop_cut,'Murder']

library(plotly)
fig <- plot_ly(alpha=0.6, 
    hovertemplate="%{y}")
fig <- fig %>% add_histogram(murder_lowpop, name='Low Pop. (< Mean)')
fig <- fig %>% add_histogram(murder_highpop, name='High Pop. (>= Mean)')
fig <- fig %>% layout(barmode="stack") # barmode="overlay"
fig <- fig %>% layout(
    title="Crime and Urbanization in America 1975",
    xaxis = list(title='Murder Arrests per 100,000 People'),
    yaxis = list(title='Number of States'),
    legend=list(title=list(text='<b> % Urban Pop. </b>'))
)
fig

Code

USArrests[,'ID'] <- rownames(USArrests)
fig <- plot_ly(USArrests,
    y=~Murder, color=~cut(UrbanPop,4),
    alpha=0.6, type="box",
    pointpos=0, boxpoints = 'all',
    hoverinfo='text',    
    text = ~paste('<b>', ID, '</b>',
        "<br>Urban  :", UrbanPop,
        "<br>Assault:", Assault,
        "<br>Murder :", Murder))    
fig <- layout(fig,
    showlegend=FALSE,
    title='Crime and Urbanization in America 1975',
    xaxis = list(title = 'Percent of People in an Urban Area'),
    yaxis = list(title = 'Murder Arrests per 100,000 People'))
fig

Tables.

You can also export tables in a variety of formats, for other software programs to easily read

Code

library(stargazer)
# summary statistics
stargazer(USArrests,
    type='html', 
    summary=T,
    title='Summary Statistics for USArrests')

**Summary Statistics for USArrests**

| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| Murder | 50 | 7.788 | 4.356 | 0.800 | 17.400 |
| Assault | 50 | 170.760 | 83.338 | 45 | 337 |
| UrbanPop | 50 | 65.540 | 14.475 | 32 | 91 |
| Rape | 50 | 21.232 | 9.366 | 7.300 | 46.000 |
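If you only need the numbers in a file that other programs can open, base R is enough. The sketch below computes similar summary statistics by hand and writes them to a csv file; the object names and the output location are assumptions for illustration.

```r
# Summary statistics computed by hand
num <- USArrests[, c("Murder", "Assault", "UrbanPop", "Rape")]
summ <- data.frame(
    Statistic = names(num),
    N = sapply(num, length),
    Mean = sapply(num, mean),
    St.Dev = sapply(num, sd),
    Min = sapply(num, min),
    Max = sapply(num, max),
    row.names = NULL)

# Export as csv for other software to read
out_file <- file.path(tempdir(), "usarrests_summary.csv")
write.csv(summ, out_file, row.names = FALSE)
```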

You can create a basic interactive table to explore raw data.

Code

data("USArrests")
library(reactable)
reactable(USArrests, filterable=T, highlight=T)

For further data exploration, your plots can also be made interactive via https://plotly.com/r/. For more details, see examples and then applications.

Code

#install.packages("plotly")
library(plotly)

Custom Figures.

Many of the best plots are custom made (see https://www.r-graph-gallery.com/). Here are some that I have made over the years.

8.4 R-Markdown Reports

We will use R Markdown for communicating results to each other. Note that R and R Markdown are both languages. RStudio interprets R code to make statistical computations and interprets R Markdown code to produce pretty documents that contain both writing and statistics. Altogether, your project will use

R: does statistical computations
R Markdown: formats statistical computations for sharing
RStudio: graphical user interface that allows you to easily use both R and R Markdown.

Homework reports are probably the smallest documents you can create. These little reports are almost entirely self-contained (showing both code and output). To make them, you will need to

First install Pandoc on your computer.

Then install any required packages

Code

# Packages for Rmarkdown
install.packages("knitr")
install.packages("rmarkdown")

# Other packages frequently used
#install.packages("plotly") #for interactive plots
#install.packages("sf") #for spatial data

We will create simple reproducible reports via R Markdown.

Example 1: Data Scientism.

See DataScientism.html and then create it by

Clicking the “Code” button in the top right and then “Download Rmd”
Open it with RStudio
Change the name and title to your own, make other edits
Then point-and-click “knit”

Alternatively,

Download the source file from DataScientism.Rmd
Change the name and title to your own, make other edits
Use the console to run

Code

rmarkdown::render('DataScientism.Rmd')
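To see what goes into an .Rmd file, you can also build a tiny one from the console. The sketch below writes a minimal report (a header, one sentence, and one code chunk) to a temporary folder and renders it only if rmarkdown and Pandoc are available. The file name and contents are made up for illustration and are not the DataScientism example.

```r
# Three backticks mark the start and end of a code chunk
fence <- strrep("`", 3)

# Lines of a minimal R Markdown document
rmd_lines <- c(
    "---",
    'title: "My First Report"',
    'author: "Your Name"',
    "output: html_document",
    "---",
    "",
    "Some writing about the data.",
    "",
    paste0(fence, "{r}"),
    "summary(USArrests)",
    fence)

# Write the file
rmd_file <- file.path(tempdir(), "minimal_report.Rmd")
writeLines(rmd_lines, rmd_file)

# Render to html, if the required tools are installed
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
    rmarkdown::render(rmd_file, quiet = TRUE)
}
```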

Example 2: Homework Assignment.

Below is a template of what homework questions (and answers) look like. Create a new .Rmd file from scratch and produce a .html file that looks similar to this:

Problem: Simulate 100 random observations of the form \(y=x\beta+\epsilon\) and plot the relationship. Plot and explore the data interactively via plotly, https://plotly.com/r/line-and-scatter/. Then play around with different styles, https://www.r-graph-gallery.com/13-scatter-plot.html, to best express your point.

Solution: I simulate \(100\) observations for \(\epsilon \sim 2\times N(0,1)\) and \(\beta=4\), as seen in this single chunk. Notice an upward trend.

Code

# Simulation
n <- 100
E <- rnorm(n)
X <- seq(n)
Y <- 4*X + 2*E
# Plot
library(plotly)
dat <- data.frame(X=X,Y=Y)
plot_ly( data=dat, x=~X, y=~Y)

Code


# To Do:
# 1. Fit a regression line
# 2. Color points by their residual value
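One possible starting point for the two items, sketched in base R rather than plotly: fit the line with lm() and color each point by the rank of its residual. The palette and the seed are arbitrary choices for this illustration, and other approaches are equally valid.

```r
# Simulate data as in the solution above
set.seed(1)
n <- 100
X <- seq(n)
Y <- 4*X + 2*rnorm(n)

# 1. Fit a regression line
fit <- lm(Y ~ X)

# 2. Color points by residual (blue = most negative, red = most positive)
res <- resid(fit)
pal <- colorRampPalette(c("blue", "grey", "red"))(n)
cols <- pal[rank(res, ties.method = "first")]

plot(X, Y, pch = 16, col = cols)
abline(fit, lty = 2)
```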

8.5 Further Reading

For more guidance on how to create Rmarkdown documents, see

https://github.com/rstudio/cheatsheets/blob/main/rmarkdown.pdf
https://cran.r-project.org/web/packages/rmarkdown/vignettes/rmarkdown.html
http://rmarkdown.rstudio.com
https://bookdown.org/yihui/rmarkdown/
https://bookdown.org/yihui/rmarkdown-cookbook/
https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/rmarkdown.html
An Introduction to the Advanced Theory and Practice of Nonparametric Econometrics. Racine 2019. Appendices B & D.
https://rmd4sci.njtierney.com/using-rmarkdown.html
https://alexd106.github.io/intro2R/Rmarkdown_intro.html

If you are still lost, try one of the many online tutorials (such as these)