18  Data Analysis


18.1 Outputs

Interactive Figures.

Notably, histograms, boxplots, and scatterplots

Code
library(plotly) #install.packages('plotly')
USArrests[, 'ID'] <- rownames(USArrests)

# Scatter Plot
fig <- plot_ly(
    USArrests, x = ~ UrbanPop, y = ~ Assault,
    mode='markers',
    type='scatter',
    hoverinfo='text',
    marker=list( color='rgba(0, 0, 0, 0.5)'),
    text = ~ paste('<b>', ID, '</b>',
        '<br>Urban  :', UrbanPop,
        '<br>Assault:', Assault))
fig <- layout(fig,
    showlegend=F,
    title='Crime and Urbanization in America 1975',
    xaxis = list(title = 'Percent of People in an Urban Area'),
    yaxis = list(title = 'Assault Arrests per 100, 000 People'))
fig
Code
# Box Plot
fig <- plot_ly(USArrests,
    y= ~ Murder, color= ~ cut(UrbanPop, 4),
    alpha=0.6, type='box',
    pointpos=0, boxpoints = 'all',
    hoverinfo='text',    
    text = ~ paste('<b>', ID, '</b>',
        '<br>Urban  :', UrbanPop,
        '<br>Assault:', Assault,
        '<br>Murder :', Murder))    
fig <- layout(fig,
    showlegend=FALSE,
    title='Crime and Urbanization in America 1975',
    xaxis = list(title = 'Percent of People in an Urban Area'),
    yaxis = list(title = 'Murders Arrests per 100, 000 People'))
fig
Code
pop_mean <- mean(USArrests[, 'UrbanPop'])
pop_cut <- USArrests[, 'UrbanPop'] < pop_mean
murder_lowpop <- USArrests[ pop_cut, 'Murder']
murder_highpop <- USArrests[ !pop_cut, 'Murder']

# Overlapping Histograms
fig <- plot_ly(alpha=0.6, hovertemplate='%{y}')
fig <- add_histogram(fig, murder_lowpop, name='Low Pop. (< Mean)',
    histnorm = 'probability density',
    xbins = list(start=0, size=2))
fig <- add_histogram(fig, murder_highpop, name='High Pop (>= Mean)',
    histnorm = 'probability density',
    xbins = list(start=0, size=2))
fig <- layout(fig,
    barmode='overlay',
    title='Crime and Urbanization in America 1975',
    xaxis = list(title='Murders Arrests per 100, 000 People'),
    yaxis = list(title='Density'),
    legend=list(title=list(text='<b> % Urban Pop. </b>')) )
fig
Code

# Possible, but less preferable, to stack histograms
# barmode='stack', histnorm='count'

Note that many plots can be made interactive via https://plotly.com/r/. For more details, see examples and then applications.

If you have many points, for example, you can make a 2D histogram.

Code
library(plotly)
fig <- plot_ly(
    USArrests, x = ~ UrbanPop, y = ~ Assault)
fig <- add_histogram2d(fig, nbinsx=25, nbinsy=25)
fig

Interactive Tables.

You can create an interactive table to explore raw data.

Code
data('USArrests')
library(reactable)
reactable(USArrests, filterable=T, highlight=T)

You can create an interactive table that summarizes the data too.

Code
# Compute summary statistics
vars <- names(USArrests)
stats_list <- vector('list', length(vars))
for (i in seq_along(vars)) {
  v <- vars[i]
  x <- USArrests[[v]]
  stats_list[[i]] <- c(
    Variable = v,
    N       = sum(!is.na(x)),
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Min     = min(x, na.rm = TRUE),
    Q1      = as.numeric(quantile(x, 0.25, na.rm = TRUE)),
    Median  = median(x, na.rm = TRUE),
    Q3      = as.numeric(quantile(x, 0.75, na.rm = TRUE)),
    Max     = max(x, na.rm = TRUE)
  )
}

# Convert list to data frame with numeric columns 
stats_df <- as.data.frame(do.call(rbind, stats_list), stringsAsFactors = FALSE)
num_cols <- setdiff(names(stats_df), 'Variable')
for (col in num_cols) {
    stats_df[[col]] <- round(as.numeric(stats_df[[col]]), 3)
}

# Display interactively
reactable(stats_df)

Polishing.

Your first figures are typically standard, and probably not as good as they should be. Edit your plot to focus on the most useful information. For others to easily comprehend your work, you must also polish the plot. When polishing, you must do two things:

  • Add details that are necessary.
  • Remove details that are not necessary.

Data-Ink Ratio. Tufte (1983) defines the data-ink ratio as the share of ink in a figure devoted to displaying data. Non-data ink includes gridlines, borders, background shading, redundant labels, and decorative elements. A good figure has a high data-ink ratio: erase any element that does not help the reader understand the data. If removing something changes nothing about what the reader learns, it should go.

\[\text{data-ink ratio} = \frac{\text{ink used to display data}}{\text{total ink in the figure}}\]

Code
# Random Data
x <- seq(1, 10, by=.0002)
e <- rnorm(length(x), mean=0, sd=1)
y <- .25*x + e 

# First Draft
# plot(x, y)

# Second Draft: Focus
# (In this example: relationship magnitude)
xs <- scale(x)
ys <- scale(y)
plot(ys, xs,
    xlab='', ylab='',
    pch=16, cex=.5, col=grey(0, .05),
    main=NA)
mtext(expression('['~X[i]-hat(M)[X]~'] /'~hat(S)[X]), 1, line=2.5)
mtext(expression('['~Y[i]-hat(M)[Y]~'] /'~hat(S)[Y]), 2, line=2.5)
# Add a 45 degree line
abline(a=0, b=1, lty=2, col=rgb(1, 0, 0, .8))
legend('topleft',
    legend=c('data point', '45 deg. line'),
    pch=c(16, NA), lty=c(NA, 2), col=c(grey(0, .05), rgb(1, 0, 0, .8)),
    bty='n')
title('Standardized Relationship', font.main=1)

Code
# Another Example
xy_dat <- data.frame(x=x, y=y)
par(fig=c(0, 1, 0, 0.9), new=F)
plot(y ~ x, xy_dat, pch=16, col=grey(0, .05), cex=.5,
    main=NA, xlab='', ylab='') # Format Axis Labels Seperately
mtext( 'y=0.25 x + e\n e ~ standard-normal', 2, line=2.2)
mtext( expression(x%in%~'[0, 10]'), 1, line=2.2)
#abline( lm(y ~ x, data=xy_dat), lty=2)
title('Plot with good features, but too excessive in several ways',
    adj=0, font.main=1)

# Outer Legend (https://stackoverflow.com/questions/3932038/)
outer_legend <- function(...) {
  opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), 
    mar=c(0, 0, 0, 0), new=TRUE)
  on.exit(par(opar))
  plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n')
  legend(...)
}
outer_legend('topright', legend='single data point',
    title='do you see the normal distribution?',
    pch=16, col=rgb(0, 0, 0, .1), cex=1, bty='n')

Learn to edit your figures:

Which features are most informative depends on what you want to show, and you can always mix and match. Be aware that each type has benefits and costs. E.g., see

For small datasets, you can plot individual data points with a strip chart. For datasets with spatial information, a map is also helpful. Sometime tables are better than graphs (see https://www.edwardtufte.com/notebook/boxplots-data-test). For useful tips, see C. Wilke (2019) “Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures” https://clauswilke.com/dataviz/

For plotting math, which should be done very sparingly, see https://astrostatistics.psu.edu/su07/R/html/grDevices/html/plotmath.html and https://library.virginia.edu/data/articles/mathematical-annotation-in-r

Static Publishing.

Advanced and Optional

You can export figures with specific dimensions

Code
pdf( 'Figures/plot_example.pdf', height=5, width=5)
# plot goes here
dev.off()

For exporting options, see ?pdf. For saving other types of files, see png("*.png"), tiff("*.tiff"), and jpeg("*.jpg")

You can also export tables in a variety of formats, including many that other software programs can easily read

Code
library(stargazer)
# summary statistics
stargazer(USArrests,
    type='html', 
    summary=T,
    title='Summary Statistics for USArrests')
Summary Statistics for USArrests
Statistic N Mean St. Dev. Min Max
Murder 50 7.788 4.356 0.800 17.400
Assault 50 170.760 83.338 45 337
UrbanPop 50 65.540 14.475 32 91
Rape 50 21.232 9.366 7.300 46.000

Note that many of the best plots are custom made (see https://www.r-graph-gallery.com/). Here are some ones that I have made over the years.

18.2 Quarto Reports

We will use Quarto for communicating results to each other. Note that R and Quarto are both languages. R does statistical computations and Quarto formats those computations into polished documents that contain both writing and statistics. Altogether, your project will use

  • R: does statistical computations
  • Quarto: formats statistical computations for sharing (.qmd files)
  • RStudio: graphical user interface that allows you to easily use both R and Quarto

Homework reports are probably the smallest document you can create. They are simple reproducible reports that are almost entirely self-contained (showing both code and output). To make them, you need Quarto installed on your computer

  • RStudio (version 2022.07 or later) includes Quarto

Example.

Start by opening RStudio, then

  • Click “File > New File > Quarto Document”
  • Add your name and a title
  • Save the file as “COURSEID_Template_Lastname.qmd”
  • Click “Render”
  • Notice how the code in the source document corresponds to formatting in the output document

Now try each of the following steps, re-rendering the document after each step

  • Basic Formatting:
    • Change the title to “My First Quarto Document”
    • Change the bold word Render to italics: Render
    • Create a new section header at the bottom called “Data Analysis”
  • Data Analysis:
    • Make a simple scatterplot for Murder vs. UrbanPop in the USAarrests data
    • Change the simple scatterplot into a polished interactive scatterplot
    • Before the scatterplot add a reactable table for the raw data
    • Write your own description of a) the dataset and b) the relationship between Murder and UrbanPop
  • Polishing
    • Delete the default sections
    • clean up your analysis (make sure the table,figure, and writing are high quality)
    • Email the output, “COURSEID_Template_Lastname.html”, to yourself

18.3 Exercises

  1. When polishing a figure, you must both add necessary details and remove unnecessary ones. Give one example of each for a scatterplot of Murder vs. UrbanPop in the USArrests dataset.

  2. Using the USArrests dataset, compute summary statistics (mean, standard deviation, min, max) for all four variables. Present the results in a single data frame with one row per variable.

  3. Using the USArrests dataset and the plotly library, write R code to create an interactive scatterplot of Murder (y-axis) vs. UrbanPop (x-axis). The hover text should display the state name, murder rate, and urban population percentage.

Further Reading.