18 Data Analysis

18.1 Outputs

Interactive Figures.

Notably, histograms, boxplots, and scatterplots

Code

library(plotly) #install.packages('plotly')
USArrests[, 'ID'] <- rownames(USArrests)

# Scatter Plot
fig <- plot_ly(
    USArrests, x = ~ UrbanPop, y = ~ Assault,
    mode='markers',
    type='scatter',
    hoverinfo='text',
    marker=list( color='rgba(0, 0, 0, 0.5)'),
    text = ~ paste('<b>', ID, '</b>',
        '<br>Urban  :', UrbanPop,
        '<br>Assault:', Assault))
fig <- layout(fig,
    showlegend=F,
    title='Crime and Urbanization in America 1975',
    xaxis = list(title = 'Percent of People in an Urban Area'),
    yaxis = list(title = 'Assault Arrests per 100, 000 People'))
fig

Code

# Box Plot
fig <- plot_ly(USArrests,
    y= ~ Murder, color= ~ cut(UrbanPop, 4),
    alpha=0.6, type='box',
    pointpos=0, boxpoints = 'all',
    hoverinfo='text',    
    text = ~ paste('<b>', ID, '</b>',
        '<br>Urban  :', UrbanPop,
        '<br>Assault:', Assault,
        '<br>Murder :', Murder))    
fig <- layout(fig,
    showlegend=FALSE,
    title='Crime and Urbanization in America 1975',
    xaxis = list(title = 'Percent of People in an Urban Area'),
    yaxis = list(title = 'Murders Arrests per 100, 000 People'))
fig

Code

pop_mean <- mean(USArrests[, 'UrbanPop'])
pop_cut <- USArrests[, 'UrbanPop'] < pop_mean
murder_lowpop <- USArrests[ pop_cut, 'Murder']
murder_highpop <- USArrests[ !pop_cut, 'Murder']

# Overlapping Histograms
fig <- plot_ly(alpha=0.6, hovertemplate='%{y}')
fig <- add_histogram(fig, murder_lowpop, name='Low Pop. (< Mean)',
    histnorm = 'probability density',
    xbins = list(start=0, size=2))
fig <- add_histogram(fig, murder_highpop, name='High Pop (>= Mean)',
    histnorm = 'probability density',
    xbins = list(start=0, size=2))
fig <- layout(fig,
    barmode='overlay',
    title='Crime and Urbanization in America 1975',
    xaxis = list(title='Murders Arrests per 100, 000 People'),
    yaxis = list(title='Density'),
    legend=list(title=list(text='<b> % Urban Pop. </b>')) )
fig

Code


# Possible, but less preferable, to stack histograms
# barmode='stack', histnorm='count'

Note that many plots can be made interactive via https://plotly.com/r/. For more details, see examples and then applications.

If you have many points, for example, you can make a 2D histogram.

Code

library(plotly)
fig <- plot_ly(
    USArrests, x = ~ UrbanPop, y = ~ Assault)
fig <- add_histogram2d(fig, nbinsx=25, nbinsy=25)
fig

Interactive Tables.

You can create an interactive table to explore raw data.

Code

data('USArrests')
library(reactable)
reactable(USArrests, filterable=T, highlight=T)

You can create an interactive table that summarizes the data too.

Code

# Compute summary statistics
vars <- names(USArrests)
stats_list <- vector('list', length(vars))
for (i in seq_along(vars)) {
  v <- vars[i]
  x <- USArrests[[v]]
  stats_list[[i]] <- c(
    Variable = v,
    N       = sum(!is.na(x)),
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Min     = min(x, na.rm = TRUE),
    Q1      = as.numeric(quantile(x, 0.25, na.rm = TRUE)),
    Median  = median(x, na.rm = TRUE),
    Q3      = as.numeric(quantile(x, 0.75, na.rm = TRUE)),
    Max     = max(x, na.rm = TRUE)
  )
}

# Convert list to data frame with numeric columns 
stats_df <- as.data.frame(do.call(rbind, stats_list), stringsAsFactors = FALSE)
num_cols <- setdiff(names(stats_df), 'Variable')
for (col in num_cols) {
    stats_df[[col]] <- round(as.numeric(stats_df[[col]]), 3)
}

# Display interactively
reactable(stats_df)

Polishing.

Your first figures are typically standard, and probably not as good as they should be. Edit your plot to focus on the most useful information. For others to easily comprehend your work, you must also polish the plot. When polishing, you must do two things:

Add details that are necessary.
Remove details that are not necessary.

Must Know

Data-Ink Ratio. Tufte (1983) defines the data-ink ratio as the share of ink in a figure devoted to displaying data. Non-data ink includes gridlines, borders, background shading, redundant labels, and decorative elements. A good figure has a high data-ink ratio: erase any element that does not help the reader understand the data. If removing something changes nothing about what the reader learns, it should go.

\[\text{data-ink ratio} = \frac{\text{ink used to display data}}{\text{total ink in the figure}}\]

Code

# Random Data
x <- seq(1, 10, by=.0002)
e <- rnorm(length(x), mean=0, sd=1)
y <- .25*x + e 

# First Draft
# plot(x, y)

# Second Draft: Focus
# (In this example: relationship magnitude)
xs <- scale(x)
ys <- scale(y)
plot(ys, xs,
    xlab='', ylab='',
    pch=16, cex=.5, col=grey(0, .05),
    main=NA)
mtext(expression('['~X[i]-hat(M)[X]~'] /'~hat(S)[X]), 1, line=2.5)
mtext(expression('['~Y[i]-hat(M)[Y]~'] /'~hat(S)[Y]), 2, line=2.5)
# Add a 45 degree line
abline(a=0, b=1, lty=2, col=rgb(1, 0, 0, .8))
legend('topleft',
    legend=c('data point', '45 deg. line'),
    pch=c(16, NA), lty=c(NA, 2), col=c(grey(0, .05), rgb(1, 0, 0, .8)),
    bty='n')
title('Standardized Relationship', font.main=1)

Code

# Another Example
xy_dat <- data.frame(x=x, y=y)
par(fig=c(0, 1, 0, 0.9), new=F)
plot(y ~ x, xy_dat, pch=16, col=grey(0, .05), cex=.5,
    main=NA, xlab='', ylab='') # Format Axis Labels Seperately
mtext( 'y=0.25 x + e\n e ~ standard-normal', 2, line=2.2)
mtext( expression(x%in%~'[0, 10]'), 1, line=2.2)
#abline( lm(y ~ x, data=xy_dat), lty=2)
title('Plot with good features, but too excessive in several ways',
    adj=0, font.main=1)

# Outer Legend (https://stackoverflow.com/questions/3932038/)
outer_legend <- function(...) {
  opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), 
    mar=c(0, 0, 0, 0), new=TRUE)
  on.exit(par(opar))
  plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n')
  legend(...)
}
outer_legend('topright', legend='single data point',
    title='do you see the normal distribution?',
    pch=16, col=rgb(0, 0, 0, .1), cex=1, bty='n')

Learn to edit your figures:

Which features are most informative depends on what you want to show, and you can always mix and match. Be aware that each type has benefits and costs. E.g., see

For small datasets, you can plot individual data points with a strip chart. For datasets with spatial information, a map is also helpful. Sometime tables are better than graphs (see https://www.edwardtufte.com/notebook/boxplots-data-test). For useful tips, see C. Wilke (2019) “Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures” https://clauswilke.com/dataviz/

For plotting math, which should be done very sparingly, see https://astrostatistics.psu.edu/su07/R/html/grDevices/html/plotmath.html and https://library.virginia.edu/data/articles/mathematical-annotation-in-r

Static Publishing.

Advanced and Optional

You can export figures with specific dimensions

Code

pdf( 'Figures/plot_example.pdf', height=5, width=5)
# plot goes here
dev.off()

For exporting options, see ?pdf. For saving other types of files, see png("*.png"), tiff("*.tiff"), and jpeg("*.jpg")

You can also export tables in a variety of formats, including many that other software programs can easily read

Code

library(stargazer)
# summary statistics
stargazer(USArrests,
    type='html', 
    summary=T,
    title='Summary Statistics for USArrests')

**Summary Statistics for USArrests**

Statistic	N	Mean	St. Dev.	Min	Max

Murder	50	7.788	4.356	0.800	17.400
Assault	50	170.760	83.338	45	337
UrbanPop	50	65.540	14.475	32	91
Rape	50	21.232	9.366	7.300	46.000

Note that many of the best plots are custom made (see https://www.r-graph-gallery.com/). Here are some ones that I have made over the years.

18.2 Quarto Reports

We will use Quarto for communicating results to each other. Note that R and Quarto are both languages. R does statistical computations and Quarto formats those computations into polished documents that contain both writing and statistics. Altogether, your project will use

R: does statistical computations
Quarto: formats statistical computations for sharing (.qmd files)
RStudio: graphical user interface that allows you to easily use both R and Quarto

Homework reports are probably the smallest document you can create. They are simple reproducible reports that are almost entirely self-contained (showing both code and output). To make them, you need Quarto installed on your computer

RStudio (version 2022.07 or later) includes Quarto

Example.

Start by opening RStudio, then

Click “File > New File > Quarto Document”
Add your name and a title
Save the file as “COURSEID_Template_Lastname.qmd”
Click “Render”
Notice how the code in the source document corresponds to formatting in the output document

Now try each of the following steps, re-rendering the document after each step

Basic Formatting:
- Change the title to “My First Quarto Document”
- Change the bold word Render to italics: Render
- Create a new section header at the bottom called “Data Analysis”
Data Analysis:
- Make a simple scatterplot for Murder vs. UrbanPop in the USAarrests data
- Change the simple scatterplot into a polished interactive scatterplot
- Before the scatterplot add a reactable table for the raw data
- Write your own description of a) the dataset and b) the relationship between Murder and UrbanPop
Polishing
- Delete the default sections
- clean up your analysis (make sure the table,figure, and writing are high quality)
- Email the output, “COURSEID_Template_Lastname.html”, to yourself

18.3 Exercises

When polishing a figure, you must both add necessary details and remove unnecessary ones. Give one example of each for a scatterplot of Murder vs. UrbanPop in the USArrests dataset.
Using the USArrests dataset, compute summary statistics (mean, standard deviation, min, max) for all four variables. Present the results in a single data frame with one row per variable.
Using the USArrests dataset and the plotly library, write R code to create an interactive scatterplot of Murder (y-axis) vs. UrbanPop (x-axis). The hover text should display the state name, murder rate, and urban population percentage.

18.1 Outputs

Interactive Figures.

Interactive Tables.

Polishing.

Static Publishing.

18.2 Quarto Reports

Example.

18.3 Exercises

Further Reading.