10 Why be reproducible


Before hopping into reproducible programming, let's think about why. My main sell to you is that it is in your own self-interest.

10.1 An example workflow

Taking First Steps …

Step 1: Some Ideas and Data

\(X_{1} \to Y_{1}\)

  • You copy some data into a spreadsheet
  • do some calculations and tables in the same spreadsheet
  • some other analysis from here and there, using this software and that.

Step 2: Pursuing the lead for a week or two

  • you beef up the data you got
  • add some other types of data
  • copy in spreadsheet data and aggregate it manually
  • do some more calculations and tables, same as before

Then, a Little Way Down the Road …

1 month later, someone asks about another factor: \(X_{2}\)

  • You repeat Step 2 with some data on \(X_{2}\).
  • The details from your “point and click” method are a bit fuzzy.

It takes a little time, but you successfully redo the analysis.

4 months later, someone asks about another factor: \(X_{3}\to Y_{1}\)

  • You again repeat Step 2 with some data on \(X_{3}\).
  • You’re pretty sure
    • it’s the latest version of the spreadsheet.
    • none of the tables you tried messed up the order of the rows or columns.

It takes more time – the data processing was not transparent.

6 months later, you want to explore: \(X_{2} \to Y_{2}\).

2 years later, you want to replicate: \(\{ X_{1}, X_{2}, X_{3} \} \to Y_{1}\)

  • A rival has proposed an alternative theory. Their idea doesn’t actually make any sense, but their visuals are better and statistics are more sophisticated.
  • You don’t even have that computer anymore.
  • A collaborator who handled the data on \(X_{2}\) has moved on.

10.2 An alternative workflow

Suppose you decided to code what you did beginning with Step 2.

It doesn’t take much time to update or replicate your results.

  • Your computer runs for 2 hours and reproduces the figures and tables. (You wrote your big calculations to use multiple cores, which saves 6 hours each time.)
  • You decided to add some more data, and it adds almost no time.
  • You see the exact steps you took and find an error (glad you found it before publication!)
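
As a rough illustration of what "coding what you did" might look like, here is a minimal sketch in R. Everything in it is an assumption made for the example: the data are simulated, and the names `dat`, `group`, `x1`, and `y1` are hypothetical stand-ins for your own spreadsheet columns. It records the aggregation step, rebuilds a regression table, and spreads a larger calculation (a bootstrap) across cores with the `parallel` package that ships with R.

```r
library(parallel)

## L'Ecuyer-CMRG is the generator the parallel package documents for
## random-number streams across worker processes
RNGkind("L'Ecuyer-CMRG")
set.seed(1)

## Hypothetical data: simulated values stand in for a spreadsheet
dat <- data.frame(
    group = rep(c("a", "b", "c"), each = 20),
    x1 = rnorm(60))
dat$y1 <- 1 + 2 * dat$x1 + rnorm(60)

## The "manual aggregation" step, now written down and repeatable
group_means <- aggregate(cbind(x1, y1) ~ group, data = dat, FUN = mean)

## A regression table that can be regenerated at any time
fit <- lm(y1 ~ x1, data = dat)
coef_table <- summary(fit)$coefficients

## A bigger calculation: bootstrap the slope, spread across cores
## (mclapply cannot fork on Windows, so fall back to one core there)
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
boot_slopes <- mclapply(1:200, function(b) {
    idx <- sample(nrow(dat), replace = TRUE)
    coef(lm(y1 ~ x1, data = dat[idx, ]))[["x1"]]
}, mc.cores = n_cores)
boot_se <- sd(unlist(boot_slopes))

print(group_means)
print(coef_table)
print(boot_se)
```

Adding a new factor later would then mean adding a column and editing the formula, rather than repeating a sequence of clicks from memory.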

Your results are transparent and easier to build on.

  • You easily see that not much has changed with the new data.
  • You try out a new plot you found in The Visual Display of Quantitative Information, by Edward Tufte.
  • You try out an obscure statistical approach that’s hot in your field.
    • it doesn’t make the paper, but you have some confidence that the candidate issue isn’t a big problem

10.3 R and R-Markdown

We will use R Markdown for reproducible research, which is a good choice.

Note that R and R Markdown are both languages: R code is run to produce statistics, and R Markdown code is rendered to produce pretty documents that contain both writing and statistics. RStudio is the interface in which you write and run both. (You should already be a bit familiar with R, but not necessarily R Markdown.) Altogether, your project will use

  • R as our software
  • RStudio as our GUI
  • R Markdown as our document

Both are good for teaching

Homework reports are the smallest, and probably the first, documents you will create. We will create little homework reports using R Markdown that are almost entirely self-contained (showing both code and output). To do this, you will need to install Pandoc on your computer.
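
To give a sense of what such a report looks like, here is a minimal sketch that writes a tiny R Markdown file and renders it. The file name `homework1.Rmd`, the title, and the report contents are all hypothetical, and the file is saved to a temporary folder only for illustration. The rendering step assumes the `rmarkdown` package and Pandoc are installed, and is skipped otherwise.

```r
## Three backticks, built programmatically so this example can itself
## sit inside a code block; in a real .Rmd file you type them directly
fence <- strrep("`", 3)

## Contents of a small, hypothetical homework report
rmd_lines <- c(
    "---",
    "title: \"Homework 1\"",
    "output: html_document",
    "---",
    "",
    "The average of the simulated data is computed below.",
    "",
    paste0(fence, "{r}"),
    "set.seed(1)",
    "x <- rnorm(100)",
    "mean(x)",
    fence,
    "",
    "The chunk above shows both the code and its output.")

## Save the report source (use your project folder in practice)
rmd_file <- file.path(tempdir(), "homework1.Rmd")
writeLines(rmd_lines, rmd_file)

## Render only if rmarkdown and Pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
    html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
    print(html_file)
}
```

In RStudio, opening the `.Rmd` file and clicking the Knit button performs the same rendering step.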

Install any required packages

## Packages for Rmarkdown
install.packages("knitr")
install.packages("rmarkdown")
install.packages("bookdown")

## Other packages used in this primer
install.packages("plotly")
install.packages("sf")
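
If you rerun the commands above often, you may prefer to install only what is missing. The helper below is a hypothetical convenience function (`missing_packages` is not part of any package); it is a sketch that checks each name with `requireNamespace` and returns the ones not found.

```r
## Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

## Packages used in this primer
pkgs <- c("knitr", "rmarkdown", "bookdown", "plotly", "sf")
to_install <- missing_packages(pkgs)
print(to_install)

## Uncomment to install whatever is missing
## if (length(to_install) > 0) install.packages(to_install)
```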

To get started with R Markdown, you can first read and work through https://jadamso.github.io/Rbooks/small-scale-projects.html, and then recreate https://jadamso.github.io/Rbooks/small-scale-projects.html#a-homework-example yourself.