17 Why?

You should make your work reproducible, and we will cover some of the basics of how to do this in R. You also want your work to be replicable.

Replicable: someone collecting new data comes to the same results.
Reproducible: someone reusing your data comes to the same results.

You can read more about the distinction in many places.

My main sell to you is that being reproducible is in your own self-interest.

17.1 An example workflow

Taking First Steps …

Step 1: Some Ideas and Data

\(X_{1} \to Y_{1}\)

You copy some data into a spreadsheet and manually aggregate it.
You do some calculations and make tables in the same spreadsheet.
You do some other analysis here and there, using this software and that.

Step 2: Pursuing the lead for a week or two

You extend your dataset with more observations.
You copy the data into a spreadsheet and manually aggregate it.
You do some more calculations and tables, same as before.

Then, a Little Way Down the Road …

1 month later, someone asks about another factor: \(X_{2}\)

You download some other type of data.
You repeat Step 2 with some data on \(X_{2}\).
The details from your “point and click” method are a bit fuzzy.
It takes a little time, but you successfully redo the analysis.

4 months later, someone asks about another factor: \(X_{3}\to Y_{1}\)

You again repeat Step 2 with some data on \(X_{3}\).
You’re pretty sure none of the tables you tried messed up the order of the rows or columns.
It takes more time and effort. The data processing was not transparent, but you eventually redo the analysis.

6 months later, you want to explore: \(X_{2} \to Y_{2}\).

You found out Excel had some bugs in its statistical calculations (see e.g., https://biostat.app.vumc.org/wiki/pub/Main/TheresaScott/StatsInExcel.TAScot.handout.pdf). You now use a new version of the spreadsheet.
You’re not sure you merged everything correctly. After much time and effort, most (but not all) of the numbers match exactly.

2 years later, you want to replicate: \(\{ X_{1}, X_{2}, X_{3} \} \to Y_{1}\)

A rival has proposed something new. Their idea doesn’t actually make any sense, but their figures and statistics look better.
You don’t even use that computer anymore and a collaborator who handled the data on \(X_{2}\) has moved on.

17.2 An alternative workflow

Suppose you decided to code what you did beginning with Step 2.
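As a rough sketch of what "coding Step 2" might look like, the script below builds a small dataset, aggregates it, and runs a regression. The data are simulated as a stand-in for a file you would normally read in, and every name here (`raw`, `group`, `x1`, `y1`, `coef_table.csv`) is hypothetical.

```r
# Hypothetical stand-in for the raw data. In practice you would read a file,
# for example with read.csv(), instead of simulating.
set.seed(1)  # fix the seed so the simulated data are identical on every run
n <- 200
raw <- data.frame(
    group = sample(c("a", "b", "c", "d"), n, replace = TRUE),
    x1 = rnorm(n)
)
raw$y1 <- 1 + 2 * raw$x1 + rnorm(n)

# Aggregate in code rather than by hand: group means of x1 and y1
agg <- aggregate(cbind(x1, y1) ~ group, data = raw, FUN = mean)

# Analysis step: regress y1 on x1 and keep the coefficient table
fit <- lm(y1 ~ x1, data = raw)
coef_table <- round(coef(summary(fit)), 3)

# Save the table to disk so it can be regenerated by rerunning the script
out_file <- file.path(tempdir(), "coef_table.csv")
write.csv(coef_table, out_file)
```

Because every step lives in the script, adding observations or another factor means editing a line and rerunning, not retracing clicks from memory.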

It does not take much time to update or replicate your results.

Your computer runs for 2 hours and reproduces the figures and tables.
You also rewrote your big calculations to use multiple cores. This took two hours to do but saves six hours each time you rerun your code.
You add some more data. It adds almost no time to see whether much has changed.
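One way the multi-core rewrite could look is sketched below with the `parallel` package, which ships with R. The bootstrap task, the function `boot_mean`, and the choice of two workers are assumptions made for illustration, not the calculation described above.

```r
library(parallel)  # included in standard R installations

# Hypothetical "big calculation": the mean of one bootstrap resample
set.seed(1)
x <- rnorm(1000)
boot_mean <- function(i, values) mean(sample(values, replace = TRUE))

# Start two worker processes; detectCores() reports how many cores are available
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 123)  # reproducible random numbers on the workers

# Run 500 bootstrap draws, split across the workers
boot_par <- parSapply(cl, 1:500, boot_mean, values = x)
stopCluster(cl)  # always shut the workers down when finished

# Bootstrap standard error of the mean
se_boot <- sd(boot_par)
```

The extra argument is named `values` rather than `x` on purpose: `parSapply` passes extra arguments through functions that already have an argument called `x`, so reusing that name causes an error.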

Your results are transparent and easier to build on.

You see the exact steps you took and find an error.
- Glad you found it before publication! See https://retractionwatch.com/ and https://econjwatch.org/
- Google “worst Excel errors” and note how often they arise from copy/paste via the “point-and-click” approach. Future economists should also read https://core.ac.uk/download/pdf/300464894.pdf.
You try out a new plot you found in The Visual Display of Quantitative Information, by Edward Tufte.
- It’s not a standard plot, but Google answers most of your questions.
- Tutorials help avoid bad practices, such as plotting 2D data as a 3D object (see e.g., https://clauswilke.com/dataviz/no-3d.html).
You try out an obscure statistical approach that’s hot in your field.
- It doesn’t make the paper, but you have some confidence that the candidate issue isn’t a big problem.

Note that R (and R Markdown) is a good choice for addressing this issue; see

http://www.r-bloggers.com/the-reproducibility-crisis-in-science-and-prospects-for-r/
http://fmwww.bc.edu/GStat/docs/pointclick.html
https://github.com/qinwf/awesome-R#reproducible-research
A Guide to Reproducible Code in Ecology and Evolution
https://biostat.app.vumc.org/wiki/pub/Main/TheresaScott/ReproducibleResearch.TAScott.handout.pdf

17.3 Task Views

Task views list relevant packages.

For all students and early researchers,

https://cran.r-project.org/web/views/ReproducibleResearch.html

For microeconometrics,

https://cran.r-project.org/web/views/Econometrics.html

For spatial econometrics

Multiple packages may use the same function name for different commands. In this case, use the syntax package::function to specify the package. For example,

devtools::install_github
remotes::install_github
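To see why the `package::function` syntax matters, here is a small sketch that needs no extra packages. It defines a workspace function named `median` to mimic a name clash (a hypothetical stand-in for two packages exporting the same name) and then calls the `stats` version explicitly.

```r
# A workspace function that shares its name with stats::median.
# This mimics two packages exporting the same function name.
median <- function(x) "my own median"

# The unqualified call finds the workspace version first
res_masked <- median(c(1, 5, 9))

# The qualified call always uses the version from the stats package
res_stats <- stats::median(c(1, 5, 9))

# Remove the masking function so later code sees stats::median again
rm(median)
```

Here `res_masked` holds the string returned by the workspace function, while `res_stats` holds the numeric median, 5.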

Don’t fret. Sometimes there is no specific package for your data.

Odds are, you can do most of what you want with base code.

Packages just wrap base code in convenient formats.
See https://cran.r-project.org/web/views/ for topical overviews.
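As an illustration of how thin such a wrapper can be, the sketch below defines a hypothetical convenience function, `group_means`, of the kind a package might export. Underneath it is a single call to base R's `tapply`.

```r
# A hypothetical convenience function, similar in spirit to what a package provides
group_means <- function(values, groups) {
    # tapply is base R: it applies mean() to `values` within each level of `groups`
    tapply(values, groups, mean)
}

y <- c(1, 2, 3, 4, 5, 6)
g <- c("a", "a", "b", "b", "c", "c")

means <- group_means(y, g)  # group means: a = 1.5, b = 3.5, c = 5.5
```

If no package offers exactly what you need, writing a small function like this is often enough.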

Statisticians might have different naming conventions

If the usual software just spits out a nice plot, you might have to dig a little to know precisely what you want.
Your data are fundamentally numbers, strings, etc. You only have to figure out how to read them in.
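Here is a minimal sketch of reading an unusual format with base R only. The semicolon-separated lines in `raw_lines` are made up for illustration; with a real file you would typically get them from `readLines()`.

```r
# Hypothetical raw text, standing in for the lines of an unusual data file
raw_lines <- c("id;name;value", "1;alpha;3.5", "2;beta;4.25", "3;gamma;10")

# Option 1: let read.table parse the text directly
dat <- read.table(text = raw_lines, sep = ";", header = TRUE,
    stringsAsFactors = FALSE)

# Option 2: parse by hand with base string tools.
# Drop the header line, split each line on ";", and keep the third field.
fields <- strsplit(raw_lines[-1], ";", fixed = TRUE)
values <- as.numeric(sapply(fields, `[`, 3))
```

Both routes end in ordinary numbers and strings, which any base function can then work with.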