1 First Steps
1.1 Why Program in R?
You should program your statistical analysis, and we will cover some of the basics of how to do this in R. You also want your work to be replicable and reproducible:
- Replicable: someone collecting new data comes to the same results.
- Reproducible: someone reusing your data comes to the same results.
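The distinction can be sketched with a small simulation. This is only an illustration under stated assumptions: the data are simulated rather than collected, the object names (dat_original, dat_new, est_first, est_rerun, est_new) are hypothetical, and "the same results" for new data is read loosely as "statistically similar".

```r
# Fix the random number generator so this sketch gives the same draws on every run
set.seed(1)

# Hypothetical "original" data: 100 draws from a population with mean 10
dat_original <- rnorm(100, mean = 10, sd = 2)

# Reproducible: rerunning the same code on the same data gives the identical result
est_first <- mean(dat_original)
est_rerun <- mean(dat_original)
print(identical(est_first, est_rerun))

# Replicable: newly collected data should give a similar, but not identical, result
dat_new <- rnorm(100, mean = 10, sd = 2)
est_new <- mean(dat_new)
print(c(original = est_first, new = est_new))
```

The first check is exact because nothing about the data or code changed. The second comparison is only approximate because the new sample differs from the original one.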
You can read more about the distinction in many places, including
- https://www.annualreviews.org/doi/10.1146/annurev-psych-020821-114157
- https://nceas.github.io/sasap-training/materials/reproducible_research_in_r_fairbanks/
We focus on R because it is good for complex stats, concise figures, and coherent organization. It is built and developed by applied statisticians for statistics, and used by many in academia and industry. For students, think about labor demand and what may be good for getting a job. Do some of your own research to best understand how much to invest.
My main sell to you is that being reproducible is in your own self-interest.
An example workflow.
First Steps…
Step 1: Some ideas and data about how variable \(X_{1}\) affects \(Y_{1}\)
- You copy some data into a spreadsheet and manually aggregate it.
- You do some calculations and tables in the same spreadsheet.
- You do some other analysis from here and there, using this software and that.
Step 2: Pursuing the lead for a week or two
- You extend your dataset with more observations.
- You copy the data into a spreadsheet and manually aggregate it.
- You do some more calculations and tables, same as before.
A Little Way Down the Road …
1 month later, someone asks about another factor: \(X_{2}\)
- You download some other type of data.
- You repeat Step 2 with some data on \(X_{2}\).
- The details from your “point and click” method are a bit fuzzy.
- It takes a little time, but you successfully redo the analysis.
4 months later, someone asks about another factor: \(X_{3}\to Y_{1}\)
- You again repeat Step 2 with some data on \(X_{3}\).
- You’re pretty sure none of the tables you tried messed up the order of the rows or columns.
- It takes more time and effort. The data processing was not transparent, but you eventually redo the analysis.
6 months later, you want to explore: \(X_{2} \to Y_{2}\).
- You found out Excel had some bugs in its statistical calculations (see e.g., https://biostat.app.vumc.org/wiki/pub/Main/TheresaScott/StatsInExcel.TAScot.handout.pdf). You now use a new version of the spreadsheet.
- You’re not sure you merged everything correctly. After much time and effort, most (but not all) of the numbers match exactly.
2 years later, your boss wants you to replicate: \(\{ X_{1}, X_{2}, X_{3} \} \to Y_{1}\)
- A rival has proposed something new. Their idea doesn’t actually make any sense, but their figures and statistics look better.
- You don’t even use that computer anymore and a collaborator who handled the data on \(X_{2}\) has moved on.
An alternative workflow.
Suppose you decided to code what you did beginning with Step 2.
It does not take much time to update or replicate your results.
- Your computer runs for 2 hours and reproduces the figures and tables.
- You also rewrote your big calculations to use multiple cores. This took two hours to do but saves 6 hours each time you rerun your code.
- You add some more data. It adds almost no time to see whether much has changed.
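A coded version of Step 2 might look like the following minimal sketch. The data, the column names (group, x1, y1), and the function analyze() are all hypothetical stand-ins for whatever was in the spreadsheet. The point is only that the aggregation lives in one function that can be rerun unchanged.

```r
# Hypothetical data standing in for the spreadsheet; column names are assumptions
dat <- data.frame(
    group = c("a", "a", "b", "b"),
    x1 = c(1, 2, 3, 4),
    y1 = c(2, 4, 5, 9))

# One function holds the whole analysis, so every rerun repeats the exact same steps
analyze <- function(d) {
    # Average x1 and y1 within each group (replaces the manual aggregation)
    aggregate(cbind(x1, y1) ~ group, data = d, FUN = mean)
}

# First pass on the original data
tab1 <- analyze(dat)
print(tab1)

# Adding more observations is one line, followed by rerunning the same function
dat_new <- data.frame(group = c("a", "b"), x1 = c(3, 5), y1 = c(6, 10))
tab2 <- analyze(rbind(dat, dat_new))
print(tab2)
```

Because the new rows pass through the same function as the old ones, there is no manual copy/paste step where rows or columns could be reordered by accident.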
Your results are transparent and easier to build on.
- You see the exact steps you took and find an error.
- Glad you found it before sending it out! See https://retractionwatch.com/ and https://econjwatch.org/
- Google “worst excel errors” and note how frequently they arise from copy/paste via the “point-and-click” approach. Future economists should also read https://core.ac.uk/download/pdf/300464894.pdf.
- You try out a new plot you found in The Visual Display of Quantitative Information, by Edward Tufte.
- It’s not a standard plot, but Google answers most of your questions.
- Tutorials help avoid bad practices, such as plotting 2D data as a 3D object (see e.g., https://clauswilke.com/dataviz/no-3d.html).
- You try out an obscure statistical approach that’s hot in your field.
- It doesn’t make the report, but you have some confidence that the candidate issue isn’t a big problem.
1.2 First Steps
Install R
First install R. Then install RStudio.
For help setting up, see any of the following links
- https://learnr-examples.shinyapps.io/ex-setup-r/
- https://rstudio-education.github.io/hopr/starting.html
- https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/installr.html
- https://cran.r-project.org/doc/manuals/R-admin.html
- https://courses.edx.org/courses/UTAustinX/UT.7.01x/3T2014/56c5437b88fa43cf828bff5371c6a924/
- https://owi.usgs.gov/R/training-curriculum/installr/
- https://www.earthdatascience.org/courses/earth-analytics/document-your-science/setup-r-rstudio/
For Fedora users, note that you need to first enable the repo and then install.
Make sure you have the latest version of R and RStudio for class. If not, then reinstall.
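One way to see which version of R is installed is to ask R itself. The sketch below uses only base R. The threshold "4.0.0" is a placeholder assumption for illustration, not a class requirement stated here, so compare against whatever the current release is.

```r
# Store and print the full version string of the running R installation
r_version <- R.version.string
print(r_version)

# Compare the installed version against a placeholder minimum ("4.0.0" is an assumption)
is_recent <- getRversion() >= "4.0.0"
print(is_recent)
```

If is_recent prints FALSE, reinstalling R from one of the links above is the simplest fix.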
Interfacing with R
RStudio is the easiest to get going with. (There are other GUIs.) There are 4 panes. The top left is where you write and save code.
- Create and save a new R Script file, My_First_Script.R. (You could also use a plain .txt file.)
The pane below is where your code is executed. For all following examples, make sure to both execute and store your code.
Note that the coded examples generally have objects, functions, and comments.
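As a minimal sketch of those three ingredients, the lines below could be typed into My_First_Script.R and executed. The object names (x, x_sum, x_mean) are arbitrary choices for illustration.

```r
# This line is a comment: R ignores everything after the # symbol

# Create an object named x that stores the numbers 1 through 5
x <- c(1, 2, 3, 4, 5)

# Apply the functions sum() and mean() to the object and store the results
x_sum <- sum(x)
x_mean <- mean(x)

# Print the stored results in the console pane
print(x_sum)
print(x_mean)
```

Here x, x_sum, and x_mean are objects, c(), sum(), mean(), and print() are functions, and the lines starting with # are comments.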
1.3 Further Reading
There are many good and free programming materials online.
The most common tasks can be found at https://github.com/rstudio/cheatsheets/blob/main/rstudio-ide.pdf
Some of my programming examples originally come from https://r4ds.had.co.nz/ and I recommend https://intro2r.com. I have also used online material from many places over the years, including
- https://cran.r-project.org/doc/manuals/R-intro.html
- R Graphics Cookbook, 2nd edition. Winston Chang. 2021. https://r-graphics.org/
- R for Data Science. H. Wickham and G. Grolemund. 2017. https://r4ds.had.co.nz/index.html
- An Introduction to R. W. N. Venables, D. M. Smith, R Core Team. 2017. https://colinfay.me/intro-to-r/
- Introduction to R for Econometrics. Kieran Marray. https://bookdown.org/kieranmarray/intro_to_r_for_econometrics/
- Wollschläger, D. (2020). Grundlagen der Datenanalyse mit R: eine anwendungsorientierte Einführung. http://www.dwoll.de/rexrepos/
- Spatial Data Science with R: Introduction to R. Robert J. Hijmans. 2021. https://rspatial.org/intr/index.html
What we cover in this primer should be enough to get you going. But there are also many good, free online tutorials and courses.
- https://www.econometrics-with-r.org/1.2-a-very-short-introduction-to-r-and-rstudio.html
- https://rafalab.github.io/dsbook/
- https://moderndive.com/foreword.html
- https://rstudio.cloud/learn/primers/1.2
- https://cran.r-project.org/manuals.html
- https://stats.idre.ucla.edu/stat/data/intro_r/intro_r_interactive_flat.html
- https://cswr.nrhstat.org/app-r
For more on why to program in R, see
- http://www.r-bloggers.com/the-reproducibility-crisis-in-science-and-prospects-for-r/
- http://fmwww.bc.edu/GStat/docs/pointclick.html
- https://github.com/qinwf/awesome-R#reproducible-research
- A Guide to Reproducible Code in Ecology and Evolution
- https://biostat.app.vumc.org/wiki/pub/Main/TheresaScott/ReproducibleResearch.TAScott.handout.pdf