16 Large Projects | Introductory Economic Statistics: Data-Driven Analysis in R

16.1 Scripting

Basics.

Save the following code as MyFirstScript.R

Code

# Define New Function
sum_squared <- function(x1, x2) {
    y <- (x1 + x2)^2
    return(y)
} 

# Test New Function
x <- c(0,1,3,10,6)
sum_squared(x[1], x[3])
sum_squared(x, x[2])
sum_squared(x, x[7])
sum_squared(x, x)

message('Script Completed')

Restart Rstudio.⁹

Replicate in another tab via

Code

source('MyFirstScript.R')

Note that you may first need to setwd() so your computer knows where you saved your code.¹⁰

After you get this working:

add a the line print(sum_squared(y, y)) to the bottom of MyFirstCode.R.
apply the function to a vectors specified outside of that script
record the session information

Code

# Pass Objects/Functions *to* Script
y <- c(3,1,NA)
source('MyFirstScript.R')

# Pass Objects/Functions *from* Script
z <- sqrt(y)/2
sum_squared(z,z)

# Report all information relevant for replication
sessionInfo()

Note that you can open a new terminal in RStudio in the top bar by clicking ‘tools > terminal > new terminal’

Logging/Sinking.

When executing the makefile, you can also log the output in three different ways:

Inserting some code into the makefile that “sinks” the output

Code

# Project Structure
home_dir    <- path.expand("~/Desktop/Project/")
data_dir_r  <- paste0(data_dir, "Data/Raw/")
data_dir_c  <- paste0(data_dir, "Data/Clean/")
out_dir     <- paste0(hdir, "Output/")
code_dir    <- paste0(hdir, "Code/")

# Log Output
set.wd( code_dir )
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)

# Execute Codes
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )
sessionInfo()

# Stop Logging Output
sink()

Starting a session that “sinks” the makefile

Code

sink("MAKEFILE.Rout", append=TRUE, split=TRUE)
source("MAKEFILE.R")
sink()

Execute the makefile via the commandline

Code

R CMD BATCH MAKEFILE.R MAKEFILE.Rout

16.2 Organizing

Project Structure.

Large sized projects should have their own Project folder on your computer with files, subdirectories with files, and subsubdirectories with files. It should look like this

Project
    └── README.txt
    └── /Code
        └── MAKEFILE.R
        └── RBLOCK_001_DataClean.R
        └── RBLOCK_002_Figures.R
        └── RBLOCK_003_ModelsTests.R
        └── RBLOCK_004_Robust.R
        └── /Logs
            └── MAKEFILE.Rout
    └── /Data
        └── /Raw
            └── Source1.csv
            └── Source2.shp
            └── Source3.txt
        └── /Clean
            └── AllDatasets.Rdata
            └── MainDataset1.Rds
            └── MainDataset2.csv
    └── /Output
        └── MainFigure.pdf
        └── AppendixFigure.pdf
        └── MainTable.tex
        └── AppendixTable.tex
    └── /Writing
        └── /TermPaper
            └── TermPaper.tex
            └── TermPaper.bib
            └── TermPaper.pdf
        └── /Slides
            └── Slides.Rmd
            └── Slides.html
            └── Slides.pdf
        └── /Poster
            └── Poster.Rmd
            └── Poster.html
            └── Poster.pdf
        └── /Proposal
            └── Proposal.Rmd
            └── Proposal.html
            └── Proposal.pdf

There are two main meta-files

README.txt overviews the project structure and what the codes are doing
MAKEFILE explicitly describes and executes all codes (and typically logs the output).

Class Projects. Zip your project into a single file that is easy for others to identify: Class_Project_LASTNAME_FIRSTNAME.zip

MAKEFILE.

If all code is written with the same program (such as R) the makefile can be written in a single language. For us, this looks like

Code

# Project Structure
home_dir    <- path.expand("~/Desktop/Project/")
data_dir_r  <- paste0(data_dir, "Data/Raw/")
data_dir_c  <- paste0(data_dir, "Data/Clean/")
out_dir     <- paste0(hdir, "Output/")
code_dir    <- paste0(hdir, "Code/")

# Execute Codes
# libraries are loaded within each RBLOCK
setwd( code_dir )
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )

# Report all information relevant for replication
sessionInfo()

Notice there is a lot of documentation # like this, which is crucial for large projects. Also notice that anyone should be able to replicate the entire project by downloading a zip file and simply changing home_dir.

If some folders or files need to be created, you can do this within R

Code

# list the files and directories
list.files(recursive=TRUE, include.dirs=TRUE)
# create directory called 'Data'
dir.create('Data')

16.3 Posters and Slides

Posters and presentations are another important type of scientific document. R markdown is good at creating both of these, and actually very good with some additional packages. So we will also use flexdashboard for posters and beamer for presentations.

Poster. See DataScientism_Poster.html and recreate from the source file DataScientism_Poster.Rmd. Simply change the name to your own, and knit the document.

Slides. See DataScientism_Slides.pdf and recreate from the source file DataScientism_Slides.Rmd.

Since beamer is a pdf output, you will need to install Latex. Alternatively, you can install a lightweight version TinyTex from within R

Code

install.packages('tinytex')
tinytex::install_tinytex()  # install TinyTeX

If you cannot install Latex, then you must specify a different output. For example, change output: beamer_presentation to output: ioslides_presentation on line 6 of the source file.

16.4 Applications

Shiny is an R package to build web applications.

Shiny Flexdashboards are nicely formatted Shiny Apps. While it is possible to use Shiny without the Flexdashboard formatting, I think it is easier to remember

.R files are codes for statistical analysis
.Rmd files are for communicating: reports, slides, posters, and apps

Example: Histogram. Download the source file TrialApp1_Histogram_Dashboard.Rmd and open it with rstudio. Then run it with

Code

rmarkdown::run('TrialApp1_Histogram_Dashboard.Rmd')

Within the app, experiment with how larger sample sizes change the distribution.
Edit the app to let the user specify the number of breaks in the histogram.

If you are having difficulty, you can try working first with the barebones shiny code. To do this, download TrialApp0_Histogram.Rmd and edit it in Rstudio. You can run the code with rmarkdown::run('TrialApp0_Histogram.Rmd').

16.5 Further Reading

Your code should be readable and error free. For code writing guides, see

For organization guidelines, see

For additional logging capabilities, see https://cran.r-project.org/web/packages/logr/

For very large projects, there are many more tools available at https://cran.r-project.org/web/views/ReproducibleResearch.html

For larger scale projects, use scripts

Some other good packages for posters/presenting you should be aware of

Overview of Applications

More Help with Shiny Apps