16 Large Projects


As you scale up a project, you will have to be more organized.

16.1 Scripting

Save the following code as MyFirstScript.R

# Define New Function
sum_squared <- function(x1, x2) {
    y <- (x1 + x2)^2
    return(y)
} 

# Test New Function
x <- c(0,1,3,10,6)
sum_squared(x[1], x[3])
sum_squared(x, x[2])
sum_squared(x, x[7])
sum_squared(x, x)

message('Script Completed')
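
The four test calls return different shapes because R arithmetic is vectorized and recycles shorter arguments. The sketch below repeats the function so that it runs on its own and stores each result; the names r1 to r4 are introduced here only for illustration and are not part of the script.

```r
# Same function as in MyFirstScript.R, repeated so this block runs on its own
sum_squared <- function(x1, x2) {
    y <- (x1 + x2)^2
    return(y)
}

x <- c(0, 1, 3, 10, 6)

# r1..r4 are illustrative names, not part of the original script
r1 <- sum_squared(x[1], x[3])  # two scalars: (0 + 3)^2 = 9
r2 <- sum_squared(x, x[2])     # x[2] = 1 is recycled: 1 4 16 121 49
r3 <- sum_squared(x, x[7])     # x has only 5 elements, so x[7] is NA: every result is NA
r4 <- sum_squared(x, x)        # elementwise: (2 * x)^2 = 0 4 36 400 144
```

Note that indexing past the end of a vector returns NA rather than an error, so the third call fails silently.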

Restart RStudio.[8]

Replicate in another tab via

source('MyFirstScript.R')

Note that you may first need to call setwd() so that R knows where you saved your code.[9]
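
As a minimal sketch of that step, the code below changes the working directory, checks that the script is visible, sources it, and then restores the previous directory. The folder is an assumption: tempdir() stands in for wherever you saved the script, and a one-line stand-in script is written there so that the example runs on its own.

```r
# Hypothetical location: tempdir() stands in for the folder holding your script
script_dir <- tempdir()
writeLines("sum_squared <- function(x1, x2) (x1 + x2)^2",
           file.path(script_dir, "MyFirstScript.R"))

# setwd() invisibly returns the previous working directory
old_dir <- setwd(script_dir)
getwd()

# Only source the script if R can see it from the working directory
script_found <- file.exists("MyFirstScript.R")
if (script_found) source("MyFirstScript.R")

# Restore the previous working directory
setwd(old_dir)
```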

After you get this working:

  • add the line print(sum_squared(y, y)) to the bottom of MyFirstScript.R
  • apply the function to vectors specified outside of that script
  • record the session information

# Pass Objects/Functions *to* Script
y <- c(3,1,NA)
source('MyFirstScript.R')

# Pass Objects/Functions *from* Script
z <- sqrt(y)/2
sum_squared(z,z)

# Report all information relevant for replication
sessionInfo()
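
If you want to keep the session information alongside your results, one option is to write it to a text file. This is only a sketch: the file name SessionInfo.txt and the use of tempdir() are assumptions made for illustration.

```r
# Hypothetical file location; in a real project this might sit next to your logs
info_file <- file.path(tempdir(), "SessionInfo.txt")

# capture.output() turns the printed report into a character vector
info_lines <- capture.output(sessionInfo())
writeLines(info_lines, info_file)

# Inspect the start of the saved report
head(readLines(info_file), 3)
```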

Note that you can open a new terminal in RStudio from the top menu bar by clicking ‘Tools > Terminal > New Terminal’.

16.2 Organization

Each large project should have its own Project folder on your computer, containing files, subdirectories, and sub-subdirectories. It should look like this:

Project
    └── README.txt
    └── /Code
        └── MAKEFILE.R
        └── RBLOCK_001_DataClean.R
        └── RBLOCK_002_Figures.R
        └── RBLOCK_003_ModelsTests.R
        └── RBLOCK_004_Robust.R
        └── /Logs
            └── MAKEFILE.Rout
    └── /Data
        └── /Raw
            └── Source1.csv
            └── Source2.shp
            └── Source3.txt
        └── /Clean
            └── AllDatasets.Rdata
            └── MainDataset1.Rds
            └── MainDataset2.csv
    └── /Output
        └── MainFigure.pdf
        └── AppendixFigure.pdf
        └── MainTable.tex
        └── AppendixTable.tex
    └── /Writing
        └── /TermPaper
            └── TermPaper.tex
            └── TermPaper.bib
            └── TermPaper.pdf
        └── /Slides
            └── Slides.Rmd
            └── Slides.html
            └── Slides.pdf
        └── /Poster
            └── Poster.Rmd
            └── Poster.html
            └── Poster.pdf
        └── /Proposal
            └── Proposal.Rmd
            └── Proposal.html
            └── Proposal.pdf

There are two main meta-files:

  • README.txt gives an overview of the project structure and of what each piece of code does.
  • MAKEFILE.R explicitly describes and executes all of the code (and typically logs the output).

MAKEFILE.

If all code is written for the same program (such as R), the makefile can be written in a single language. For us, this looks like:

# Project Structure
home_dir    <- path.expand("~/Desktop/Project/")
data_dir_r  <- paste0(home_dir, "Data/Raw/")
data_dir_c  <- paste0(home_dir, "Data/Clean/")
out_dir     <- paste0(home_dir, "Output/")
code_dir    <- paste0(home_dir, "Code/")

# Execute Codes
# libraries are loaded within each RBLOCK
setwd( code_dir )
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )

# Report all information relevant for replication
sessionInfo()

Notice the many comments (lines starting with #), which are crucial for documenting large projects. Also notice that anyone should be able to replicate the entire project by downloading a zip file and changing only home_dir.
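
Because everything hinges on home_dir, it can help to fail early with a clear message when a folder is missing. The sketch below is one way to do that, not part of the makefile above: check_dir() is a hypothetical helper, tempdir() stands in for the real project location, and file.path() is used as an alternative to paste0() so that trailing slashes do not matter.

```r
# Hypothetical project root; tempdir() stands in for "~/Desktop/Project/"
home_dir <- file.path(tempdir(), "Project")
dir.create(file.path(home_dir, "Code"), recursive = TRUE, showWarnings = FALSE)

# file.path() inserts the separators between folder names
code_dir <- file.path(home_dir, "Code")
out_dir  <- file.path(home_dir, "Output")

# Hypothetical helper: stop with a clear message if a folder is missing
check_dir <- function(path) {
    if (!dir.exists(path)) stop("Folder not found: ", path)
    invisible(path)
}

check_dir(code_dir)   # passes silently because Code/ was created above
```

Calling check_dir(out_dir) here would stop with an error, since Output/ has not been created in this sketch.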

If some folders or files need to be created, you can do this within R:

# list the files and directories
list.files(recursive=TRUE, include.dirs=TRUE)
# create directory called 'Data'
dir.create('Data')
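
The same functions can set up the whole folder structure from Section 16.2 in one pass. This is a sketch under assumptions: the root is a temporary folder (the name ProjectSkeleton is made up here) so that nothing is written to a real project, and only the directories and an empty README.txt are created.

```r
# Hypothetical root: a temporary folder, so your real project is not touched
project_dir <- file.path(tempdir(), "ProjectSkeleton")

# Subdirectories taken from the project layout in Section 16.2
sub_dirs <- c("Code/Logs", "Data/Raw", "Data/Clean", "Output",
              "Writing/TermPaper", "Writing/Slides",
              "Writing/Poster", "Writing/Proposal")

for (d in sub_dirs) {
    # recursive = TRUE also creates any missing parent folders
    dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Create an empty README.txt only if none exists yet
readme <- file.path(project_dir, "README.txt")
if (!file.exists(readme)) file.create(readme)

# List what was created
list.files(project_dir, recursive = TRUE, include.dirs = TRUE)
```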

16.3 Logging/Sinking

When executing the makefile, you can also log the output in three different ways:

  1. Inserting some code into the makefile that “sinks” the output

# Project Structure
home_dir    <- path.expand("~/Desktop/Project/")
data_dir_r  <- paste0(home_dir, "Data/Raw/")
data_dir_c  <- paste0(home_dir, "Data/Clean/")
out_dir     <- paste0(home_dir, "Output/")
code_dir    <- paste0(home_dir, "Code/")

# Log Output
setwd( code_dir )
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)

# Execute Codes
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )
sessionInfo()

# Stop Logging Output
sink()

  2. Starting a session that “sinks” the makefile

sink("MAKEFILE.Rout", append=TRUE, split=TRUE)
source("MAKEFILE.R")
sink()

  3. Executing the makefile via the command line

R CMD BATCH MAKEFILE.R MAKEFILE.Rout
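
One detail worth checking when you use sink() is what actually ends up in the log. By default, sink() diverts printed output only; text produced by message() goes to the error stream and is not written to the file. The sketch below demonstrates this with a temporary log file, whose name and location are assumptions for illustration.

```r
# Hypothetical log location; a real project might use Code/Logs/MAKEFILE.Rout
log_file <- tempfile(fileext = ".Rout")

# split = TRUE writes to the log and still echoes to the console
sink(log_file, split = TRUE)
print("Printed output is logged")
message("Messages are not logged by default")
sink()   # stop logging

# Read the log back to see what was captured
log_lines <- readLines(log_file)
log_lines
```

This matters for scripts like MyFirstScript.R, whose final message('Script Completed') would not appear in a log created this way.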

  [8] Alternatively, clean the workspace by (1) clearing the environment and history (use the broom in the top right panel) and (2) clearing unsaved plots (use the broom in the bottom right panel).

  [9] You can also use the GUI: click ‘Source > Source as a local job’ at the top right.