16 Large Projects
As a project scales up, you will need to be more organized.
16.1 Scripting
Save the following code as MyFirstScript.R
# Define New Function
sum_squared <- function(x1, x2) {
y <- (x1 + x2)^2
return(y)
}
# Test New Function
x <- c(0,1,3,10,6)
sum_squared(x[1], x[3])
sum_squared(x, x[2])
sum_squared(x, x[7])
sum_squared(x, x)
message('Script Completed')
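The four test calls return different shapes because arithmetic in R is vectorized and recycles the shorter argument, and because indexing past the end of a vector yields NA. A quick sketch of what to expect (the res_* object names are only for illustration and are not part of the script):

```r
# Same function as in MyFirstScript.R
sum_squared <- function(x1, x2) {
    y <- (x1 + x2)^2
    return(y)
}
x <- c(0, 1, 3, 10, 6)

# Two scalars: (0 + 3)^2
res_scalar <- sum_squared(x[1], x[3])   # 9

# Vector plus scalar: x[2] = 1 is recycled across all of x
res_recycle <- sum_squared(x, x[2])     # 1 4 16 121 49

# x has only 5 elements, so x[7] is NA and every result is NA
res_na <- sum_squared(x, x[7])          # NA NA NA NA NA

# Two vectors of equal length: elementwise (2 * x)^2
res_vector <- sum_squared(x, x)         # 0 4 36 400 144
```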
Restart RStudio.
Replicate in another tab via
source('MyFirstScript.R')
Note that you may first need to call setwd() so your computer knows where you saved your code.
After you get this working:
- add the line
print(sum_squared(y, y))
to the bottom of MyFirstScript.R
- apply the function to a vector specified outside of that script
- record the session information
# Pass Objects/Functions *to* Script
y <- c(3,1,NA)
source('MyFirstScript.R')
# Pass Objects/Functions *from* Script
z <- sqrt(y)/2
sum_squared(z,z)
# Report all information relevant for replication
sessionInfo()
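The round trip above can be tried without touching your own folders. The sketch below is only an illustration under stated assumptions: it writes a stripped-down stand-in for the script to a temporary directory, uses setwd() to move there, and sources it. The file name sessionInfo.txt is hypothetical.

```r
# Work in a temporary directory so nothing on disk is overwritten
old_dir <- setwd(tempdir())

# A stripped-down stand-in for MyFirstScript.R
writeLines(c(
    "sum_squared <- function(x1, x2) {",
    "    y <- (x1 + x2)^2",
    "    return(y)",
    "}",
    "print(sum_squared(y, y))",
    "message('Script Completed')"
), "MyFirstScript.R")

# Pass objects *to* the script: y must exist before source() runs
y <- c(3, 1, NA)
source("MyFirstScript.R")

# Pass functions *from* the script: sum_squared now exists in this session
z <- sqrt(y) / 2
res <- sum_squared(z, z)

# Record the session information in a text file (hypothetical file name)
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

# Return to the original working directory
setwd(old_dir)
```

If y is not defined before source() is called, the print() line inside the script fails, which is a quick way to see that the script reads objects from the calling session.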
Note that you can open a new terminal in RStudio from the top bar by clicking ‘Tools > Terminal > New Terminal’.
16.2 Organization
Large projects should have their own Project folder on your computer, containing files, subdirectories with files, and sub-subdirectories with files. It should look like this:
Project
    └── README.txt
    └── /Code
        └── MAKEFILE.R
        └── RBLOCK_001_DataClean.R
        └── RBLOCK_002_Figures.R
        └── RBLOCK_003_ModelsTests.R
        └── RBLOCK_004_Robust.R
        └── /Logs
            └── MAKEFILE.Rout
    └── /Data
        └── /Raw
            └── Source1.csv
            └── Source2.shp
            └── Source3.txt
        └── /Clean
            └── AllDatasets.Rdata
            └── MainDataset1.Rds
            └── MainDataset2.csv
    └── /Output
        └── MainFigure.pdf
        └── AppendixFigure.pdf
        └── MainTable.tex
        └── AppendixTable.tex
    └── /Writing
        └── /TermPaper
            └── TermPaper.tex
            └── TermPaper.bib
            └── TermPaper.pdf
        └── /Slides
            └── Slides.Rmd
            └── Slides.html
            └── Slides.pdf
        └── /Poster
            └── Poster.Rmd
            └── Poster.html
            └── Poster.pdf
        └── /Proposal
            └── Proposal.Rmd
            └── Proposal.html
            └── Proposal.pdf
There are two main meta-files:
- README.txt gives an overview of the project structure and what the code is doing.
- MAKEFILE explicitly describes and executes all code (and typically logs the output).
MAKEFILE.
If all code is written with the same program (such as R) the makefile can be written in a single language. For us, this looks like
# Project Structure
home_dir <- path.expand("~/Desktop/Project/")
data_dir_r <- paste0(home_dir, "Data/Raw/")
data_dir_c <- paste0(home_dir, "Data/Clean/")
out_dir <- paste0(home_dir, "Output/")
code_dir <- paste0(home_dir, "Code/")
# Execute Codes
# libraries are loaded within each RBLOCK
setwd( code_dir )
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )
# Report all information relevant for replication
sessionInfo()
Notice there is a lot of documentation # like this, which is crucial for large projects. Also notice that anyone should be able to replicate the entire project by downloading a zip file and simply changing home_dir.
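Because the code blocks are numbered, one alternative to writing out each source() call is to discover the blocks by name. This is a sketch of that design choice, not the makefile used in this chapter; the temporary folder and the two dummy blocks are stand-ins so the example runs on its own.

```r
# Stand-in code directory with two dummy blocks (illustration only)
code_dir <- file.path(tempdir(), "Project", "Code")
dir.create(code_dir, recursive = TRUE, showWarnings = FALSE)
writeLines("block_1_done <- TRUE",
           file.path(code_dir, "RBLOCK_001_DataClean.R"))
writeLines("block_2_done <- block_1_done",
           file.path(code_dir, "RBLOCK_002_Figures.R"))

# Find all blocks; sort() makes the zero-padded run order explicit
blocks <- sort(list.files(code_dir, pattern = "^RBLOCK_[0-9]+_.*\\.R$",
                          full.names = TRUE))

# Execute each block in turn, reporting progress
for (b in blocks) {
    message("Running ", basename(b))
    source(b)
}
```

Listing each block explicitly, as the makefile does, documents the intended order more clearly; discovery by pattern mainly saves typing when there are many blocks.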
If some folders or files need to be created, you can do this within R.
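A minimal sketch of creating folders and files from within R, using the base functions dir.create() and file.create(). To keep the example harmless, home_dir here points to a temporary directory rather than the ~/Desktop/Project/ location used in the makefile; that substitution, and the choice of which folders to create, are assumptions for illustration.

```r
# Illustrative project root (a temporary folder, not the real project)
home_dir <- file.path(tempdir(), "Project")

# A few subdirectories from the project layout
sub_dirs <- c("Code", "Data/Raw", "Data/Clean", "Output", "Writing")

for (d in sub_dirs) {
    # recursive = TRUE also creates missing parent folders;
    # showWarnings = FALSE stays quiet if the folder already exists
    dir.create(file.path(home_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Create an empty README if one does not exist yet
readme <- file.path(home_dir, "README.txt")
if (!file.exists(readme)) file.create(readme)

# List everything that now exists under the project root
list.files(home_dir, recursive = TRUE, include.dirs = TRUE)
```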
16.3 Logging/Sinking
When executing the makefile, you can also log the output in three different ways:
- Inserting some code into the makefile that “sinks” the output
# Project Structure
home_dir <- path.expand("~/Desktop/Project/")
data_dir_r <- paste0(home_dir, "Data/Raw/")
data_dir_c <- paste0(home_dir, "Data/Clean/")
out_dir <- paste0(home_dir, "Output/")
code_dir <- paste0(home_dir, "Code/")
# Log Output
setwd( code_dir )
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)
# Execute Codes
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )
sessionInfo()
# Stop Logging Output
sink()
- Starting a session that “sinks” the makefile
- Executing the makefile via the command line
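The second and third options can be sketched as follows. This assumes the working directory is the folder containing MAKEFILE.R; the example writes a one-line stand-in makefile to a temporary folder so that it runs on its own. The third option is run from a terminal rather than from R, so it appears only as a comment.

```r
# Stand-in for the Code folder and MAKEFILE.R (illustration only)
old_dir <- setwd(tempdir())
writeLines("print('Makefile executed')", "MAKEFILE.R")

# Option 2: start logging, run the makefile, stop logging
sink("MAKEFILE.Rout", append = TRUE, split = TRUE)
source("MAKEFILE.R")
sink()

# Inspect what was logged, then return to the original directory
log_lines <- readLines("MAKEFILE.Rout")
setwd(old_dir)

# Option 3: from a terminal opened in the folder with MAKEFILE.R, run
#   R CMD BATCH MAKEFILE.R MAKEFILE.Rout
# which executes MAKEFILE.R and writes its commands and output to MAKEFILE.Rout
```

By default sink() diverts printed output only; messages and warnings are sent to a separate stream, so they will not appear in a log created this way.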
16.4 Class Projects
Zip your project into a single file that is easy for others to identify: Class_Project_LASTNAME_FIRSTNAME.zip
Your code should be readable and error-free. For code writing guides, see
- https://google.github.io/styleguide/Rguide.html
- https://style.tidyverse.org/
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/codestyle.html
- http://adv-r.had.co.nz/Style.html
- https://www.burns-stat.com/pages/Tutor/R_inferno.pdf
For organization guidelines, see
- https://guides.lib.berkeley.edu/c.php?g=652220&p=4575532
- https://kbroman.org/steps2rr/pages/organize.html
- https://drivendata.github.io/cookiecutter-data-science/
- https://ecorepsci.github.io/reproducible-science/project-organization.html
For additional logging capabilities, see https://cran.r-project.org/web/packages/logr/
For very large projects, there are many more tools available at https://cran.r-project.org/web/views/ReproducibleResearch.html
For larger scale projects, use scripts