12 Large Scale Projects
As you scale up to a medium-sized project, however, you will have to be more organized.
Medium-sized projects should have their own Project folder on your computer, with files organized into subdirectories (and sub-subdirectories as needed). It should look like this:
Project
├── README.txt
├── /Code
│   ├── MAKEFILE.R
│   ├── RBLOCK_001_DataClean.R
│   ├── RBLOCK_002_Figures.R
│   ├── RBLOCK_003_ModelsTests.R
│   ├── RBLOCK_004_Robust.R
│   └── /Logs
│       └── MAKEFILE.Rout
├── /Data
│   ├── /Raw
│   │   ├── Source1.csv
│   │   ├── Source2.shp
│   │   └── Source3.txt
│   └── /Clean
│       ├── AllDatasets.Rdata
│       ├── MainDataset1.Rds
│       └── MainDataset2.csv
├── /Output
│   ├── MainFigure.pdf
│   ├── AppendixFigure.pdf
│   ├── MainTable.tex
│   └── AppendixTable.tex
└── /Writing
    ├── /TermPaper
    │   ├── TermPaper.tex
    │   ├── TermPaper.bib
    │   └── TermPaper.pdf
    ├── /Slides
    │   ├── Slides.Rmd
    │   ├── Slides.html
    │   └── Slides.pdf
    ├── /Poster
    │   ├── Poster.Rmd
    │   ├── Poster.html
    │   └── Poster.pdf
    └── /Proposal
        ├── Proposal.Rmd
        ├── Proposal.html
        └── Proposal.pdf
12.1 MAKEFILE
There are two main meta-files:
- README.txt overviews the project structure and what the code does
- MAKEFILE explicitly describes and executes all code
If all code is written in the same program, the makefile can be written in that program's language: MAKEFILE.R, which looks like
### Project Structure
home_dir   <- path.expand("~/Desktop/Project/")

data_dir   <- paste0(home_dir, "Data/")
data_dir_r <- paste0(data_dir, "Raw/")
data_dir_c <- paste0(data_dir, "Clean/")

out_dir    <- paste0(home_dir, "Output/")
code_dir   <- paste0(home_dir, "Code/")

### Execute Codes
### libraries are loaded within each RBLOCK
setwd( code_dir )
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )
If some folders or files do not yet exist, you can create them within R
# create directory called 'Data'
dir.create('Data')
# list the files and directories
list.files(recursive=TRUE, include.dirs=TRUE)
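The same idea extends to the whole skeleton. A minimal sketch (the folder names follow the tree above) that builds the nested structure in one pass:

```r
## Build the nested project skeleton in one pass;
## recursive=TRUE creates missing parent folders,
## showWarnings=FALSE skips folders that already exist
dirs <- c("Code/Logs", "Data/Raw", "Data/Clean", "Output",
          "Writing/TermPaper", "Writing/Slides",
          "Writing/Poster", "Writing/Proposal")
for (d in dirs) dir.create(d, recursive=TRUE, showWarnings=FALSE)
```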
12.2 Logging/Sinking
You can then execute the makefile within R and log the output in one of three ways.
- Inserting some code that logs/sinks the output
### Project Structure
home_dir   <- path.expand("~/Desktop/Project/")

data_dir   <- paste0(home_dir, "Data/")
data_dir_r <- paste0(data_dir, "Raw/")
data_dir_c <- paste0(data_dir, "Clean/")

out_dir    <- paste0(home_dir, "Output/")
code_dir   <- paste0(home_dir, "Code/")

### Log Output
setwd( code_dir )
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)

### Execute Codes
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )

### Stop Logging Output
sink()
- Starting a session that logs/sinks while you source the makefile
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)
source("MAKEFILE.R")
sink()
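If a sourced script errors before the final sink(), the diversion stays open and later console output keeps going to the file. A defensive sketch using base R's sink.number():

```r
## Close any output diversions left open by a failed run
while (sink.number() > 0) sink()
```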
- Executing the makefile via the command line
R CMD BATCH MAKEFILE.R MAKEFILE.Rout
12.3 Class Projects
Zip your project into a single file that is easy for others to identify: Class_Project_LASTNAME_FIRSTNAME.zip
Your code should be readable and error-free. For code-writing guides, see
- https://google.github.io/styleguide/Rguide.html
- https://style.tidyverse.org/
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/codestyle.html
- http://adv-r.had.co.nz/Style.html
- https://www.burns-stat.com/pages/Tutor/R_inferno.pdf
For organization guidelines, see
- https://guides.lib.berkeley.edu/c.php?g=652220&p=4575532
- https://kbroman.org/steps2rr/pages/organize.html
- https://drivendata.github.io/cookiecutter-data-science/
- https://ecorepsci.github.io/reproducible-science/project-organization.html
For additional logging capabilities, see https://cran.r-project.org/web/packages/logr/
For very large projects, there are many more tools available at https://cran.r-project.org/web/views/ReproducibleResearch.html
For larger-scale projects, use scripts.
12.4 Debugging
We can use the following packages to help deal with various problems that may arise:
library(profvis)
library(bench)
library(parallel)
library(Rcpp)
library(compiler)
(Note that many of the examples are taken from https://adv-r.hadley.nz/).
Problems print to the console
message("This is what a message looks like")
## This is what a message looks like
warning("This is what a warning looks like")
## Warning: This is what a warning looks like
stop("This is what an error looks like")
## Error in eval(expr, envir, enclos): This is what an error looks like
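While debugging, it can help to treat warnings as errors so they stop execution at the point they arise. A small sketch using the base option `warn`:

```r
## Promote warnings to errors while debugging
options(warn = 2)   # any warning now throws an error
## log(-1)          # would now stop instead of merely warning
options(warn = 0)   # restore the default behavior
```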
Non-problems also print to the console
cat('cat\n')
## cat
print('print')
## [1] "print"
12.4.1 Tracing Errors
First, run garbage collection to "clean up" unused memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1209805 64.7 2358952 126 2358952 126.0
## Vcells 2294794 17.6 8388608 64 3570923 27.3
Then trace the error process.
Example of an error process

## Let i() check if its argument is numeric
i <- function(i0) {
    if ( !is.numeric(i0) ) {
        stop("`d` must be numeric", call.=FALSE)
    }
    i0 + 10
}

## Let f() call g() call h() call i()
h <- function(i0) i(i0)
g <- function(h0) h(h0)
f <- function(g0) g(g0)

## Observe Error
f("a")
## Error: `d` must be numeric
Traceback debugging
traceback()
## No traceback available
Simple print debugging

f2 <- function(g0) {
    cat("f2 calls g2()\n")
    g2(g0)
}
g2 <- function(h0) {
    cat("g2 calls h2() \n")
    cat("b =", h0, "\n")
    h2(h0)
}
h2 <- function(i0) {
    cat("h2 call i() \n")
    i(i0)
}
f2("a")
## f2 calls g2()
## g2 calls h2()
## b = a
## h2 call i()
## Error: `d` must be numeric
Interactive approach

g3 <- function(h0) {
    browser()
    h(h0)
}
f3 <- function(g0){
    g3(g0)
}
f3("a")
## Called from: g3(g0)
## debug at <text>#3: h(h0)
## Error: `d` must be numeric
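Instead of editing browser() into a function's source, you can flag it with debug(), which opens the browser every time the function is called:

```r
## Flag h() for step-through debugging (h is from the error-process example)
h <- function(h0) h0 + 10
debug(h)
## h(1)     # now pauses inside h(); type n/c/Q to step/continue/quit
undebug(h)  # remove the flag when done
```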
12.4.2 Checking Problems
To inspect objects
is.object(f)
is.object(c(1,1))
class(f)
class(c(1,1))
## Storage Mode Type
typeof(f)
typeof(c(1,1))
storage.mode(f)
storage.mode(c(1,1))
To check for valid inputs/outputs
x <- c(NA, NULL, NaN, Inf, 0)
cat("Vector to inspect: ")
x
cat("NA: ")
is.na(x)
cat("NULL: ")
is.null(x)
cat("NaN: ")
is.nan(x)
cat("Finite: ")
is.finite(x)
cat("Infinite: ")
is.infinite(x)
## Many others
To check for values
all( x > -2 )
any( x > -2 )
## Check Matrix Rows
rowAny <- function(x) rowSums(x) > 0
rowAll <- function(x) rowSums(x) == ncol(x)
12.5 Being Proactive
Suppressing errors is possible but a bad idea
try(1+2, silent=T)
## [1] 3
try(warning('warning'), silent=T)
## Warning in doTryCatch(return(expr), name, parentenv, handler): warning
try(error('error'), silent=T)
try(1+2, silent=F)
## [1] 3
try(warning('warning'), silent=F)
## Warning in doTryCatch(return(expr), name, parentenv, handler): warning
try(error('error'), silent=F)
## Error in error("error") : could not find function "error"
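If you do use try(), at least check its result so failures are handled rather than silently ignored:

```r
## try() returns an object of class "try-error" on failure
res <- try(log("a"), silent=TRUE)
if (inherits(res, "try-error")) {
    message("log() failed; falling back to NA")
    res <- NA
}
```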
Try to handle errors
tryCatch(
    error = function(e) {
        # code to run when error is thrown
    },
    code_to_run_while_handlers_are_active
)
Simple Example
tryCatch(
    expr = {
        message( log(-Inf) )
        message("Successfully executed the log(x) call.")
    },
    error = function(e){
        message('Caught an error!')
        print(e)
    },
    warning = function(w){
        message('Caught an warning!')
        print(w)
    },
    finally = {
        message('All done, quitting.')
    }
)
## Caught an warning!
## <simpleWarning in log(-Inf): NaNs produced>
## All done, quitting.
Safe Functions
## Define
log_safe <- function(x){
    tryCatch(
        expr = {
            message( log(x) )
            message("Successfully executed the log(x) call.")
        },
        error = function(e){
            message('Caught an error!')
            print(e)
        },
        warning = function(w){
            message('Caught an warning!')
            print(w)
        },
        finally = {
            message('All done, quitting.')
        }
    )
}
## Test
log_safe( 10)
## 2.30258509299405
## Successfully executed the log(x) call.
## All done, quitting.
log_safe(-10)
## Caught an warning!
## <simpleWarning in log(x): NaNs produced>
## All done, quitting.
log_safe(' ')
## Caught an error!
## <simpleError in log(x): non-numeric argument to mathematical function>
## All done, quitting.
12.6 Optimizing
In general: clean code is often faster and less error-prone.
Repetitive tasks can be optimized. You end up with code that
- is cleaner, faster, and more general
- can be easily parallelized
Computers have big memories and are really good at math.
- First try vectors
- Then try apply functions
Don't waste time on code that is not holding you back.
- Your code may break, be slow, or be incorrect.
- Look at what has already been done.
12.6.1 Benchmarking
The simplest approach
system.time({
    x <- runif(1e5)
    sqrt(x)
})
## user system elapsed
## 0.006 0.000 0.006
For identifying bottlenecks
## Generate Large Random Dataset
n <- 2e6
x <- runif(n)
y <- runif(n)
z <- runif(n)
XYZ <- cbind(x,y,z)

## Inspect 4 equivalent `row mean` calculations
profvis::profvis({
    m <- rowSums(XYZ)/ncol(XYZ)
    m <- rowMeans(XYZ)
    m <- apply(XYZ, 1, mean)
    m <- rep(NA, n); for(i in 1:n){ m[i] <- (x[i] + y[i] + z[i]) / 3 }
})
## rowSums(), colSums(), rowMeans(), and colMeans() are vectorised and fast.
## for loop is not the slowest, but the ugliest.
For easily comparing solutions to specific bottlenecks
## 3 Equivalent calculations of the mean of a vector
mean1 <- function(x,p=1) mean(x^p)
mean2 <- function(x,p=1) sum(x^p) / length(x)
mean3 <- function(x,p=1) mean.default(x^p)

## Time them
x <- runif(1e6)
bench::mark(
    mean1(x,.5),
    mean2(x,.5),
    mean3(x,.5)
)
## # A tibble: 3 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 mean1(x, 0.5) 39.2ms 42.1ms 19.5 7.63MB 0
## 2 mean2(x, 0.5) 39.2ms 40.9ms 23.4 7.63MB 2.13
## 3 mean3(x, 0.5) 36.3ms 40.3ms 25.2 7.63MB 0
Check for easy speed-ups before creating your own program
X <- cbind(1, runif(1e6))
Y <- X %*% c(1,2) + rnorm(1e6)
DAT <- as.data.frame(cbind(Y,X))

system.time({ .lm.fit(X, Y) })
## user system elapsed
## 0.136 0.000 0.044
system.time({ lm(Y~X, data=DAT) })
## user system elapsed
## 0.523 0.000 0.194
## But note that quicker codes
## tend to have fewer checks
## and return less information
Vectors are generally faster and easier to read than loops
y <- runif(1e6)

## Compare 2 moving averages

## First Try
ma1 <- function(y){
    z <- y*NA
    for(i in 2:length(y)){
        z[i] <- (y[i]-y[i-1])/2
    }
    return(z)
}

# Optimized using diff
diff( c(2,2,10,9) )
## [1] 0 8 -1

ma2 <- function(y){
    z2 <- diff(y)/2
    z2 <- c(NA, z2)
    return(z2)
}

all.equal(ma1(y), ma2(y))
## [1] TRUE

bench::mark(
    ma1(y),
    ma2(y)
)
## # A tibble: 2 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 ma1(y) 279ms 279ms 3.58 15.3MB 3.58
## 2 ma2(y) 28.8ms 29ms 34.4 91.6MB 207.
12.6.2 Bottlenecks
Sometimes there will still be a problematic bottleneck.
Your next step should be parallelism:
- Write the function as a general vectorized function.
- Apply the same function to every element in a list at the same time
## lapply in parallel on {m}ultiple {c}ores
x <- c(10,20,30,40,50)
f <- function(element) { element^element }
parallel::mclapply( x, mc.cores=2, FUN=f )
## [[1]]
## [1] 1e+10
##
## [[2]]
## [1] 1.048576e+26
##
## [[3]]
## [1] 2.058911e+44
##
## [[4]]
## [1] 1.208926e+64
##
## [[5]]
## [1] 8.881784e+84
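Note that mclapply() relies on forking, which is unavailable on Windows (there mc.cores must be 1). A portable alternative, sketched here, is a socket cluster:

```r
## Socket clusters work on all platforms, including Windows
x <- c(10,20,30,40,50)
f <- function(element) { element^element }
cl <- parallel::makeCluster(2)
parallel::parLapply(cl, x, f)   # same result as mclapply() above
parallel::stopCluster(cl)
```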
More power is often not the solution
## vectorize and compile
e_power_e_fun <- compiler::cmpfun( function(vector){ vector^vector } )

## base R
x <- 0:1E6
s_vc <- system.time( e_power_e_vec <- e_power_e_fun(x) )
s_vc
## user system elapsed
## 0.037 0.001 0.038
## brute power
x <- 0:1E6
s_bp <- system.time({
    e_power_e_mc <- unlist( parallel::mclapply(x, mc.cores=2, FUN=e_power_e_fun) )
})
s_bp
## user system elapsed
## 1.330 0.364 0.962
## Same results
all(e_power_e_vec==e_power_e_mc)
## [1] TRUE
Parallelism does not work well with a GUI.
For identifying bottlenecks on a cluster without a GUI, try Rprof()
Rprof( interval = 0.005)
# Create Data
x <- runif(2e6)
y <- sqrt(x)
## Loop Format Data
z <- y*NA
for(i in 2:length(y)){ z[i] <- (y[i]-y[i-1])/2 }
## Regression
X <- cbind(1,x)[-1,]
Z <- z[-1]
reg_fast <- .lm.fit(X, Z)
Rprof(NULL)
summaryRprof()
If you are still stuck, you can
- try Amazon Web Services for more brute power
- rewrite bottlenecks in C++ or Fortran (you will need a working compiler).
12.6.2.0.1 Compiled Code
To get C++
- On Windows, install Rtools.
- On Mac, install Xcode from the app store.
- On Linux, sudo apt-get install r-base-dev or similar.
To call C++ from R use package Rcpp
Rcpp::cppFunction('
    int add(int x, int y, int z) {
        int sum = x + y + z;
        return sum;
    }'
)
add(1, 2, 3)
## [1] 6
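As a sketch of a more realistic use, the moving-average loop from the benchmarking section could be ported to C++ (the function name ma_cpp is illustrative, not from the original text):

```r
## Hypothetical C++ port of the ma1() loop, via Rcpp
Rcpp::cppFunction('
NumericVector ma_cpp(NumericVector y) {
    int n = y.size();
    NumericVector z(n, NA_REAL);              // start filled with NA, like y*NA
    for (int i = 1; i < n; i++) {
        z[i] = (y[i] - y[i-1]) / 2;           // same update as the R loop
    }
    return z;
}')
## ma_cpp(runif(1e6))  # same output as ma1(), without the R-loop overhead
```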
For help getting started with Rcpp, see https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-quickref.pdf
First try to use C++ (or Fortran) code that others have written, via .C() and .Fortran().
For a tutorial, see https://masuday.github.io/fortran_tutorial/r.html
12.7 Further Programming Guidance
- https://rmd4sci.njtierney.com/
- https://smac-group.github.io/ds/high-performance-computing.html
- https://www.stat.umn.edu/geyer/3701/notes/arithmetic.Rmd
For debugging tips
- https://cran.r-project.org/doc/manuals/R-lang.html#Debugging
- https://cran.r-project.org/doc/manuals/R-exts.html#Debugging
- https://adv-r.hadley.nz/debugging.html
- https://adv-r.hadley.nz/conditions.html
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/debugging.html
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/functions.html
For optimization tips
- https://cran.r-project.org/doc/manuals/R-exts.html#Tidying-and-profiling-R-code
- https://cran.r-project.org/doc/manuals/R-lang.html#Exception-handling
- https://adv-r.hadley.nz/perf-measure.html
- https://adv-r.hadley.nz/perf-improve.html
- https://cran.r-project.org/doc/manuals/R-exts.html#System-and-foreign-language-interfaces
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/profiling.html
- https://adv-r.hadley.nz/rcpp.html
For parallel programming
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html
- https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
- https://grantmcdermott.com/ds4e/parallel.html
- https://psu-psychology.github.io/r-bootcamp-2018/talks/parallel_r.html
For general tips
- https://github.com/christophergandrud/Rep-Res-Book
- Efficient R programming. C. Gillespie R. Lovelace. 2021. https://csgillespie.github.io/efficientR/
- Data Science at the Command Line, 1e. Janssens J. 2020. https://www.datascienceatthecommandline.com/1e/
- R Programming for Data Science. Peng R. 2020. https://bookdown.org/rdpeng/rprogdatascience/
- Advanced R. H. Wickham 2019. https://adv-r.hadley.nz/
- Econometrics in R. Grant Farnsworth. 2008. http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
- The R Inferno. https://www.burns-stat.com/documents/books/the-r-inferno/