12 Large Scale Projects


As you scale up to a medium sized project, however, you will have to be more organized.

Medium sized projects should have their own Project folder on your computer with files, subdirectories with files, and subsubdirectories with files. It should look like this

Project
    └── README.txt
    └── /Code
        └── MAKEFILE.R
        └── RBLOCK_001_DataClean.R
        └── RBLOCK_002_Figures.R
        └── RBLOCK_003_ModelsTests.R
        └── RBLOCK_004_Robust.R
        └── /Logs
            └── MAKEFILE.Rout
    └── /Data
        └── /Raw
            └── Source1.csv
            └── Source2.shp
            └── Source3.txt
        └── /Clean
            └── AllDatasets.Rdata
            └── MainDataset1.Rds
            └── MainDataset2.csv
    └── /Output
        └── MainFigure.pdf
        └── AppendixFigure.pdf
        └── MainTable.tex
        └── AppendixTable.tex
    └── /Writing
        └── /TermPaper
            └── TermPaper.tex
            └── TermPaper.bib
            └── TermPaper.pdf
        └── /Slides
            └── Slides.Rmd
            └── Slides.html
            └── Slides.pdf
        └── /Poster
            └── Poster.Rmd
            └── Poster.html
            └── Poster.pdf
        └── /Proposal
            └── Proposal.Rmd
            └── Proposal.html
            └── Proposal.pdf

12.1 MAKEFILE

There are two main meta-files

  • README.txt overviews the project structure and what the codes are doing
  • MAKEFILE explicitly describes and executes all codes.

If all code is written with the same program, the makefile can be written in that programs code: MAKEFILE.R, which looks like

### Project Structure
home_dir    <- path.expand("~/Desktop/Project/")

data_dir    <- paste0(home_dir, "Data/")
data_dir_r  <- paste0(data_dir, "Raw/")
data_dir_c  <- paste0(data_dir, "Clean/")

out_dir  <- paste0(hdir, "Output/")

code_dir  <- paste0(hdir, "Code/")


### Execute Codes
### libraries are loaded within each RBLOCK
set.wd( code_dir )
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )

If some folders or files are not created, you can do this within R

# create directory called 'Data'
dir.create('Data')

# list the files and directories
list.files(recursive=TRUE, include.dirs=TRUE)

12.2 Logging/Sinking

You can then execute the makefile within R and log the output. Either by

  1. Inserting some code that logs/sinks the output
### Project Structure
home_dir    <- path.expand("~/Desktop/Project/")

data_dir    <- paste0(home_dir, "Data/")
data_dir_r  <- paste0(data_dir, "Raw/")
data_dir_c  <- paste0(data_dir, "Clean/")

out_dir  <- paste0(hdir, "Output/")

code_dir  <- paste0(hdir, "Code/")

### Log Output
set.wd( code_dir )
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)

### Execute Codes
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )

### Stop Logging Output
sink()
  1. Starting a session that logs/sinks you sourcing the makefile
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)
source("MAKEFILE.R")
sink()
  1. Execute the makefile via the commandline
R CMD BATCH MAKEFILE.R MAKEFILE.Rout

12.4 Debugging

We can use the following packages to help deal with various problems that may arise

library(profvis)
library(bench)
library(parallel)
library(Rcpp)
library(compiler)

(Note that many of the examples are taken from https://adv-r.hadley.nz/).

Problems print to the console

message("This is what a message looks like")
## This is what a message looks like
warning("This is what a warning looks like")
## Warning: This is what a warning looks like
stop("This is what an error looks like")
## Error in eval(expr, envir, enclos): This is what an error looks like

Nonproblems also print to the console

cat('cat\n')
## cat
print('print')
## [1] "print"

12.4.1 Tracing Errors

First, use garbage cleanup to “cleanup” memory leaks

gc()
##           used (Mb) gc trigger (Mb) max used  (Mb)
## Ncells 1209805 64.7    2358952  126  2358952 126.0
## Vcells 2294794 17.6    8388608   64  3570923  27.3

Then trace the error process.

Example of error process

## Let i() check if its argument is numeric
i <- function(i0) {
  if ( !is.numeric(i0) ) {
    stop("`d` must be numeric", call.=FALSE)
  }
  i0 + 10
}

## Let f() call g() call h() call i()
h <- function(i0) i(i0)
g <- function(h0) h(h0)
f <- function(g0) g(g0)

## Observe Error
f("a")
## Error: `d` must be numeric

Traceback debugging

traceback()
## No traceback available

Simple print debugging

f2 <- function(g0) {
  cat("f2 calls g2()\n")
  g2(g0)
}
g2 <- function(h0) {
  cat("g2 calls h2() \n")
  cat("b =", h0, "\n")
  h2(h0)
}
h2 <- function(i0) {
  cat("h2 call i() \n")
  i(i0)
}

f2("a")
## f2 calls g2()
## g2 calls h2() 
## b = a 
## h2 call i()
## Error: `d` must be numeric

Interactive approach

g3 <- function(h0) {
  browser()
  h(h0)
}
f3 <- function(g0){
  g3(g0)
}
f3("a")
## Called from: g3(g0)
## debug at <text>#3: h(h0)
## Error: `d` must be numeric

12.4.2 Checking problems

To inspect objects

is.object(f)
is.object(c(1,1))

class(f)
class(c(1,1))

## Storage Mode Type 
typeof(f)
typeof(c(1,1))

storage.mode(f)
storage.mode(c(1,1))

To check for valid inputs/outputs

x <- c(NA, NULL, NaN, Inf, 0)

cat("Vector to inspect: ")
x

cat("NA: ")
is.na(x)

cat("NULL: ")
is.null(x)

cat("NaN: ")
is.nan(x)

cat("Finite: ")
is.finite(x)

cat("Infinite: ")
is.infinite(x)
## Many others

To check for values

all( x > -2 )
any( x > -2 )
## Check Matrix Rows
rowAny <- function(x) rowSums(x) > 0
rowAll <- function(x) rowSums(x) == ncol(x)

12.5 Being Proactive

Supressing errors is possible but a bad idea

try(1+2, silent=T)
## [1] 3
try(warning('warning'), silent=T)
## Warning in doTryCatch(return(expr), name, parentenv, handler): warning
try(error('error'), silent=T)
try(1+2, silent=F)
## [1] 3
try(warning('warning'), silent=F)
## Warning in doTryCatch(return(expr), name, parentenv, handler): warning
try(error('error'), silent=F)
## Error in error("error") : could not find function "error"

Try to handle errors

tryCatch(
  error = function(e) {
    # code to run when error is thrown
  },
  code_to_run_while_handlers_are_active
)

Simple Example

tryCatch(
    expr = {
        message( log(-Inf) )
        message("Successfully executed the log(x) call.")
    },
    error = function(e){
        message('Caught an error!')
        print(e)
    },
    warning = function(w){
        message('Caught an warning!')
        print(w)
    },
    finally = {
        message('All done, quitting.')
    }
)
## Caught an warning!
## <simpleWarning in log(-Inf): NaNs produced>
## All done, quitting.

Safe Functions

## Define
log_safe <- function(x){
    tryCatch(
        expr = {
            message( log(x) )
            message("Successfully executed the log(x) call.")
        },
        error = function(e){
            message('Caught an error!')
            print(e)
        },
        warning = function(w){
            message('Caught an warning!')
            print(w)
        },
        finally = {
            message('All done, quitting.')
        }
    )
}

## Test 
log_safe( 10)
## 2.30258509299405
## Successfully executed the log(x) call.
## All done, quitting.
log_safe(-10)
## Caught an warning!
## <simpleWarning in log(x): NaNs produced>
## All done, quitting.
log_safe(' ')
## Caught an error!
## <simpleError in log(x): non-numeric argument to mathematical function>
## All done, quitting.

12.6 Optimizing

In General: Clean code is often faster and less error prone

Repetitive tasks can be optimized You end up with code that

  • is cleaner, faster, and more general
  • can be easily parallelized

Computers have big memories and are really good at math.

  • First try vectors
  • then try apply functions

Don’t waste time on code that is not holding you back.

  • Your code may break, be slow, or incorrect.
  • Look at what has already done.

12.6.1 Benchmarking

The simplest approach

system.time({
    x <- runif(1e5)
    sqrt(x)
})
##    user  system elapsed 
##   0.006   0.000   0.006

For identifying bottlenecks

## Generate Large Random Dataset
n <- 2e6
x <- runif(n)
y <- runif(n)
z <- runif(n)
XYZ <- cbind(x,y,z)

## Inspect 4 equivalent `row mean` calculations 
profvis::profvis({
    m <- rowSums(XYZ)/ncol(XYZ)
    m <- rowMeans(XYZ)
    m <- apply(XYZ, 1, mean)
    m <- rep(NA, n);  for(i in 1:n){ m[i] <- (x[i] + y[i] + z[i]) / 3 }
})
## rowSums(), colSums(), rowMeans(), and colMeans() are vectorised and fast.
## for loop is not the slowest, but the ugliest.

For easily comparing solutions to specific bottlenecks

## 3 Equivalent calculations of the mean of a vector
mean1 <- function(x,p=1) mean(x^p)
mean2 <- function(x,p=1) sum(x^p) / length(x)
mean3 <- function(x,p=1) mean.default(x^p)

## Time them
x <- runif(1e6)
bench::mark(
  mean1(x,.5),
  mean2(x,.5),
  mean3(x,.5)
)
## # A tibble: 3 × 6
##   expression         min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 mean1(x, 0.5)   39.2ms   42.1ms      19.5    7.63MB     0   
## 2 mean2(x, 0.5)   39.2ms   40.9ms      23.4    7.63MB     2.13
## 3 mean3(x, 0.5)   36.3ms   40.3ms      25.2    7.63MB     0

Check for easy speed-ups before creating your own program

X <- cbind(1, runif(1e6))
Y <- X %*% c(1,2) + rnorm(1e6)
DAT <- as.data.frame(cbind(Y,X))

system.time({.lm.fit(X, Y) })
##    user  system elapsed 
##   0.136   0.000   0.044
system.time({ lm(Y~X, data=DAT) })
##    user  system elapsed 
##   0.523   0.000   0.194
## But note that quicker codes 
## tend to have fewer checks
## and return less information

Vectors are generally faster and easier to read than loops

x <- runif(1e6)

## Compare 2 moving averages

## First Try
ma1 <- function(y){
    z <- y*NA
    for(i in 2:length(y)){
        z[i] <- (y[i]-y[i-1])/2
    }
    return(z)
}

# Optimized using diff
diff( c(2,2,10,9) )
## [1]  0  8 -1
ma2 <- function(y){
    z2 <- diff(y)/2
    z2 <- c(NA, z2) 
    return(z2)
}

all.equal(ma1(y),ma2(y))
## [1] TRUE
bench::mark(
  ma1(y),
  ma2(y)
)
## # A tibble: 2 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 ma1(y)        279ms    279ms      3.58    15.3MB     3.58
## 2 ma2(y)       28.8ms     29ms     34.4     91.6MB   207.

12.6.2 Bottlenecks

Sometimes there will still be a problematic bottleneck.

Your next step should be parallelism:

  • Write the function as a general vectorized function.
  • Apply the same function to every element in a list at the same time
## lapply in parallel on {m}ultiple {c}ores
x <- c(10,20,30,40,50)
f <- function(element) { element^element } 
parallel::mclapply( x, mc.cores=2, FUN=f)
## [[1]]
## [1] 1e+10
## 
## [[2]]
## [1] 1.048576e+26
## 
## [[3]]
## [1] 2.058911e+44
## 
## [[4]]
## [1] 1.208926e+64
## 
## [[5]]
## [1] 8.881784e+84

More power is often not the solution

## vectorize and compile
e_power_e_fun <- compiler::cmpfun( function(vector){ vector^vector} )

## base R
x <- 0:1E6
s_vc <- system.time( e_power_e_vec <- e_power_e_fun(x) )
s_vc
##    user  system elapsed 
##   0.037   0.001   0.038
## brute power
x <- 0:1E6
s_bp <- system.time({
  e_power_e_mc <- unlist( parallel::mclapply(x, mc.cores=2, FUN=e_power_e_fun))
})
s_bp
##    user  system elapsed 
##   1.330   0.364   0.962
## Same results
all(e_power_e_vec==e_power_e_mc)
## [1] TRUE

Parallelism does not go great with a GUI. For identifying bottlenecks on a cluster without a GUI, try Rprof()

Rprof( interval = 0.005)
    # Create Data
    x <- runif(2e6)
    y <- sqrt(x)
    ## Loop Format Data
    z <- y*NA
    for(i in 2:length(y)){ z[i] <- (y[i]-y[i-1])/2 }
    ## Regression
    X <- cbind(1,x)[-1,]
    Z <- z[-1]
    reg_fast <- .lm.fit(X, Z)
Rprof(NULL)
summaryRprof()

If you still are stuck, you can

  • try Amazon Web Server for more brute-power
  • rewrite bottlenecks with a working C++ compiler or Fortran compiler.
12.6.2.0.1 Compiled Code

To get C++

  • On Windows, install Rtools.
  • On Mac, install Xcode from the app store.
  • On Linux, sudo apt-get install r-base-dev or similar.

To call C++ from R use package Rcpp

Rcpp::cppFunction('
  int add(int x, int y, int z) {
    int sum = x + y + z;
    return sum;
  }'
)
add(1, 2, 3)
## [1] 6

For help getting started with Rcpp, see https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-quickref.pdf

First try to use C++ (or Fortran) code that others have written

.C
.Fortran

For a tutorial, see https://masuday.github.io/fortran_tutorial/r.html

12.7 Further Programming Guidance

https://rmd4sci.njtierney.com/ https://smac-group.github.io/ds/high-performance-computing.html https://www.stat.umn.edu/geyer/3701/notes/arithmetic.Rmd

For debugging tips

For optimization tips

For parallel programming

For general tips