13 Large Projects


As you scale up a project, you will need to be more organized.

Medium and large sized projects should have their own Project folder on your computer, with files organized into subdirectories and sub-subdirectories. It should look like this:

Project
├── README.txt
├── /Code
│   ├── MAKEFILE.R
│   ├── RBLOCK_001_DataClean.R
│   ├── RBLOCK_002_Figures.R
│   ├── RBLOCK_003_ModelsTests.R
│   ├── RBLOCK_004_Robust.R
│   └── /Logs
│       └── MAKEFILE.Rout
├── /Data
│   ├── /Raw
│   │   ├── Source1.csv
│   │   ├── Source2.shp
│   │   └── Source3.txt
│   └── /Clean
│       ├── AllDatasets.Rdata
│       ├── MainDataset1.Rds
│       └── MainDataset2.csv
├── /Output
│   ├── MainFigure.pdf
│   ├── AppendixFigure.pdf
│   ├── MainTable.tex
│   └── AppendixTable.tex
└── /Writing
    ├── /TermPaper
    │   ├── TermPaper.tex
    │   ├── TermPaper.bib
    │   └── TermPaper.pdf
    ├── /Slides
    │   ├── Slides.Rmd
    │   ├── Slides.html
    │   └── Slides.pdf
    ├── /Poster
    │   ├── Poster.Rmd
    │   ├── Poster.html
    │   └── Poster.pdf
    └── /Proposal
        ├── Proposal.Rmd
        ├── Proposal.html
        └── Proposal.pdf

There are two main meta-files:

  • README.txt gives an overview of the project structure and what each piece of code does
  • MAKEFILE explicitly describes and executes all code (and typically logs the output).

MAKEFILE.

If all code is written in the same language (such as R), the makefile can also be written in that language. For us, this looks like

# Project Structure
home_dir    <- path.expand("~/Desktop/Project/")
data_dir_r  <- paste0(home_dir, "Data/Raw/")
data_dir_c  <- paste0(home_dir, "Data/Clean/")
out_dir     <- paste0(home_dir, "Output/")
code_dir    <- paste0(home_dir, "Code/")

# Execute Codes
# libraries are loaded within each RBLOCK
setwd( code_dir )
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )

# Report all information relevant for replication
sessionInfo()

Notice there is a lot of documentation # like this, which is crucial for large projects. Also notice that anyone should be able to replicate the entire project by downloading a zip file and simply changing home_dir.

If some folders or files need to be created, you can do this within R

# list the files and directories
list.files(recursive=TRUE, include.dirs=TRUE)
# create directory called 'Data'
dir.create('Data')
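If a whole nested path is needed, dir.create() can also build it in one call. A small sketch (building under tempdir() is just for illustration):

```r
# Create nested folders in one call; recursive=TRUE builds intermediate
# directories, and showWarnings=FALSE avoids a warning if they already exist
root <- file.path(tempdir(), "Project")
dir.create(file.path(root, "Data", "Raw"), recursive=TRUE, showWarnings=FALSE)
dir.exists(file.path(root, "Data", "Raw"))
## [1] TRUE
```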

13.1 Logging/Sinking

When executing the makefile, you can also log the output in three different ways:

  1. Inserting some code into the makefile that “sinks” the output
# Project Structure
home_dir    <- path.expand("~/Desktop/Project/")
data_dir_r  <- paste0(home_dir, "Data/Raw/")
data_dir_c  <- paste0(home_dir, "Data/Clean/")
out_dir     <- paste0(home_dir, "Output/")
code_dir    <- paste0(home_dir, "Code/")

# Log Output
setwd( code_dir )
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)

# Execute Codes
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )
sessionInfo()

# Stop Logging Output
sink()
  2. Starting a session that “sinks” the makefile
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)
source("MAKEFILE.R")
sink()
  3. Execute the makefile via the command line
R CMD BATCH MAKEFILE.R MAKEFILE.Rout

13.3 Debugging

In R, you use multiple functions on different types of data objects. Moreover, you “typically solve complex problems by decomposing them into simple functions, not simple objects.” (H. Wickham)

We can use the following packages to help deal with various problems that may arise:

library(microbenchmark)
library(compiler)
library(profvis)
library(parallel)
library(Rcpp)

Problems print to the console

message("This is what a message looks like")

warning("This is what a warning looks like")

stop("This is what an error looks like")
## Error: This is what an error looks like

Non-problems also print to the console

cat('cat\n')
## cat
print('print')
## [1] "print"
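Conversely, when a message or warning is expected and harmless, it can be silenced. A small sketch:

```r
suppressMessages( message("this message is hidden") )
x <- suppressWarnings( as.numeric("a") )  # silences the coercion warning
x
## [1] NA
```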

Tracing.

Consider this example of an error process (originally taken from https://adv-r.hadley.nz/).

# Let i() check if its argument is numeric
i <- function(i0) {
  if ( !is.numeric(i0) ) {
    stop("`d` must be numeric", call.=FALSE)
  }
  i0 + 10
}

# Let f() call g() call h() call i()
h <- function(i0) i(i0)
g <- function(h0) h(h0)
f <- function(g0) g(g0)

# Observe Error
f("a")
## Error: `d` must be numeric

First try simple print debugging

f2 <- function(g0) {
  cat("f2 calls g2()\n")
  g2(g0)
}
g2 <- function(h0) {
  cat("g2 calls h2() \n")
  cat("b =", h0, "\n")
  h2(h0)
}
h2 <- function(i0) {
  cat("h2 call i() \n")
  i(i0)
}

f2("a")
## f2 calls g2()
## g2 calls h2() 
## b = a 
## h2 call i()
## Error: `d` must be numeric

If that fails, try traceback debugging

traceback()
## No traceback available

And if that fails, try an Interactive approach

g3 <- function(h0) {
  browser()
  h(h0)
}
f3 <- function(g0){
  g3(g0)
}
f3("a")
## Called from: g3(g0)
## debug: h(h0)
## Error: `d` must be numeric
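Besides browser(), R provides a few related interactive tools: debug() flags a function so that its next call is stepped through line by line, and options(error=recover) opens a browser on the call stack whenever an error occurs. A small sketch (g4 is just a hypothetical example function):

```r
g4 <- function(h0) h0 + 10   # hypothetical example function
debug(g4)                    # flag g4() for line-by-line stepping
isdebugged(g4)
## [1] TRUE
undebug(g4)                  # remove the flag

options(error = recover)     # browse the call stack on any error
options(error = NULL)        # restore the default behavior
```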

Isolating.

To inspect objects

is.object(f)
is.object(c(1,1))

class(f)
class(c(1,1))

# Storage Mode Type 
typeof(f)
typeof(c(1,1))

storage.mode(f)
storage.mode(c(1,1))

To check for valid inputs/outputs

x <- c(NA, NULL, NaN, Inf, 0)

cat("Vector to inspect: ")
x

cat("NA: ")
is.na(x)

cat("NULL: ")
is.null(x)

cat("NaN: ")
is.nan(x)

cat("Finite: ")
is.finite(x)

cat("Infinite: ")
is.infinite(x)
# Many others

To check for values

all( x > -2 )
any( x > -2 )
# Check rows of a logical matrix
rowAny <- function(x) rowSums(x) > 0
rowAll <- function(x) rowSums(x) == ncol(x)
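These helpers expect a logical matrix. A small usage sketch (repeating the definitions so the chunk is self-contained):

```r
rowAny <- function(x) rowSums(x) > 0          # as above
rowAll <- function(x) rowSums(x) == ncol(x)   # as above

X <- matrix(c( 1, -1,  2,
              -5, -6, -7), nrow=2, byrow=TRUE)
rowAny(X > 0)   # does each row contain a positive entry?
## [1]  TRUE FALSE
rowAll(X > 0)   # are all entries in the row positive?
## [1] FALSE FALSE
```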

Handling.

Simplest example

x <- 'A'
tryCatch(
  expr = log(x),
  error = function(e) {
        message('Caught an error but did not break')
        print(e)
        return(NA)
})

Another example

x <- -2
tryCatch(
  expr = log(x),
  error = function(e) {
        message('Caught an error but did not break')
        print(e)
        return(NA)
    },
    warning = function(w){
        message('Caught a warning!')
        print(w)
        return(NA)        
    },
    finally = {
        message("Returned log(x) if successful, or NA if there was an Error or Warning")
    }
)
## <simpleWarning in log(x): NaNs produced>
## [1] NA

Safe Functions.

# Define
log_safe <- function(x){
    lnx <- tryCatch(
        expr = log(x),
        error = function(e){
            cat('Error Caught: \n\t')        
            print(e)
            return(NA)            
        },
        warning = function(w){
            cat('Warning Caught: \n\t')
            print(w)
            return(NA)            
        })
    return(lnx)
}

# Test 
log_safe( 1)
## [1] 0
log_safe(-1)
## Warning Caught: 
##  <simpleWarning in log(x): NaNs produced>
## [1] NA
log_safe(' ')
## Error Caught: 
##  <simpleError in log(x): non-numeric argument to mathematical function>
## [1] NA
# Further Tests
s <- sapply(list("A",Inf, -Inf, NA, NaN, 0, -1, 1), log_safe)
## Error Caught: 
##  <simpleError in log(x): non-numeric argument to mathematical function>
## Warning Caught: 
##  <simpleWarning in log(x): NaNs produced>
## Warning Caught: 
##  <simpleWarning in log(x): NaNs produced>
s
## [1]   NA  Inf   NA   NA  NaN -Inf   NA    0

13.4 Optimizing

In general, clean code is faster and less error-prone.

By optimizing repetitive tasks, you end up with code that

  • is cleaner, faster, and more general
  • can be easily parallelized

So, after identifying a bottleneck, try

  1. vectorize your code
  2. use a dedicated package
  3. use parallel computations
  4. compile your code in C++

But remember

  • Don’t waste time optimizing code that is not holding you back.
  • Look at what has already been done.

Benchmarking.

For identifying bottlenecks, the simplest approach is to time how long a code chunk takes to run

system.time({
    x0 <- runif(1e5)
    x1 <- sqrt(x0)
    x2 <- paste0('J-', x1)
})
##    user  system elapsed 
##   0.236   0.000   0.238

You can visually identify bottlenecks in larger blocks

# Generate Large Random Dataset
n <- 2e6
x <- runif(n)
y <- runif(n)
z <- runif(n)
XYZ <- cbind(x,y,z)

# Inspect 4 equivalent `row mean` calculations 
profvis::profvis({
    m <- rowSums(XYZ)/ncol(XYZ)
    m <- rowMeans(XYZ)
    m <- apply(XYZ, 1, mean)
    m <- rep(NA, n);  for(i in 1:n){ m[i] <- (x[i] + y[i] + z[i]) / 3 }
})
# rowSums(), colSums(), rowMeans(), and colMeans() are vectorised and fast.
# for loop is not the slowest, but the ugliest.

For systematic speed comparisons, try the microbenchmark package

# 3 Equivalent calculations of the mean of a vector
mean1 <- function(x,p=1) mean(x^p)
mean2 <- function(x,p=1) sum(x^p) / length(x)
mean3 <- function(x,p=1) mean.default(x^p)

# Time them
x <- runif(1e6)
microbenchmark::microbenchmark(
  mean1(x,.5),
  mean2(x,.5),
  mean3(x,.5)
)
## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval cld
##  mean1(x, 0.5) 32.33647 33.09313 33.52699 33.12929 33.17716 43.16296   100  a 
##  mean2(x, 0.5) 30.39817 31.70594 32.44638 31.73242 31.76939 43.23818   100   b
##  mean3(x, 0.5) 32.35873 33.08039 33.41836 33.11147 33.16153 43.05426   100  a

Vectorize.

Computers are really good at math, so exploit this.

Vector operations are generally faster and easier to read than loops

y <- runif(1e6)

# Compare 2 implementations of a scaled first difference

# First Try
ma1 <- function(y){
    z <- y*NA
    for(i in 2:length(y)){
        z[i] <- (y[i]-y[i-1])/2
    }
    return(z)
}

# Optimized using diff
diff( c(2,2,10,9) )
## [1]  0  8 -1
ma2 <- function(y){
    z2 <- diff(y)/2
    z2 <- c(NA, z2) 
    return(z2)
}

all.equal(ma1(y),ma2(y))
## [1] TRUE
microbenchmark::microbenchmark(
  ma1(y),
  ma2(y)
)
## Unit: milliseconds
##    expr       min       lq      mean    median        uq      max neval cld
##  ma1(y) 230.93547 235.1598 239.58282 236.80274 240.88109 262.1204   100  a 
##  ma2(y)  25.82273  27.8113  36.86629  34.88936  38.63891 220.3427   100   b

Likewise, matrix operations are often faster than vector operations.
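For example, crossprod(X) computes t(X) %*% X in a single fused matrix operation, without explicitly forming the transpose; on large matrices it is typically faster. A small sketch:

```r
X <- matrix(runif(2e5), ncol=2)
XtX_1 <- t(X) %*% X    # forms the transpose explicitly
XtX_2 <- crossprod(X)  # one fused matrix operation
all.equal(XtX_1, XtX_2)
## [1] TRUE
```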

Packages.

Before creating your own program, check if there is a faster or more memory-efficient version. E.g., data.table or Rfast2 for basic data manipulation.

Some functions are simply wrappers for the function you want, and calling it directly can speed things up.

X <- cbind(1, runif(1e6))
Y <- X %*% c(1,2) + rnorm(1e6)
DAT <- as.data.frame(cbind(Y,X))

system.time({ lm(Y~X, data=DAT) })
##    user  system elapsed 
##   0.606   0.053   0.264
system.time({ .lm.fit(X, Y) })
##    user  system elapsed 
##   0.112   0.000   0.040
system.time({ solve(t(X)%*%X) %*% (t(X)%*%Y) })
##    user  system elapsed 
##   0.028   0.000   0.024

Note that such functions tend to have fewer checks and return less information, so you must know exactly what you are putting in and getting out.

Parallel.

Sometimes there will still be a problematic bottleneck.

Your next step should be parallelism:

  • Write the function as a general vectorized function.
  • Apply the same function to every element in a list at the same time
# lapply in parallel on {m}ultiple {c}ores
x <- c(10,20,30,40,50)
f <- function(element) { element^element } 
parallel::mclapply( x, mc.cores=2, FUN=f)
## [[1]]
## [1] 1e+10
## 
## [[2]]
## [1] 1.048576e+26
## 
## [[3]]
## [1] 2.058911e+44
## 
## [[4]]
## [1] 1.208926e+64
## 
## [[5]]
## [1] 8.881784e+84

More power is often not the solution

# vectorize and compile
e_power_e_fun <- compiler::cmpfun( function(vector){ vector^vector} )

# base R
x <- 0:1E6
s_vc <- system.time( e_power_e_vec <- e_power_e_fun(x) )
s_vc
##    user  system elapsed 
##   0.035   0.003   0.038
# brute power
x <- 0:1E6
s_bp <- system.time({
  e_power_e_mc <- unlist( parallel::mclapply(x, mc.cores=2, FUN=e_power_e_fun))
})
s_bp
##    user  system elapsed 
##   1.340   0.333   0.928
# Same results
all(e_power_e_vec==e_power_e_mc)
## [1] TRUE

Parallelism does not always play well with a GUI. For identifying bottlenecks on a cluster without a GUI, try Rprof()

Rprof( interval = 0.005)
    # Create Data
    x <- runif(2e6)
    y <- sqrt(x)
    # Loop Format Data
    z <- y*NA
    for(i in 2:length(y)){ z[i] <- (y[i]-y[i-1])/2 }
    # Regression
    X <- cbind(1,x)[-1,]
    Z <- z[-1]
    reg_fast <- .lm.fit(X, Z)
Rprof(NULL)
summaryRprof()

If you are still stuck, you can

  • try Amazon Web Services for more brute power
  • rewrite bottlenecks in C++ or Fortran (you will need a working compiler).

Before doing that, however, look into https://cran.r-project.org/web/views/HighPerformanceComputing.html

Compiled Code.

You can use C++ code within R to speed up a specific chunk.

To get a C++ compiler on your computer

  • On Windows, install Rtools.
  • On Mac, install Xcode from the app store.
  • On Linux, sudo apt-get install r-base-dev or similar.

To call C++ from R use package Rcpp

Rcpp::cppFunction('
  int add(int x, int y, int z) {
    int sum = x + y + z;
    return sum;
  }'
)
add(1, 2, 3)
## [1] 6

For help getting started with Rcpp, see https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-quickref.pdf

First try to use C++ (or Fortran) code that others have written

.C
.Fortran

For a tutorial, see https://masuday.github.io/fortran_tutorial/r.html

Memory Usage.

For finding problematic blocks or whole scripts: utils::Rprof(memory.profiling = TRUE) logs total memory usage of R at regular time intervals

For finding problematic functions: utils::Rprofmem() logs memory usage at each call
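A small sketch of Rprofmem() (it only works when R is compiled with memory profiling support, which capabilities("profmem") checks):

```r
# Log allocations larger than 1 KB to a temporary file
if ( capabilities("profmem") ) {
    f <- tempfile()
    utils::Rprofmem(f, threshold=1024)
    x <- numeric(1e6)        # one large allocation, should be logged
    utils::Rprofmem(NULL)    # stop logging
    head(readLines(f), 3)
}
```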

For memory leaks, first free up space and use the bench package for timing

gc() # garbage collection

bench::mark(
  mean1(x,.5),
  mean2(x,.5),
  mean3(x,.5))

13.5 More Literature

Advanced Programming

For debugging tips

For optimization tips

For parallel programming

For general tips