13 Large Projects
As you scale up a project, you will need to be more organized. Medium and large projects should each have their own Project folder on your computer, containing files, subdirectories with files, and subsubdirectories with files. It should look like this
Project
├── README.txt
├── /Code
│   ├── MAKEFILE.R
│   ├── RBLOCK_001_DataClean.R
│   ├── RBLOCK_002_Figures.R
│   ├── RBLOCK_003_ModelsTests.R
│   ├── RBLOCK_004_Robust.R
│   └── /Logs
│       └── MAKEFILE.Rout
├── /Data
│   ├── /Raw
│   │   ├── Source1.csv
│   │   ├── Source2.shp
│   │   └── Source3.txt
│   └── /Clean
│       ├── AllDatasets.Rdata
│       ├── MainDataset1.Rds
│       └── MainDataset2.csv
├── /Output
│   ├── MainFigure.pdf
│   ├── AppendixFigure.pdf
│   ├── MainTable.tex
│   └── AppendixTable.tex
└── /Writing
    ├── /TermPaper
    │   ├── TermPaper.tex
    │   ├── TermPaper.bib
    │   └── TermPaper.pdf
    ├── /Slides
    │   ├── Slides.Rmd
    │   ├── Slides.html
    │   └── Slides.pdf
    ├── /Poster
    │   ├── Poster.Rmd
    │   ├── Poster.html
    │   └── Poster.pdf
    └── /Proposal
        ├── Proposal.Rmd
        ├── Proposal.html
        └── Proposal.pdf
There are two main meta-files:
- README.txt overviews the project structure and what each code file is doing.
- MAKEFILE explicitly describes and executes all codes (and typically logs the output).
MAKEFILE
If all code is written in the same language (such as R), the makefile can also be written in that language. For us, this looks like
### Project Structure
home_dir <- path.expand("~/Desktop/Project/")
data_dir <- paste0(home_dir, "Data/")
data_dir_r <- paste0(data_dir, "Raw/")
data_dir_c <- paste0(data_dir, "Clean/")
out_dir <- paste0(home_dir, "Output/")
code_dir <- paste0(home_dir, "Code/")
### Execute Codes
### libraries are loaded within each RBLOCK
setwd( code_dir )
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )
Notice there is a lot of documentation (### like this), which is crucial for large projects. Also notice that anyone should be able to replicate the entire project by downloading a zip file and simply changing home_dir.
If some folders or files need to be created, you can do this within R
# list the files and directories
list.files(recursive=TRUE, include.dirs=TRUE)
# create directory called 'Data'
dir.create('Data')
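Nested folders can also be created in one call with recursive=TRUE. This sketch (the folder names are taken from the project structure above) additionally skips any folders that already exist:

```r
## Create nested directories, skipping any that already exist
for(d in c("Data/Raw", "Data/Clean", "Output")){
  if( !dir.exists(d) ) dir.create(d, recursive=TRUE)
}
dir.exists("Data/Raw")
## [1] TRUE
```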
13.1 Logging/Sinking
When executing the makefile, you can also log the output, either by
- Inserting some code into the makefile that logs/sinks the output
### Project Structure
home_dir <- path.expand("~/Desktop/Project/")
data_dir <- paste0(home_dir, "Data/")
data_dir_r <- paste0(data_dir, "Raw/")
data_dir_c <- paste0(data_dir, "Clean/")
out_dir <- paste0(home_dir, "Output/")
code_dir <- paste0(home_dir, "Code/")
### Log Output
setwd( code_dir )
sink("MAKEFILE.Rout", append=TRUE, split=TRUE)
### Execute Codes
source( "RBLOCK_001_DataClean.R" )
source( "RBLOCK_002_Figures.R" )
source( "RBLOCK_003_ModelsTests.R" )
source( "RBLOCK_004_Robust.R" )
### Stop Logging Output
sink()
- Starting a session that logs/sinks while you source the makefile
- Executing the makefile via the command line
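The second option can be sketched in a few lines, assuming MAKEFILE.R sits in the working directory:

```r
## Log everything printed while sourcing the makefile
sink("MAKEFILE.Rout", split=TRUE)  # split=TRUE also echoes output to the console
source("MAKEFILE.R")
sink()                             # stop logging
```

For the third option, running R CMD BATCH MAKEFILE.R from the command line executes the file and writes the output to MAKEFILE.Rout automatically.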
13.2 Class Projects
Zip your project into a single file that is easy for others to identify: Class_Project_LASTNAME_FIRSTNAME.zip
Your code should be readable and error free. For code writing guides, see
- https://google.github.io/styleguide/Rguide.html
- https://style.tidyverse.org/
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/codestyle.html
- http://adv-r.had.co.nz/Style.html
- https://www.burns-stat.com/pages/Tutor/R_inferno.pdf
For organization guidelines, see
- https://guides.lib.berkeley.edu/c.php?g=652220&p=4575532
- https://kbroman.org/steps2rr/pages/organize.html
- https://drivendata.github.io/cookiecutter-data-science/
- https://ecorepsci.github.io/reproducible-science/project-organization.html
For additional logging capabilities, see https://cran.r-project.org/web/packages/logr/
For very large projects, there are many more tools available at https://cran.r-project.org/web/views/ReproducibleResearch.html
For larger-scale projects, use scripts.
13.3 Debugging
In R, you use multiple functions on different types of data objects. Moreover, you “typically solve complex problems by decomposing them into simple functions, not simple objects.” (H. Wickham)
We can use the following packages to help deal with various problems that may arise
Problems print to the console
message("This is what a message looks like")
warning("This is what a warning looks like")
stop("This is what an error looks like")
## Error in eval(expr, envir, enclos): This is what an error looks like
Nonproblems also print to the console
cat("cat\n")
## cat
print("print")
## [1] "print"
13.3.1 Tracing
Consider this example of an error process (originally taken from https://adv-r.hadley.nz/ ).
## Let i() check if its argument is numeric
i <- function(i0) {
if ( !is.numeric(i0) ) {
stop("`d` must be numeric", call.=FALSE)
}
i0 + 10
}
## Let f() call g() call h() call i()
h <- function(i0) i(i0)
g <- function(h0) h(h0)
f <- function(g0) g(g0)
## Observe Error
f("a")
## Error: `d` must be numeric
First try simple print debugging
f2 <- function(g0) {
cat("f2 calls g2()\n")
g2(g0)
}
g2 <- function(h0) {
cat("g2 calls h2() \n")
cat("b =", h0, "\n")
h2(h0)
}
h2 <- function(i0) {
cat("h2 calls i()\n")
i(i0)
}
f2("a")
## f2 calls g2()
## g2 calls h2()
## b = a
## h2 calls i()
## Error: `d` must be numeric
If that fails, try traceback debugging
traceback()
## No traceback available
And if that fails, try an interactive approach, e.g. by inserting browser() into the call stack
g3 <- function(h0) {
browser()
h(h0)
}
f3 <- function(g0) g3(g0)
f3("a")
## Called from: g3(g0)
## debug at <text>#3: h(h0)
## Error: `d` must be numeric
13.3.2 Isolating
To inspect objects
is.object(f)
is.object(c(1,1))
class(f)
class(c(1,1))
## Storage Mode Type
typeof(f)
typeof(c(1,1))
storage.mode(f)
storage.mode(c(1,1))
To check for valid inputs/outputs
x <- c(NA, NULL, NaN, Inf, 0)
cat("Vector to inspect: ")
x
cat("NA: ")
is.na(x)
cat("NULL: ")
is.null(x)
cat("NaN: ")
is.nan(x)
cat("Finite: ")
is.finite(x)
cat("Infinite: ")
is.infinite(x)
## Many others
To check for values
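For example, base R's comparison and summary functions can check value ranges and duplicates (a minimal sketch):

```r
x <- c(-1, 0, 2, 2)
any(x < 0)            ## TRUE: at least one negative value
all(x >= -1)          ## TRUE: no value below -1
range(x)              ## -1 2
which(duplicated(x))  ## 4: position of the repeated value
```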
13.3.3 Handling
Simplest Example
x <- 'A'
tryCatch(
expr = log(x),
error = function(e) {
message('Caught an error but did not break')
print(e)
return(NA)
})
Simple Example
x <- -2
tryCatch(
expr = log(x),
error = function(e) {
message('Caught an error but did not break')
print(e)
return(NA)
},
warning = function(w){
message('Caught a warning!')
print(w)
return(NA)
},
finally = {
message("Returned log(x) if successful or NA if Error or Warning")
}
)
## <simpleWarning in log(x): NaNs produced>
## [1] NA
Safe Functions
## Define
log_safe <- function(x){
lnx <- tryCatch(
expr = log(x),
error = function(e){
cat('Error Caught: \n\t')
print(e)
return(NA)
},
warning = function(w){
cat('Warning Caught: \n\t')
print(w)
return(NA)
})
return(lnx)
}
## Test
log_safe( 1)
## [1] 0
log_safe(-1)
## Warning Caught:
## <simpleWarning in log(x): NaNs produced>
## [1] NA
log_safe("A")
## Error Caught:
## <simpleError in log(x): non-numeric argument to mathematical function>
## [1] NA
## Apply elementwise to a list of mixed inputs
sapply(list(NA, Inf, "B", -1, NaN, 0, -2, 1), log_safe)
## Error Caught:
## <simpleError in log(x): non-numeric argument to mathematical function>
## Warning Caught:
## <simpleWarning in log(x): NaNs produced>
## Warning Caught:
## <simpleWarning in log(x): NaNs produced>
## [1] NA Inf NA NA NaN -Inf NA 0
13.4 Optimizing
In general: clean code is often faster and less error prone.
Repetitive tasks can be optimized. You end up with code that
- is cleaner, faster, and more general
- can be easily parallelized
Computers have big memories and are really good at math.
- First try vectors
- Then try apply functions
- See https://uscbiostats.github.io/software-dev-site/handbook-slow-patterns.html
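To sketch the apply family mentioned above, each call below replaces an explicit loop:

```r
X <- matrix(1:6, ncol=2)
apply(X, 2, max)              ## column maxima: 3 6
sapply(1:3, function(i) i^2)  ## 1 4 9
```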
Don’t waste time on code that is not holding you back.
- Your code may break, be slow, or be incorrect.
- Look at what has already been done.
13.4.1 Benchmarking
The simplest approach is to time how long a code-chunk runs with system.time. For example
## Time a simple chunk (illustrative example)
system.time({
x <- runif(1e6)
sqrt(x)
})
## user system elapsed
## 0.005 0.000 0.005
For identifying bottlenecks
## Generate Large Random Dataset
n <- 2e6
x <- runif(n)
y <- runif(n)
z <- runif(n)
XYZ <- cbind(x,y,z)
## Inspect 4 equivalent `row mean` calculations
profvis::profvis({
m <- rowSums(XYZ)/ncol(XYZ)
m <- rowMeans(XYZ)
m <- apply(XYZ, 1, mean)
m <- rep(NA, n); for(i in 1:n){ m[i] <- (x[i] + y[i] + z[i]) / 3 }
})
## rowSums(), colSums(), rowMeans(), and colMeans() are vectorised and fast.
## for loop is not the slowest, but the ugliest.
For more systematic speed comparisons, use the microbenchmark package
## 3 Equivalent calculations of the mean of a vector
mean1 <- function(x,p=1) mean(x^p)
mean2 <- function(x,p=1) sum(x^p) / length(x)
mean3 <- function(x,p=1) mean.default(x^p)
## Time them
x <- runif(1e6)
microbenchmark::microbenchmark(
mean1(x,.5),
mean2(x,.5),
mean3(x,.5)
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## mean1(x, 0.5) 29.04080 30.31383 32.60656 33.48837 34.12850 36.09136 100 a
## mean2(x, 0.5) 27.86534 29.02359 31.42956 32.50656 32.78087 37.57611 100 b
## mean3(x, 0.5) 28.94673 30.10793 32.60944 33.89489 34.14088 38.64793 100 a
For memory leaks, first free up space and use the bench package for timing
13.4.2 Speed-Ups
Vectorize
Vector operations are generally faster and easier to read than loops
x <- runif(1e6)
## Compare 2 moving averages
## First Try
ma1 <- function(y){
z <- y*NA
for(i in 2:length(y)){
z[i] <- (y[i]-y[i-1])/2
}
return(z)
}
## Optimized using diff
diff( c(2,2,10,9) )
## [1] 0 8 -1
ma2 <- function(y){ c(NA, diff(y)/2) }  # equivalent to ma1
## Check both versions agree, then compare speeds
y <- x
all.equal( ma1(y), ma2(y) )
## [1] TRUE
microbenchmark::microbenchmark( ma1(y), ma2(y) )
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## ma1(y) 246.48998 292.62278 296.22367 297.36998 306.91321 483.2286 100 a
## ma2(y) 23.96706 27.74139 34.27119 31.16914 35.27275 176.1122 100 b
Likewise, matrix operations are often faster than the equivalent vector operations.
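As a small sketch of that claim, one matrix product can replace several elementwise vector operations:

```r
X <- matrix(runif(3e5), ncol=3)
s1 <- as.vector( X %*% rep(1, ncol(X)) )  ## row sums via one matrix product
s2 <- X[,1] + X[,2] + X[,3]               ## row sums via vector additions
all.equal(s1, s2)
## [1] TRUE
```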
Packages
Before creating your own program, check if there is a faster or more memory-efficient version. E.g., data.table or Rfast2 for basic data manipulation.
X <- cbind(1, runif(1e6))
Y <- X %*% c(1,2) + rnorm(1e6)
DAT <- as.data.frame(cbind(Y,X))
system.time({.lm.fit(X, Y) })
## user system elapsed
## 0.537 0.000 0.101
## For comparison, the full-featured lm() is slower (illustrative call)
system.time({ lm(Y~X) })
## user system elapsed
## 1.028 0.006 0.262
Note that quicker code tends to have fewer checks and return less information. So you must know exactly what you are putting in and getting out.
13.4.3 Bottlenecks
Sometimes there will still be a problematic bottleneck.
Your next step should be parallelism:
- Write the function as a general vectorized function.
- Apply the same function to every element in a list at the same time
## lapply in parallel on {m}ultiple {c}ores
x <- c(10,20,30,40,50)
f <- function(element) { element^element }
parallel::mclapply( x, mc.cores=2, FUN=f)
## [[1]]
## [1] 1e+10
##
## [[2]]
## [1] 1.048576e+26
##
## [[3]]
## [1] 2.058911e+44
##
## [[4]]
## [1] 1.208926e+64
##
## [[5]]
## [1] 8.881784e+84
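Note that mclapply relies on forking, which is unavailable on Windows (there mc.cores must be 1). A portable alternative is an explicit cluster with parLapply:

```r
## Cluster-based parallelism (also works on Windows)
x <- c(10,20,30,40,50)
f <- function(element) { element^element }
cl <- parallel::makeCluster(2)
res <- parallel::parLapply(cl, x, f)
parallel::stopCluster(cl)
res[[1]]
## [1] 1e+10
```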
More power is often not the solution
## vectorize and compile
e_power_e_fun <- compiler::cmpfun( function(vector){ vector^vector} )
## base R
x <- 0:1E6
s_vc <- system.time( e_power_e_vec <- e_power_e_fun(x) )
s_vc
## user system elapsed
## 0.033 0.001 0.034
## brute power
x <- 0:1E6
s_bp <- system.time({
e_power_e_mc <- unlist( parallel::mclapply(x, mc.cores=2, FUN=e_power_e_fun))
})
s_bp
## user system elapsed
## 1.397 0.257 0.937
## Check both approaches give the same result
all.equal( e_power_e_vec, e_power_e_mc )
## [1] TRUE
Parallelism does not always play well with a GUI.
For identifying bottlenecks on a cluster without a GUI, try Rprof()
Rprof( interval = 0.005)
# Create Data
x <- runif(2e6)
y <- sqrt(x)
## Loop Format Data
z <- y*NA
for(i in 2:length(y)){ z[i] <- (y[i]-y[i-1])/2 }
## Regression
X <- cbind(1,x)[-1,]
Z <- z[-1]
reg_fast <- .lm.fit(X, Z)
Rprof(NULL)
summaryRprof()
If you are still stuck, you can
- try Amazon Web Services for more brute-power
- rewrite bottlenecks in C++ or Fortran (which requires a working compiler).
Before doing that, however, look into https://cran.r-project.org/web/views/HighPerformanceComputing.html
13.4.3.0.1 Compiled Code
To get C++
- On Windows, install Rtools.
- On Mac, install Xcode from the app store.
- On Linux, sudo apt-get install r-base-dev or similar.
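To check from within R whether a compiler is already on your PATH (the compiler names below are common defaults, not guaranteed to match your system):

```r
## Look for common C++ compilers on the PATH
compilers <- Sys.which( c("g++", "clang++") )
any( nzchar(compilers) )  ## TRUE if at least one compiler was found
```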
To call C++ from R use package Rcpp
Rcpp::cppFunction('
int add(int x, int y, int z) {
int sum = x + y + z;
return sum;
}'
)
add(1, 2, 3)
## [1] 6
For help getting started with Rcpp, see https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-quickref.pdf
First try to use C++ (or Fortran) code that others have written
For a tutorial, see https://masuday.github.io/fortran_tutorial/r.html
13.5 More Literature
Advanced Programming
- https://rmd4sci.njtierney.com/
- https://smac-group.github.io/ds/high-performance-computing.html
- https://www.stat.umn.edu/geyer/3701/notes/arithmetic.Rmd
For debugging tips
- https://cran.r-project.org/doc/manuals/R-lang.html#Debugging
- https://cran.r-project.org/doc/manuals/R-exts.html#Debugging
- https://adv-r.hadley.nz/debugging.html
- https://adv-r.hadley.nz/conditions.html
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/debugging.html
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/functions.html
For optimization tips
- https://cran.r-project.org/doc/manuals/R-exts.html#Tidying-and-profiling-R-code
- https://cran.r-project.org/doc/manuals/R-lang.html#Exception-handling
- https://adv-r.hadley.nz/perf-measure.html
- https://adv-r.hadley.nz/perf-improve.html
- https://cran.r-project.org/doc/manuals/R-exts.html#System-and-foreign-language-interfaces
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/profiling.html
- https://adv-r.hadley.nz/rcpp.html
- https://bookdown.dongzhuoer.com/hadley/adv-r/
For parallel programming
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html
- https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
- https://grantmcdermott.com/ds4e/parallel.html
- https://psu-psychology.github.io/r-bootcamp-2018/talks/parallel_r.html
For general tips
- https://github.com/christophergandrud/Rep-Res-Book
- Efficient R programming. C. Gillespie R. Lovelace. 2021. https://csgillespie.github.io/efficientR/
- Data Science at the Command Line, 1e. Janssens J. 2020. https://www.datascienceatthecommandline.com/1e/
- R Programming for Data Science. Peng R. 2020. https://bookdown.org/rdpeng/rprogdatascience/
- Advanced R. H. Wickham 2019. https://adv-r.hadley.nz/
- Econometrics in R. Grant Farnsworth. 2008. http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
- The R Inferno. https://www.burns-stat.com/documents/books/the-r-inferno/