17 Performance
17.1 Debugging
In R, you use multiple functions on different types of data objects. Moreover, you “typically solve complex problems by decomposing them into simple functions, not simple objects.” (H. Wickham)
Problems print to the console
## Error: This is what an error looks like
Nonproblems also print to the console
## cat
## [1] "print"
Tracing.
Consider this example of an error process (originally taken from https://adv-r.hadley.nz/ ).
# Let i() check if its argument is numeric
i <- function(i0) {
if ( !is.numeric(i0) ) {
stop("`d` must be numeric", call.=FALSE)
}
i0 + 10
}
# Let f() call g() call h() call i()
h <- function(i0) i(i0)
g <- function(h0) h(h0)
f <- function(g0) g(g0)
# Observe Error
f("a")
## Error: `d` must be numeric
First try simple print debugging
f2 <- function(g0) {
cat("f2 calls g2()\n")
g2(g0)
}
g2 <- function(h0) {
cat("g2 calls h2() \n")
cat("b =", h0, "\n")
h2(h0)
}
h2 <- function(i0) {
cat("h2 call i() \n")
i(i0)
}
f2("a")
## f2 calls g2()
## g2 calls h2()
## b = a
## h2 call i()
## Error: `d` must be numeric
If that fails, try traceback debugging
## No traceback available
And if that fails, try an Interactive approach
## Called from: g3(g0)
## debug: h(h0)
## Error: `d` must be numeric
Isolating.
To inspect objects
is.object(f)
is.object(c(1,1))
class(f)
class(c(1,1))
# Storage Mode Type
typeof(f)
typeof(c(1,1))
storage.mode(f)
storage.mode(c(1,1))
To check for valid inputs/outputs
x <- c(NA, NULL, NaN, Inf, 0)
cat("Vector to inspect: ")
x
cat("NA: ")
is.na(x)
cat("NULL: ")
is.null(x)
cat("NaN: ")
is.nan(x)
cat("Finite: ")
is.finite(x)
cat("Infinite: ")
is.infinite(x)
# Many others
To check for values
all( x > -2 )
any( x > -2 )
# Check Matrix Rows
rowAny <- function(x) rowSums(x) > 0
rowAll <- function(x) rowSums(x) == ncol(x)
Handling.
Simplest example
x <- 'A'
tryCatch(
expr = log(x),
error = function(e) {
message('Caught an error but did not break')
print(e)
return(NA)
})
Another example
x <- -2
tryCatch(
expr = log(x),
error = function(e) {
message('Caught an error but did not break')
print(e)
return(NA)
},
warning = function(w){
message('Caught a warning!')
print(w)
return(NA)
},
finally = {
message("Returned log(x) if successfull or NA if Error or Warning")
}
)
## <simpleWarning in log(x): NaNs produced>
## [1] NA
Safe Functions
# Define
log_safe <- function(x){
lnx <- tryCatch(
expr = log(x),
error = function(e){
cat('Error Caught: \n\t')
print(e)
return(NA)
},
warning = function(w){
cat('Warning Caught: \n\t')
print(w)
return(NA)
})
return(lnx)
}
# Test
log_safe( 1)
## [1] 0
## Warning Caught:
## <simpleWarning in log(x): NaNs produced>
## [1] NA
## Error Caught:
## <simpleError in log(x): non-numeric argument to mathematical function>
## [1] NA
## Error Caught:
## <simpleError in log(x): non-numeric argument to mathematical function>
## Warning Caught:
## <simpleWarning in log(x): NaNs produced>
## Warning Caught:
## <simpleWarning in log(x): NaNs produced>
## [1] NA Inf NA NA NaN -Inf NA 0
17.2 Optimizing
In General, clean code is faster and less error prone.
By optimizing repetitive tasks, you end up with code that
- is cleaner, faster, and more general
- can be easily parallelized
So, after identifying a bottleneck, try
- vectorize your code
- use a dedicated package
- use parallel computations
- compile your code in C++
But remember
- Don’t waste time optimizing code that is not holding you back.
- Look at what has already done.
Benchmarking.
For identifying bottlenecks, the simplest approach is to time how long a code-chunk runs
## user system elapsed
## 0.130 0.008 0.139
You can visually identify bottlenecks in larger blocks
# Generate Large Random Dataset
n <- 2e6
x <- runif(n)
y <- runif(n)
z <- runif(n)
XYZ <- cbind(x,y,z)
# Inspect 4 equivalent `row mean` calculations
profvis::profvis({
m <- rowSums(XYZ)/ncol(XYZ)
m <- rowMeans(XYZ)
m <- apply(XYZ, 1, mean)
m <- rep(NA, n); for(i in 1:n){ m[i] <- (x[i] + y[i] + z[i]) / 3 }
})
# rowSums(), colSums(), rowMeans(), and colMeans() are vectorised and fast.
# for loop is not the slowest, but the ugliest.
For systematic speed comparisons, try a benchmarking package
# 3 Equivalent calculations of the mean of a vector
mean1 <- function(x,p=1) mean(x^p)
mean2 <- function(x,p=1) sum(x^p) / length(x)
mean3 <- function(x,p=1) mean.default(x^p)
# Time them
x <- runif(1e6)
microbenchmark::microbenchmark(
mean1(x,.5),
mean2(x,.5),
mean3(x,.5),
times=20
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## mean1(x, 0.5) 21.22159 24.05092 27.19083 26.93219 29.28603 38.83637 20 a
## mean2(x, 0.5) 22.64585 23.40534 25.91066 24.89418 27.29905 35.28588 20 a
## mean3(x, 0.5) 21.49653 24.04798 26.32646 25.70899 29.43275 31.42388 20 a
## # A tibble: 3 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 mean1(x, 0.5) 20.8ms 22.7ms 44.7 7.63MB 0
## 2 mean2(x, 0.5) 19.4ms 22ms 47.2 7.63MB 2.49
## 3 mean3(x, 0.5) 20.8ms 23.1ms 44.2 7.63MB 2.33
Vectorize.
Computers are really good at math, so exploit this.
- First try vectors
- Then try
apply
functions - See https://uscbiostats.github.io/software-dev-site/handbook-slow-patterns.html
Vector operations are generally faster and easier to read than loops
# Compare 2 moving averages
x <- runif(2e6)
# First Try
ma1 <- function(y){
z <- y*NA
for(i in 2:length(y)){
z[i] <- (y[i]-y[i-1])/2
}
return(z)
}
# Optimized using diff
diff( c(2,2,10,9) )
## [1] 0 8 -1
## [1] TRUE
## user system elapsed
## 0.153 0.000 0.154
## user system elapsed
## 0.018 0.011 0.029
## user system elapsed
## 0.02 0.00 0.02
Likewise, matrix operations are often faster than vector operations.
Memory Usage.
For finding problematic blocks utils::Rprof(memory.profiling = TRUE)
logs total memory usage of R at regular time intervals. E.g.
Rprof( interval = 0.005)
# Create Data
x <- runif(2e6)
y <- sqrt(x)
# Loop Format Data
z <- y*NA
for(i in 2:length(y)){ z[i] <- (y[i]-y[i-1])/2 }
# Regression
X <- cbind(1,x)[-1,]
Z <- z[-1]
reg_fast <- .lm.fit(X, Z)
Rprof(NULL)
summaryRprof()
For finding problematic functions: utils::Rprofmem()
logs memory usage at each call
For finding problematic scripts, see “Advanced Optimization”. (With Rprof, you can identifying bottlenecks on a cluster without a GUI.)
Packages.
Before creating your own program, check if there is a faster or more memory efficient version. E.g., data.table or Rfast2 for basic data manipulation.
Some functions are simply wrappers for the function you want, and calling it directly can speed things up.
X <- cbind(1, runif(1e6))
Y <- X %*% c(1,2) + rnorm(1e6)
DAT <- as.data.frame(cbind(Y,X))
system.time({ lm(Y~X, data=DAT) })
## user system elapsed
## 0.550 0.010 0.179
## user system elapsed
## 0.104 0.000 0.030
## user system elapsed
## 0.017 0.000 0.015
Note that such functions to have fewer checks and return less information, so you must know exactly what you are putting in and getting out.
Parallel.
Sometimes there will still be a problematic bottleneck.
Your next step should be parallelism:
- Write the function as a general vectorized function.
- Apply the same function to every element in a list at the same time
# lapply in parallel on {m}ultiple {c}ores
x <- c(10,20,30,40,50)
f <- function(element) { element^element }
parallel::mclapply( x, mc.cores=2, FUN=f)
## [[1]]
## [1] 1e+10
##
## [[2]]
## [1] 1.048576e+26
##
## [[3]]
## [1] 2.058911e+44
##
## [[4]]
## [1] 1.208926e+64
##
## [[5]]
## [1] 8.881784e+84
More power is often not the solution
# vectorize and compile
e_power_e_fun <- compiler::cmpfun( function(vector){ vector^vector} )
# base R
x <- 0:1E6
s_vc <- system.time( e_power_e_vec <- e_power_e_fun(x) )
s_vc
## user system elapsed
## 0.020 0.001 0.021
# brute power
x <- 0:1E6
s_bp <- system.time({
e_power_e_mc <- unlist( parallel::mclapply(x, mc.cores=2, FUN=e_power_e_fun))
})
s_bp
## user system elapsed
## 1.186 0.204 0.878
## [1] TRUE
Note that parallelism does not go well with a GUI.
17.3 Advanced Optimizing
If you still are stuck, you can
- try Amazon Web Server for more brute-power
- rewrite bottlenecks with a working C++ compiler or Fortran compiler.
Before doing that, however, look into https://cran.r-project.org/web/views/HighPerformanceComputing.html
In what follows, note that there are alternative ways to run R via the command line. For example,
Cluster Computing
For parallel computations on a computer cluster, you will need to use both R and the linux command line.
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
## [1] "A 1" "D 2"
R --slave -e '
.libPaths("~/R-Libs")
options(repos="https://cloud.r-project.org/")
update.packages(ask=F)
suppressPackageStartupMessages(library(stargazer))
'
You will often want to rerun entire scripts with a different parameter. To do so, you need to edit your R scripts to accept parameters from the command line
For example
R --slave -e '
args <- commandArgs(TRUE)
paste0(1, args)
' --args a b c
R --slave -e '
args <- commandArgs(TRUE)
my_numbers <- as.numeric(args)
my_numbers + 1
' --args $(seq 1 10)
## [1] "1a" "1b" "1c"
## [1] 2 3 4 5 6 7 8 9 10 11
You can also store the output and computer resources of a script. For example, save the last script as RBLOCK.R
in the folder $HOME/R_Code
and run the following
# Which Code
RDIR=$HOME/R_Code #main directory
infile=$RDIR/RBLOCK.R #specific code
outfile=$RDIR/R_Logs/RBLOCK$design_N.Rout #log R output
memfile=$RDIR/R_Logs/RBLOCK$design_N.Rtime #log computer resources
# Execute the Script and store resource useage via `time`
command time -o $memfile -v \
R CMD BATCH --no-save --quiet --no-restore "--args 1 2 3" $infile $outfile
Note that you need to have https://ftp.gnu.org/gnu/time installed
On an academic computing cluster, you may have to use a scheduler like slurm. In which case you can submit a bash script
where SlurmJob.sh
looks like
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --error=$HOME/R_Code/Slurm_Logs/%x.error-%j
#SBATCH --output=$HOME/R_Code/Slurm_Logs/%x.out-%j
# Which Code
RDIR=$HOME/R_Code
infile=$RDIR/RBLOCK.R
outfile=$RDIR/R_Logs/RBLOCK$design_N.Rout
memfile=$RDIR/R_Logs/RBLOCK$design_N.Rtime
# Which Parameter
a="${SLURM_ARRAY_TASK_ID}"
# Execute the Script with a specific parameter, and store memory/time useage
command time -o $memfile -v \
R CMD BATCH --no-save --quiet --no-restore "--args $a" $infile $outfile
# Summarize memory/time useage
echo "design_N: $design_N Gridpoints"
cat $memfile | awk '/User time/ {printf "Time: %.2f Hours\n", $4/3600}'
cat $memfile | awk '/Maximum resident set size/ {printf "Memory: %.2f GB\n", $6/1048576}'
echo "Partition: $SLURM_JOB_PARTITION"
Compiled Code.
You can use C++ code within R to speed up a specific chunk.
To get C++ on your computer
- On Windows, install Rtools.
- On Mac, install Xcode from the app store.
- On Linux, sudo apt-get install r-base-dev or similar.
To call C++ from R use package Rcpp
Rcpp::cppFunction('
int add(int x, int y, int z) {
int sum = x + y + z;
return sum;
}'
)
add(1, 2, 3)
## [1] 6
For help getting started with Rcpp, see https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-quickref.pdf
First try to use C++ (or Fortran) code that others have written
For a tutorial, see https://masuday.github.io/fortran_tutorial/r.html
17.4 More Literature
Advanced Programming
- https://rmd4sci.njtierney.com/
- https://smac-group.github.io/ds/high-performance-computing.html
- https://www.stat.umn.edu/geyer/3701/notes/arithmetic.Rmd
For debugging tips
- https://cran.r-project.org/doc/manuals/R-lang.html#Debugging
- https://cran.r-project.org/doc/manuals/R-exts.html#Debugging
- https://adv-r.hadley.nz/debugging.html
- https://adv-r.hadley.nz/conditions.html
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/debugging.html
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/functions.html
For optimization tips
- https://cran.r-project.org/doc/manuals/R-exts.html#Tidying-and-profiling-R-code
- https://cran.r-project.org/doc/manuals/R-lang.html#Exception-handling
- https://adv-r.hadley.nz/perf-measure.html.libPaths()
- https://adv-r.hadley.nz/perf-improve.html
- https://cran.r-project.org/doc/manuals/R-exts.html#System-and-foreign-language-interfaces https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/profiling.html
- https://adv-r.hadley.nz/rcpp.html
- https://bookdown.dongzhuoer.com/hadley/adv-r/
For parallel programming
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html
- https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
- https://grantmcdermott.com/ds4e/parallel.html
- https://psu-psychology.github.io/r-bootcamp-2018/talks/parallel_r.html
For general tips
- https://github.com/christophergandrud/Rep-Res-Book
- Efficient R programming. C. Gillespie R. Lovelace. 2021. https://csgillespie.github.io/efficientR/
- Data Science at the Command Line, 1e. Janssens J. 2020. https://www.datascienceatthecommandline.com/1e/
- R Programming for Data Science. Peng R. 2020. https://bookdown.org/rdpeng/rprogdatascience/
- Advanced R. H. Wickham 2019. https://adv-r.hadley.nz/
- Econometrics in R. Grant Farnsworth. 2008. http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
- The R Inferno. https://www.burns-stat.com/documents/books/the-r-inferno/