Optimizing R Code

It is easy to write slow code in R, in part because the ecosystem lets you get results without thinking about performance. When processing vast amounts of data, performance becomes critical to keeping execution time manageable. There is a limit to how fast R can go, but with some knowledge of how code is executed and of the algorithm being used, its speed is comparable to most other statistical programming languages.
Terminology:
- User time - CPU time charged for the execution of user instructions of the calling process
- System time - CPU time charged for execution by the system on behalf of the calling process
R's system.time() function returns something like:
user system elapsed
12.080 0.160 12.241
We are usually most interested in the total elapsed time of the execution, but there are cases when the user or system time is useful.
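For example, wrapping an expression in a system.time() call produces that three-number summary (an illustrative call; the exact timings will differ on your machine):
# Illustrative only: time a vectorized computation on ten million values
system.time(sum(log(rnorm(1e7, 100, 2))))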
Benchmarking Tools
- We could use microbenchmark for evaluating the time it takes to run a command
- We could use system.time(<expression>), but it may return 0 for functions that run fast
- We could use profvis for code profiling
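Of these, microbenchmark() is used throughout the examples below. A minimal sketch of a call (assuming the microbenchmark package is installed) looks like this; each expression is run many times and the distribution of timings is reported:
library(microbenchmark)

# Compare two ways of computing square roots; each expression runs 100 times
microbenchmark(
  sqrt(1:1e4),
  (1:1e4)^0.5,
  times = 100
)
The library(microbenchmark) call above is also needed for the later microbenchmark() examples in this post.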
Memory Allocation
R loops have a reputation for being slow, but this isn't always true. Most of the time, loops run slowly because they are implemented with dynamic memory allocation, growing objects inside the loop. Consider the following example:
# Growing Matrix = BAD
matr <- NULL
system.time( for(i in seq(1e4)) matr <- rbind(matr, 1:10) )
user system elapsed
4.447 0.012 4.454
# Pre-Allocated Matrix = GOOD
matr <- matrix(NA, nrow=1e4, ncol=10)
system.time( for(i in seq(1e4) ) matr[i,] <- 1:10)
user system elapsed
0.013 0.000 0.013
Consider the loops above, which fill a matrix in two different ways. In the first loop, with rbind,
every iteration allocates a new, larger chunk of memory, copies the previous values, computes the current values, and then de-allocates the old memory. Avoid growing data frames and matrices by allocating all of the memory at the start!
Vectorization
Compare the following two methods of computing the logarithm of each element of a vector and then adding all the elements together:
x <- rnorm(1e4, 100, 2)
microbenchmark({
  # Slow method: loop and accumulate
  total <- 0
  for (i in seq_along(x)) total <- total + log(x[i])
},
{
  # Faster: vectorized
  log_sum <- sum(log(x))
})
Unit: microseconds
expr
{ total <- 0 for (i in seq_along(x)) total <- total + log(x[i]) }
{ log_sum <- sum(log(x)) }
min lq mean median uq max neval cld
2871.843 2928.1920 3171.2189 2971.4760 3333.7445 8537.560 100 b
314.734 319.7805 335.6021 323.7565 357.1405 366.385 100 a
So we observe that computing the entire vector at once is much more efficient than doing it one-by-one in a loop.
In the loop:
- log() is called 1e4 times, each time with a single value
- Memory is allocated 1e4 times to store the resulting value
- The addition and the assignment operation are each performed 1e4 times
Whereas in the vectorized approach:
- log() is called only once
- Memory is allocated once (though for a larger object)
- sum() is called once and the assignment operation is performed once
The vectorized approach also uses a loop at its core, but that loop runs in compiled code, which makes it much faster. Whenever possible, prefer R's built-in vectorized operations over manual implementations. For instance, the ifelse() function is more efficient than an if (...) {} else {} inside a loop, and seq_along(x) is preferable to 1:length(x) for creating index vectors.
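As a rough sketch (reusing the x vector defined above; flag_loop is a made-up helper, for illustration only), compare an explicit loop against ifelse():
# Hypothetical helper: label each element using an explicit loop with if/else
flag_loop <- function(v) {
  out <- character(length(v))
  for (i in seq_along(v)) {
    if (v[i] > 100) out[i] <- "high" else out[i] <- "low"
  }
  out
}
microbenchmark(
  loop       = flag_loop(x),
  vectorized = ifelse(x > 100, "high", "low")
)
Built-in row and column summaries behave the same way: rowMeans() greatly outperforms the equivalent apply() call on the matr matrix created earlier: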
microbenchmark(
apply(matr, 1, mean),
rowMeans(matr) )
Unit: microseconds
expr min lq mean median uq max
apply(matr, 1, mean) 570.057 580.6060 589.59984 587.9365 595.4905 710.220
rowMeans(matr) 24.747 25.5735 26.89985 27.2620 27.7635 36.419
Data Frames vs Matrices
Matrices are faster than data frames because they carry less overhead. Many R functions coerce data frames into matrices internally. Of course, matrices can only contain a single type of data, while data frames are lists of vectors, which makes them ideal for complex datasets.
microbenchmark(
  df = data.frame(
    col1 = sample(1:10, 30, replace = TRUE)
  ),
  mat = matrix(sample(1:10, 30, replace = TRUE), nrow = 30)
)
Unit: microseconds
expr min lq mean median uq max neval cld
df 130.034 134.467 155.33058 135.7825 138.6015 1788.198 100 b
mat 7.121 8.341 30.29864 9.7170 13.3075 1960.421 100 a
Each column of a data frame is stored separately in memory. The most efficient way of extracting a column from a data frame is by using the column name:
# df is assumed here (for illustration) to be a data frame with default
# column names (V1, V2, V3, ...) and at least 50 rows, e.g.:
df <- as.data.frame(matrix(rnorm(300), ncol = 3))
microbenchmark(
  "[50, 3]" = df[50, 3],
  "$V3[50]" = df$V3[50]
)
Unit: nanoseconds
expr min lq mean median uq max neval cld
[50, 3] 11107 11761.5 12669.46 12035.5 12529.5 52854 100 b
$V3[50] 774 1066.0 1352.95 1233.0 1435.5 12708 100 a
Keep It Simple, Stupid
Above all else, write your code to be as simple as possible. Read the function documentation to understand which arguments can be used to avoid extra calculations. For example, the unlist() function returns a named vector by default, but if we are interested only in the values we can pass use.names = FALSE for better performance:
# Create a list
my.list <- split(sample(1000, size = 100000, replace = TRUE), rep(1:10000, times = 10))
# Use unlist function to convert list to a vector
microbenchmark(
  v <- unlist(my.list),
  v <- unlist(my.list, use.names = FALSE)
)
Unit: microseconds
expr min lq mean
v <- unlist(my.list) 29126.201 29151.1735 29382.2819
v <- unlist(my.list, use.names = FALSE) 252.199 255.9045 260.1722
pmax() and pmin() vs. max() and min()
max and min return the single maximum or minimum of all the values present. pmax and pmin (parallel max/min) take one or more vectors or matrices and return the element-wise maximum or minimum.
x <- c(4, 19, 23, 3, 20)
y <- c(9, 24, 21, 4, NA)
z <- pmax(x, y, 5) # vector as long as larger of x and y
# where z[i] is max of x[i], y[i], and 5
z
[1] 9 24 23 5 NA
These functions' efficiency depends on the dimensions of your data. If the data frame has more rows than columns, pmin and pmax perform well:
library(matrixStats) # provides rowMaxs()
# df is assumed here (for illustration) to be a tall data frame, e.g.:
df <- as.data.frame(matrix(rnorm(1e5), ncol = 10))
microbenchmark(
  apply(df, 1, max),
  do.call(pmax, df),
  rowMaxs(as.matrix(df)), # from matrixStats package
  times = 3)
Unit: microseconds
expr min lq mean median uq
apply(df, 1, max) 16939.440 17061.514 17148.314 17183.587 17252.751
do.call(pmax, df) 781.224 785.802 787.852 790.380 791.166
rowMaxs(as.matrix(df)) 964.132 972.653 1005.031 981.174 1025.481
However, when there are many columns relative to rows, pmax and pmin may no longer be suitable:
x <- matrix(rnorm(1e7, 0, 1), nrow = 10)
df <- as.data.frame(x)
microbenchmark(
  apply(df, 1, max),
  do.call(pmax, df),
  times = 3)
Unit: seconds
expr min lq mean median uq max neval
apply(df, 1, max) 5.929419 6.316609 6.627789 6.703799 6.976974 7.250150 3
do.call(pmax, df) 2.648240 3.067606 3.289333 3.486972 3.609880 3.732787 3
Profiling R Code
In addition to benchmarking the total execution time, one can identify bottlenecks in the code by creating a step-by-step execution summary with R's built-in Rprof() profiler (or interactively with the profvis package):
slowFun <- function(x) {
  ...
}
# Start profiler
Rprof(interval = 0.01)
# run code
x.nomiss <- slowFun(x)
# end profiler
Rprof(NULL)
# view results
summaryRprof()
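For an interactive, visual view of the same information, the profvis package can wrap the code being profiled (a brief sketch, assuming profvis is installed and slowFun is defined as above):
library(profvis)

# Opens an interactive time-per-line / flame-graph view in RStudio or a browser
profvis({
  x.nomiss <- slowFun(x)
})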
Parallelization
R also provides core libraries for parallel computing, such as parallel. These can be used to distribute work across multiple processes, or even across multiple machines in a cluster.
There are generally two ways of creating the workers: SOCK (socket workers start with an empty environment, so datasets must be explicitly passed to the cluster) and FORK (creates N workers, each a copy of the master process; this does not work on Windows).
Parallelization works with for loops or with apply-style functions. Let's review the types of apply functions in R (a short example follows the list):
- apply: apply(X, MARGIN, FUN) applies a function over the rows or columns of a matrix or array
- lapply: a variant of apply() that takes a vector, a list, or a data frame as input and always outputs a list
- sapply: a simplified form of lapply(). It takes a vector, a list, or a data frame as input, just as lapply() does, and tries to reduce the output to the simplest data structure possible, e.g. a vector or a matrix instead of a list
- mapply: apply a function to multiple list or vector arguments
- tapply: apply a function over a ragged array
- rapply: recursively apply a function to a list
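For instance, the difference between lapply() and sapply() is easy to see on a small toy list (the list below is made up purely for illustration):
# Toy list, used only to show the different output types
lst <- list(a = 1:5, b = 6:10)
lapply(lst, mean)  # returns a list of length 2
sapply(lst, mean)  # simplifies the result to a named numeric vector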
library(parallel)

detectCores(logical = FALSE) # number of physical CPU cores
detectCores(logical = TRUE)  # number of logical (hyperthreaded) CPU cores
ncores <- detectCores(logical = FALSE) - 1
# Simple parallel application
# Creates a set of copies of R running in parallel and communicating over sockets
cl <- makeCluster(ncores)
( result <- clusterApply(cl, x = 1:5, fun = factorial) )
# What is the process ID of the workers?
clusterCall(cl, Sys.getpid)
# It is good practice to shut down the workers by calling stopCluster
# even though they should shut down once the listening socket becomes unavailable
stopCluster(cl)
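For the FORK approach mentioned above, mclapply() from the same parallel package forks the current process instead of launching socket workers (a short sketch; it does not work on Windows):
# FORK-based parallelism (Unix-like systems only); mc.cores sets the worker count
result <- mclapply(1:5, factorial, mc.cores = 2)
Because FORK workers are copies of the master process, objects in the current session are available to them without being exported.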
Remember: never use all of your cores! The computer needs at least one for basic operating system functionality.
There are many ways to do parallel computing in R; only the basics are described here.
Conclusion
- Preallocate memory for vectors, matrices, or data frames
- Avoid rbind and cbind whenever possible
- Matrix operations are faster than apply functions, which are (usually) faster than loops, and existing built-in functions are fastest
- Use vectorized functions (such as rowSums, colSums, ifelse, etc.) when possible, as they are implemented in C
- When all of the data are numeric, matrices are faster; creating and accessing rows in a data.frame takes significantly more time
- Index data frame columns by name rather than by column number
- Use conditional if statements to do as little computation as possible
- On extremely large datasets, parallel computing can be used, but be careful
- When it is hard to tell which approach is best, use benchmarking to test it
If you still cannot reach the speed you need, it is possible to write C/C++ code and call it from R with the Rcpp library. R is getting faster with every version, but the algorithm matters more than the language.
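As a minimal sketch of that escape hatch (assuming Rcpp is installed; sumLog below is just an illustration), a small C++ function can be compiled on the fly and called from R:
library(Rcpp)

# Compile a tiny C++ function and expose it to R as sumLog()
cppFunction('
double sumLog(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) total += std::log(x[i]);
  return total;
}')

sumLog(rnorm(1e4, 100, 2))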