Optimizing R Code

It is easy to write slow code in R, in part because the ecosystem lets you get results without thinking about performance. When processing vast amounts of data, performance becomes critical to keeping execution time manageable. There is a limit to how fast R can go, but with some knowledge of how code is executed and of the algorithm being used, its speed is comparable to most other statistical programming languages.
Terminology:
- User time - CPU time charged for the execution of user instructions of the calling process
- System time - CPU time charged for execution by the system on behalf of the calling process
R's system.time() function returns something like:
user system elapsed
12.080 0.160 12.241
We are usually most interested in the total elapsed time of the execution, but there are cases when the user or system time is useful.
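For example, wrapping an expression in a system.time() call produces that three-number summary (an illustrative call; the exact timings will differ on your machine):
# Illustrative only: time a vectorized computation on ten million values
system.time(sum(log(rnorm(1e7, 100, 2))))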
Benchmarking Tools
- We could use microbenchmark for evaluating the time it takes to run a command
- We could use system.time(<expression>), but it may return 0 for functions that run fast
- We could use profvis for code profiling
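Of these, microbenchmark() is used throughout the examples below. A minimal sketch of a call (assuming the microbenchmark package is installed) looks like this; each expression is run many times and the distribution of timings is reported:
library(microbenchmark)

# Compare two ways of computing square roots; each expression runs 100 times
microbenchmark(
  sqrt(1:1e4),
  (1:1e4)^0.5,
  times = 100
)
The library(microbenchmark) call above is also needed for the later microbenchmark() examples in this post.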
Memory Allocation
R loops have a reputation for being slow, but this isn't always true. Most of the time, loops run slowly because they are implemented with dynamic memory allocation, growing objects inside the loop. Consider the following example:
# Growing Matrix = BAD
matr <- NULL
system.time( for(i in seq(1e4)) matr <- rbind(matr, 1:10) )
user system elapsed
4.447 0.012 4.454
# Pre-Allocated Matrix = GOOD
matr <- matrix(NA, nrow=1e4, ncol=10)
system.time( for(i in seq(1e4) ) matr[i,] <- 1:10)
user system elapsed
0.013 0.000 0.013
Consider the loops above, which fill a matrix in two different ways. In the first loop, with rbind,
every iteration allocates a new, larger chunk of memory, copies the previous values, computes the current values, and then de-allocates the old memory. Avoid growing data frames and matrices by allocating all of the memory at the start!
Vectorization
Compare the following two methods of computing the logarithm of each element of a vector and then adding all the elements together:
x <- rnorm(1e4, 100, 2)
microbenchmark({
  # Slow method: loop and accumulate
  total <- 0
  for (i in seq_along(x)) total <- total + log(x[i])
},
{
  # Faster: vectorized
  log_sum <- sum(log(x))
})
Unit: microseconds
expr
{ total <- 0 for (i in seq_along(x)) total <- total + log(x[i]) }
{ log_sum <- sum(log(x)) }
min lq mean median uq max neval cld
2871.843 2928.1920 3171.2189 2971.4760 3333.7445 8537.560 100 b
314.734 319.7805 335.6021 323.7565 357.1405 366.385 100 a
So we observe that computing the entire vector at once is much more efficient than doing it one-by-one in a loop.
In the loop:
- log() is called 1e4 times, each time with a single value
- Memory is allocated 1e4 times to store the resulting value
- The addition and the assignment operation are each performed 1e4 times
Whereas in the vectorized approach:
- log() is called only once
- Memory is allocated once (though for a larger object)
- sum() is called once and the assignment operation is performed once
The vectorized approach also uses a loop at its core, but that loop runs in compiled code, which makes it much faster. Whenever possible, prefer R's built-in vectorized operations over manual implementations. For instance, the ifelse() function is more efficient than an if (...) {} else {} inside a loop, and seq_along(x) is preferable to 1:length(x) for creating index vectors.
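As a rough sketch (reusing the x vector defined above; flag_loop is a made-up helper, for illustration only), compare an explicit loop against ifelse():
# Hypothetical helper: label each element using an explicit loop with if/else
flag_loop <- function(v) {
  out <- character(length(v))
  for (i in seq_along(v)) {
    if (v[i] > 100) out[i] <- "high" else out[i] <- "low"
  }
  out
}
microbenchmark(
  loop       = flag_loop(x),
  vectorized = ifelse(x > 100, "high", "low")
)
Built-in row and column summaries behave the same way: rowMeans() greatly outperforms the equivalent apply() call on the matr matrix created earlier: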
microbenchmark(
apply(matr, 1, mean),
rowMeans(matr) )
Unit: microseconds
expr min lq mean median uq max
apply(matr, 1, mean) 570.057 580.6060 589.59984 587.9365 595.4905 710.220
rowMeans(matr) 24.747 25.5735 26.89985 27.2620 27.7635 36.419
Data Frames vs Matrices
Matrices are faster than data frames because they carry less overhead. Many R functions coerce data frames into matrices internally. Of course, matrices can only contain a single type of data, while data frames are lists of vectors, which makes them ideal for complex datasets.
microbenchmark(
  df = data.frame(
    col1 = sample(1:10, 30, replace = TRUE)
  ),
  mat = matrix(sample(1:10, 30, replace = TRUE), nrow = 30)
)
Unit: microseconds
expr min lq mean median uq max neval cld
df 130.034 134.467 155.33058 135.7825 138.6015 1788.198 100 b
mat 7.121 8.341 30.29864 9.7170 13.3075 1960.421 100 a
Each column of a data frame is stored separately in memory. The most efficient way of extracting a column from a data frame is by using the column name:
# df is assumed here (for illustration) to be a data frame with default
# column names (V1, V2, V3, ...) and at least 50 rows, e.g.:
df <- as.data.frame(matrix(rnorm(300), ncol = 3))
microbenchmark(
  "[50, 3]" = df[50, 3],
  "$V3[50]" = df$V3[50]
)
Unit: nanoseconds
expr min lq mean median uq max neval cld
[50, 3] 11107 11761.5 12669.46 12035.5 12529.5 52854 100 b
$V3[50] 774 1066.0 1352.95 1233.0 1435.5 12708 100 a
Keep It Simple, Stupid
Above all else, write your code to be as simple as possible. Read the function documentation to understand which arguments can be used to avoid extra calculations. For example, the unlist() function returns a named vector by default, but if we are interested only in the values we can pass use.names = FALSE for better performance:
# Create a list
my.list <- split(sample(1000, size = 100000, replace = TRUE), rep(1:10000, times = 10))
# Use unlist function to convert list to a vector
microbenchmark(
  v <- unlist(my.list),
  v <- unlist(my.list, use.names = FALSE)
)
Unit: microseconds
expr min lq mean
v <- unlist(my.list) 29126.201 29151.1735 29382.2819
v <- unlist(my.list, use.names = FALSE) 252.199 255.9045 260.1722
pmax() and pmin() vs. max() and min()
max and min return the single maximum or minimum of all the values present. pmax and pmin (parallel max/min) take one or more vectors or matrices and return the element-wise maximum or minimum.
x <- c(4, 19, 23, 3, 20)
y <- c(9, 24, 21, 4, NA)
z <- pmax(x, y, 5) # vector as long as larger of x and y
# where z[i] is max of x[i], y[i], and 5
z
[1] 9 24 23 5 NA
These functions' efficiency depends on the dimensions of your data. If the data frame has more rows than columns, pmin and pmax perform well:
library(matrixStats) # provides rowMaxs()
# df is assumed here (for illustration) to be a tall data frame, e.g.:
df <- as.data.frame(matrix(rnorm(1e5), ncol = 10))
microbenchmark(
  apply(df, 1, max),
  do.call(pmax, df),
  rowMaxs(as.matrix(df)), # from matrixStats package
  times = 3)
Unit: microseconds
expr min lq mean median uq
apply(df, 1, max) 16939.440 17061.514 17148.314 17183.587 17252.751
do.call(pmax, df) 781.224 785.802 787.852 790.380 791.166
rowMaxs(as.matrix(df)) 964.132 972.653 1005.031 981.174 1025.481
However, when there are many columns relative to rows, pmax and pmin may no longer be suitable:
x <- matrix(rnorm(1e7, 0, 1), nrow = 10)
df <- as.data.frame(x)
microbenchmark(
  apply(df, 1, max),
  do.call(pmax, df),
  times = 3)
Unit: seconds
expr min lq mean median uq max neval
apply(df, 1, max) 5.929419 6.316609 6.627789 6.703799 6.976974 7.250150 3
do.call(pmax, df) 2.648240 3.067606 3.289333 3.486972 3.609880 3.732787 3
Profiling R Code
In addition to benchmarking the total execution time, one can identify bottlenecks in the code by creating a step-by-step execution summary with R's built-in Rprof() profiler (or interactively with the profvis package):
slowFun <- function(x) {
  ...
}
# Start profiler
Rprof(interval = 0.01)
# run code
x.nomiss <- slowFun(x)
# end profiler
Rprof(NULL)
# view results
summaryRprof()
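For an interactive, visual view of the same information, the profvis package can wrap the code being profiled (a brief sketch, assuming profvis is installed and slowFun is defined as above):
library(profvis)

# Opens an interactive time-per-line / flame-graph view in RStudio or a browser
profvis({
  x.nomiss <- slowFun(x)
})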
Parallelization
R also provides core libraries for parallel computing, such as parallel. These can be used to distribute work across multiple processes, or even across multiple machines in a cluster.
There are generally two ways of creating the workers: SOCK (socket workers start with an empty environment, so datasets must be explicitly passed to the cluster) and FORK (creates N workers, each a copy of the master process; this does not work on Windows).
Parallelization works with for loops or with apply-style functions. Let's review the types of apply functions in R (a short example follows the list):
- apply: apply(X, MARGIN, FUN) applies a function over the rows or columns of a matrix or array
- lapply: a variant of apply() that takes a vector, a list, or a data frame as input and always outputs a list
- sapply: a simplified form of lapply(). It takes a vector, a list, or a data frame as input, just as lapply() does, and tries to reduce the output to the simplest data structure possible, e.g. a vector or a matrix instead of a list
- mapply: apply a function to multiple list or vector arguments
- tapply: apply a function over a ragged array
- rapply: recursively apply a function to a list
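For instance, the difference between lapply() and sapply() is easy to see on a small toy list (the list below is made up purely for illustration):
# Toy list, used only to show the different output types
lst <- list(a = 1:5, b = 6:10)
lapply(lst, mean)  # returns a list of length 2
sapply(lst, mean)  # simplifies the result to a named numeric vector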
library(parallel)

detectCores(logical = FALSE) # number of physical CPU cores
detectCores(logical = TRUE)  # number of logical (hyperthreaded) CPU cores
ncores <- detectCores(logical = FALSE) - 1
# Simple parallel application
# Creates a set of copies of R running in parallel and communicating over sockets
cl <- makeCluster(ncores)
( result <- clusterApply(cl, x = 1:5, fun = factorial) )
# What is the process ID of the workers?
clusterCall(cl, Sys.getpid)
# It is good practice to shut down the workers by calling stopCluster
# even though they should shut down once the listening socket becomes unavailable
stopCluster(cl)
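For the FORK approach mentioned above, mclapply() from the same parallel package forks the current process instead of launching socket workers (a short sketch; it does not work on Windows):
# FORK-based parallelism (Unix-like systems only); mc.cores sets the worker count
result <- mclapply(1:5, factorial, mc.cores = 2)
Because FORK workers are copies of the master process, objects in the current session are available to them without being exported.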
Remember: never use all of your cores! The computer needs at least one for basic operating system functionality.
There are many ways to do parallel computing in R; only the basics are described here.
Conclusion
- Preallocate memory for vectors, matrices, or data frames
- Avoid rbind and cbind whenever possible
- Matrix operations are faster than apply functions, which are (usually) faster than loops, and existing built-in functions are fastest
- Use vectorized functions (such as rowSums, colSums, ifelse, etc.) when possible, as they are implemented in C
- When all of the data are numeric, matrices are faster; creating and accessing rows in a data.frame takes significantly more time
- Index data frame columns by name rather than by column number
- Use conditional if statements to do as little computation as possible
- On extremely large datasets, parallel computing can be used, but be careful
- When it is hard to tell which approach is best, use benchmarking to test it
If you still cannot reach the speed you need, it is possible to write C/C++ code and call it from R with the Rcpp library. R is getting faster with every version, but the algorithm matters more than the language.
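As a minimal sketch of that escape hatch (assuming Rcpp is installed; sumLog below is just an illustration), a small C++ function can be compiled on the fly and called from R:
library(Rcpp)

# Compile a tiny C++ function and expose it to R as sumLog()
cppFunction('
double sumLog(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) total += std::log(x[i]);
  return total;
}')

sumLog(rnorm(1e4, 100, 2))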