Introduction
Why speed up your code? If it works and is readable, why bother optimizing its execution time? Most scripts, even when they comprise hundreds of lines of code, execute in a matter of seconds at most. Generally, rendering figures is the most time-consuming part of an R script, and even that does not take that long.
Now, some people (me, for example) work on scripts that may take dozens of minutes or even hours to execute. During my PhD, I used jagsUI to run MCMC simulations that took a few minutes each, and I had to run a lot of them. At the time I write this post, I am working on an individual-based model of the dynamics of African swine fever in a population of wild boars. In this model, what happens to each boar is modeled each day, so there is a lot going on, and running a single simulation takes some time (sometimes more than two hours, and the model is not finished yet). Reducing the computing time is interesting for me for multiple reasons:
my results are available sooner;
less computing time means less resources used;
I work on a shared computing cluster, so the less time my simulations run, the better for everyone wanting to use the cluster.
For all these reasons, I have spent (and continue to spend) quite some time trying to speed up my code. I have found multiple tricks: some well known, some more obscure, some useful for everyone, and some useful only in specific cases. Here, I would like to share some of these tricks.
General tips
Sequences
In R, there are multiple ways to produce a sequence of numbers. The easiest one is to create a vector with the c() function.
x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
In the example above, we have a regular sequence of integers from one to ten. In this specific case, this is not the optimal way to create the sequence, and most R users would rather use
x2 = seq(1, 10, 1)
or, even simpler,
x3 = 1:10
Now, are these two lines of code identical? They produce the same results, as shown below.
x2 == x3
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
So you could think that both techniques can be used interchangeably. And you would not be wrong. But run these functions a lot of times, and strong differences appear. For the sake of the example, I will use larger sequence sizes.
system.time(
for (i in 1:1e4) {
seq(1, 1e5, 1)
}
)
   user  system elapsed 
   4.57    0.69    5.25
system.time(
for (i in 1:1e8) {
1:1e5
}
)
   user  system elapsed 
   2.61    0.01    2.63
In the code above, I asked 10,000 times for a sequence ranging from one to 100,000 using seq(); on my PC, R delivered in 5.25 seconds. Then I asked 100,000,000 times for the same sequence using the simpler notation, and it delivered in less than half the time, for ten thousand times as many sequences. This is largely because : is a primitive operator, whereas seq() is a regular R function that must match its arguments and dispatch to a method on every call.
Obviously, you will rarely need to generate this many sequences, and my example only works for consecutive sequences of integers1, but hey, it may help someone out there.
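A related aside, not benchmarked above: base R also provides seq_len() and seq_along(), which are primitives like : and therefore comparably fast, while avoiding the classic pitfall of 1:n when n is zero.
n = 0
1:n          # counts down: returns the two values 1 and 0
seq_len(n)   # returns an empty integer vector, usually what you want in a loop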
Loops
There is a “legend” among R users that for loops slow down your code a lot, and that you should rather use *apply() functions to make your code run faster. I would like to bring some nuance to that. Consider the example below.
x1 = x2 = 1:1e6
system.time(
for (i in 1:length(x1)) {
x1[i] = x1[i] * 2
}
)
   user  system elapsed 
   0.03    0.00    0.03
system.time(
{x2 = sapply(x2, function(i) i * 2)}
)
   user  system elapsed 
   0.69    0.00    0.69
In this case, I first used a for loop to multiply by two the values in vector x1. Then I did the exact same operation on vector x2, but using the sapply() function. We see below that the results are exactly the same, but the for loop was about 20 times faster!
table(x1 == x2)
TRUE
1000000
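For completeness (this variant is not timed above): when the operation is itself vectorised, you can skip both the loop and sapply() entirely, and let R perform the multiplication at C level in a single call, which should be faster than either approach.
x3 = 1:1e6
system.time(
x3 <- x3 * 2   # one vectorised operation instead of a million R-level calls
)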
But there are indeed cases where using a function from the *apply() family really speeds up your code. In the example below, I create a data frame with 10,000,000 values (df$x) distributed in 1,000 categories (df$y). Then, for each category, I calculate the average value of df$x using either a for loop or the tapply() function.
set.seed(42)
df = data.frame(
x = runif(1e7),
y = sample(1:1e3, size = 1e7, replace = TRUE)
)
system.time(
for (i in unique(df$y)) {
mean(df$x[df$y == i])
}
)
   user  system elapsed 
  18.82    3.69   22.61
system.time(
tapply(df$x, df$y, mean)
)
   user  system elapsed 
   0.26    0.09    0.36
We can see that using the tapply() function is way, way faster than using the loop. This makes sense: the loop scans the whole df$y vector once per category to build the subset (df$y == i), whereas tapply() splits the data into groups in a single pass before applying mean(). So, in this case, I would strongly encourage using tapply(), especially on really large data sets.
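To make that concrete, here is a sketch (not from the original post) of the grouping strategy: split the values by category once, then apply mean() to each group; this produces the same per-category means as tapply().
# One pass to split df$x into a list of per-category groups,
# then one mean() call per group
means = vapply(split(df$x, df$y), mean, numeric(1))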
So, in the end, when should you use a loop and when should you use an *apply() function? I do not know! It seems to depend on what exactly you are doing, so I encourage you to try both approaches whenever your code gets slow, as sketched below.
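A minimal way to run such a comparison on your own data, using only system.time() as in the examples above (the object names t_loop and t_apply are mine):
# Time each candidate on the real data; repeat a few times,
# since single timings can fluctuate
t_loop = system.time(for (i in unique(df$y)) mean(df$x[df$y == i]))
t_apply = system.time(tapply(df$x, df$y, mean))
rbind(loop = t_loop, apply = t_apply)   # side-by-side timings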
Session info
sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8
[3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C
[5] LC_TIME=French_France.utf8
time zone: Europe/Paris
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.5.1 fastmap_1.2.0 cli_3.6.5
[5] tools_4.5.1 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.50 jsonlite_2.0.0 xfun_0.53
[13] digest_0.6.37 rlang_1.1.6 evaluate_1.0.5
Footnotes
You can transform the output of the sequence, for example
1:1e5 / 2
but doing so increases the execution time, and the amount of time lost depends on the complexity of the transformation.↩︎