It’s been a while since I’ve used R for parallel computing. Most of the time, the cost-benefit analysis comes down on the side of running a program sequentially on a single processor. I don’t do that many massive tasks that can take advantage of simple parallelization. When the job is big, rather than spend a few days parallelizing everything, I can do other things while the program runs. std.parallelism is usually sufficient for the remaining cases. It’s an impressive library that makes embarrassingly parallel computing easy.
Things have changed for the paper I’m working on now. It will require (a) lots of simulations, and (b) interactivity. It’s painful waiting for my program to run. Two minutes versus thirty seconds is a big deal when you’re adjusting things after each run - it’s not a situation where I can start the program and come back a couple days later to sift through the full set of results. I’m using an updated version of embedr to call R from D. (If you’re wondering, I use D because it’s a lot easier to structure my programs, as well as the ease with which I can speed up bottlenecks.)
The first step was to relearn parallel random number generation in R. These days, the standard approach is to use the installed-by-default parallel package. Unfortunately, the documentation isn’t going to win any awards, so it took me a while to figure out how to use things. I’m using this post to record my learning for future reference. If it helps you, pay it forward by publicly posting about things as you learn them. If it’s wrong and messes up your life, free advice is worth the price you pad for it.
First set the RNG to L’Ecuyer.
RNGkind("L'Ecuyer-CMRG")
Then set the seed for reproducibility purposes:
set.seed(1)
This sets a variable named .Random.seed
. At first, the name made me think it was a seed chosen randomly, but it is actually just a seed to produce random numbers from different streams. That’s the key. You need to set .Random.seed
differently for each processor. Playing around with this a bit:
> s <- .Random.seed
> s
[1] 10407 1280795612 -169270483 -442010614 -603558397 -222347416 1489374793
> set.seed(2)
> s <- .Random.seed
> s
[1] 10407 -897583247 -1619336578 -714750745 -765106180 158863565
[7] -1093870294
Let’s keep working with set.seed(2)
. We can generate three vectors of two standard normal draws using mclapply
:
library(parallel)
mclapply(1:3, function(z) {rnorm(2) })
The output of this call is
[[1]]
[1] 0.37383817 0.06849643
[[2]]
[1] 0.234000 2.146345
[[3]]
[1] 0.353153 -1.150493
Each element of that list represents a vector of two draws from different streams. Can we replicate this behavior? We can advance streams recursively, by calling nextRNGStream
with the seed of the current stream as the argument. So here is how we can create the seeds for the first three streams, which were used in the call to mclapply
above.
s <- .Random.seed
s2 <- nextRNGStream(s)
s3 <- nextRNGStream(s2)
s4 <- nextRNGStream(s3)
s
s2
s3
s4
This returns the seeds
> s
[1] 10407 -506924930 -224593177 995730940 -1171015987 1920912170
[7] -352071005
> s2
[1] 10407 -190164658 -1918052634 -706344082 1226088896 125998489
[7] 426095482
> s3
[1] 10407 1370265189 125884139 -1024980715 1839014385 1408013702
[7] 1816307261
> s4
[1] 10407 -17969283 -255057871 1208658041 34277843 -1106124573
[7] -416264983
Compare those seeds with the ones used by mclapply
:
> mclapply(1:3, function(z) { .Random.seed }, mc.cores=3)
[1] 10407 -190164658 -1918052634 -706344082 1226088896 125998489
[7] 426095482
[[2]]
[1] 10407 1370265189 125884139 -1024980715 1839014385 1408013702
[7] 1816307261
[[3]]
[1] 10407 -17969283 -255057871 1208658041 34277843 -1106124573
[7] -416264983
So for three cores, it will do the work on the second, third, and fourth streams. What if we use only two cores?
> mclapply(1:3, function(z) { .Random.seed }, mc.cores=2)
[[1]]
[1] 10407 -190164658 -1918052634 -706344082 1226088896 125998489
[7] 426095482
[[2]]
[1] 10407 1370265189 125884139 -1024980715 1839014385 1408013702
[7] 1816307261
[[3]]
[1] 10407 -190164658 -1918052634 -706344082 1226088896 125998489
[7] 426095482
So it uses the second seed over again, but starts at a different part of the stream. The mclapply function calls mcparallel, and it must somehow start at a different part of the same stream from inside there. I haven’t spent any time trying to figure that out.
We’re now in position to mimic the behavior of mclapply
by taking the same draws. Here’s a complete program that you should run from scratch (i.e., restart R first) in order to ensure the state of the generators has not advanced.
> library(parallel)
> RNGkind("L'Ecuyer-CMRG")
> set.seed(2)
# Capture the seeds of the first four streams
> s <- .Random.seed
> s2 <- nextRNGStream(s)
> s3 <- nextRNGStream(s2)
> s4 <- nextRNGStream(s3)
# Verify that the seeds of the streams are the same
> mclapply(1:3, function(z) { .Random.seed }, mc.cores=3)
[[1]]
[1] 10407 700752180 2103112927 29966053 2098487490 711969013 269555427
[[2]]
[1] 10407 -1852663587 543942860 -126724479 -1713719834 2049732810
[7] 1743697782
[[3]]
[1] 10407 1894375638 858767463 -37071576 -803249685 1694193397 -118425015
> s2
[1] 10407 700752180 2103112927 29966053 2098487490 711969013 269555427
> s3
[1] 10407 -1852663587 543942860 -126724479 -1713719834 2049732810
[7] 1743697782
> s4
[1] 10407 1894375638 858767463 -37071576 -803249685 1694193397 -118425015
# Generate three vectors using mclapply
> mclapply(1:3, function(z) { rnorm(2) }, mc.cores=3)
[[1]]
[1] -0.3448399 -1.0700379
[[2]]
[1] 0.69691329 0.04818598
[[3]]
[1] -0.5003859 -1.3517591
# This does not produce numbers generated by the
# call to mclapply - that seed is not used
> .Random.seed <- s
> rnorm(2)
[1] -1.242582 -1.767274
# Seeds s2, s3, and s4 are able to reproduce those results
> .Random.seed <- s2
> rnorm(2)
[1] -0.3448399 -1.0700379
> .Random.seed <- s3
> rnorm(2)
[1] 0.69691329 0.04818598
> .Random.seed <- s4
> rnorm(2)
[1] -0.5003859 -1.3517591
This gives you a couple of options. If you can do all of the random number generation in one shot, just take the easy route and use mclapply
. If you need to do random number generation with other operations mixed in, you should create separate R programs and set the seed for each, then run them using something like GNU parallel.