Cobbling together parallel random number generation in Julia

I’m starting to work on some computationally demanding projects, (Monte Carlo simulations of bootstraps of out-of-sample forecast comparisons) so I thought I should look at Julia some more. Unfortunately, since Julia’s so young (it’s almost at version 0.3.0 as I write this) a lot of code still needs to be written. Like Random Number Generators (RNGs) that work in parallel. So this post describes an approach that parallelizes computation using a standard RNG; for convenience I’ve put the code (a single function) is in a grotesquely ambitiously named package on GitHub: ParallelRNGs.jl. (Also see this thread on the Julia Users mailing list.)

A few quick points about RNGs and simulations. Most econometrics papers have a section that examines the performance of a few estimators in a known environment (usually the estimators proposed by the paper and a few of the best preexisting estimators). We do this by simulating data on a computer, using that data to produce estimates, and then comparing those estimate to the parameters they’re estimating. Since we’ve generated the data ourselves, we actually know the true values of those parameters, so we can make a real comparison. Do that for 5000 simulated data sets and you can get a reasonably accurate view of how the statistics might perform in real life.

For many reasons, it’s useful to be able to reproduce the exact same simulations again in the future. (Two obvious reasons: it allows other researchers to be able to reproduce your results, and it can make debugging much faster when you discover errors.) So we almost always use pseudo Random Number Generators that use a deterministic algorithm to produce a stream of numbers that behaves in important ways like a stream of independent random values. You initialize these RNGs by setting a starting value (the “pseudo” aspect of the RNGs is implicit from now on) and anyone who has that starting value can reproduce the identical sequence of numbers that you generated. A popular RNG is the “Mersenne Twister,” and “popular” is probably an understatement: it’s the default RNG in R, Matlab, and Julia. And (from what I’ve read; this isn’t my field at all) it’s well regarded for producing a sequence of random numbers for statistical simulations.

But it’s not necessarily appropriate for producing several independent sequences of random numbers. Which is vitally important because I have an 8 core workstation that needs to run lots of simulations, and I’d like to execute 1/8th of the total simulations on each of its cores.

There’s a common misconception that you can get independent random sequences just by choosing different initial values for each sequence, but that’s not guaranteed to be true. There are algorithms for choosing different starting values that are guaranteed to produce independent streams for the Mersenne Twister (see this research by one of the MT’s inventors), but they aren’t implemented in Julia yet. (Or in R, as far as I can tell; they use a different RNG for parallel applications.) And it turns out that Mersenne Twister is the only RNG that’s included in Julia so far.

So, this would be a perfect opportunity for me to step up and implement some of these advanced algorithms for the Mersenne Twister. Or to implement some of the algorithms developed by L’Ecuyer and his coauthors, which are what R uses. And there’s already C code for both options.

But I haven’t done that yet. I’m lazy busy.

Instead, I’ve written an extremely small function that wraps Julia’s default RNG, calls it from the main process alone to generate random numbers, and then sends those random numbers to each of the other processes/cores where the rest of the simulation code runs. The function’s really simple:

function replicate(sim::Function, dgp::Function, n::Integer)
    function rvproducer()
        for i=1:n
    return(pmap(sim, Task(rvproducer)))

That’s all. If you’re not used to Julia, you can ignore the “::Function” and the “::Integer” parts of the arguments. Those just identify the datatype of the argument and you can read it as “dgp_function” if you want (and explicitly providing the types like this is optional anyway). So, you give “replicate” two functions: “dgp” generates the random numbers and “sim” does the remaining calculations; “n” is the number of simulations to do. All of the work is done in “pmap” which parcels out the random numbers and sends them to different processors. (There’s a simplified version of the source code for pmap at that link.)

And that’s it. Each time a processor finishes one iteration, pmap calls dgp() again to generate more random numbers and passes them along. It automatically waits for dgp() to finish, so there are no race conditions and it produces the exact same sequence of random numbers every time. The code is shockingly concise. (It shocked me! I wrote it up assuming it would fail so I could understand pmap better and I was pretty surprised when it worked.)

A quick example might help clear up it’s usage. We’ll write a DGP for the bootstrap:

const n = 200     #% Number of observations for each simulation
const nboot = 299 #% Number of bootstrap replications
addprocs(7)       #% Start the other 7 cores
dgp() = (randn(n), rand(1:n, (n, nboot)))

The data are iid Normal, (the “randn(n)” component) and it’s an iid nonparametric bootstrap (the “rand(1:n, (n, nboot))”, which draws independent values from 1 to n and fills them into an n by nboot matrix). Oh, and there’s a good reason for those weird “#%” comments; “#” is Julia’s comment character, but WordPress doesn’t support syntax highlighting for Julia, so we’re pretending this is Matlab code. And “%” is Matlab’s comment character, which turns the comment green.

We’ll use a proxy for some complicated processing step:

@everywhere function sim(x)
    nboot = size(x[2], 2)
    bootvals = Array(Float64, nboot)
    for i=1:nboot
        bootvals[i] = mean(x[1][x[2][:,i]])
    confint = quantile(bootvals, [0.05, 0.95])
    sleep(3) #% not usually recommended!
    return(confint[1] < 0 < confint[2])

So “sim” calculates the mean of each bootstrap sample and calculates the 5th and 95th percentile of those simulated means, giving a two-sided 90% confidence interval for the true mean. Then it checks whether the interval contains the true mean (0). And it also wastes 3 seconds sleeping, which is a proxy for more complicated calculations but usually shouldn’t be in your code. The initial “@everywhere” is a Julia macro that loads this function into each of the separate processes so that it’s available for parallelization. (This is probably as good a place as any to link to Julia’s “Parallel Computing” documentation.)

Running a short Monte Carlo is simple:

julia> srand(84537423); #% Initialize the default RNG!!!
julia> @time mc1 = mean(replicate(sim, dgp, 500))

elapsed time: 217.705639 seconds (508892580 bytes allocated, 0.13% gc time)
0.896 #% = 448/500

So, about 3.6 minutes and the confidence intervals have coverage almost exactly 90%.

It’s also useful to compare the execution time to a purely sequential approach. We can do that by using a simple for loop:

function dosequential(nsims)
    boots = Array(Float64, nsims)
    for i=1:nsims
        boots[i] = sim(dgp())
    return boots

And, to time it:

julia> dosequential(1); #% Force compilation before timing
julia> srand(84537423); #% Reinitialize the default RNG!!!
julia> @time mc2 = mean(dosequential(500))

elapsed time: 1502.038961 seconds (877739616 bytes allocated, 0.03% gc time)
0.896 #% = 448/500

This takes a lot longer: over 25 minutes, 7 times longer than the parallel approach (exactly what we’d hope for, since the parallel approach runs the simulations on 7 cores). And it gives exactly the same results since we started the RNG at the same initial value.

So this approach to parallelization is great… sometimes.

This approach should work pretty well when there aren’t that many random numbers being passed to each processor, and when there aren’t that many simulations being run; i.e. when “sim” is an inherently complex calculation. Otherwise, the overhead of passing the random numbers to each process can start to matter a lot. In extreme cases, “dosequential” can be faster than “replicate” because the overhead of managing the simulations and passing around random variables dominates the other calculations. In those applications, a real parallel RNG becomes a lot more important.

If you want to play with this code yourself, I made a small package for the replicate function: ParallelRNGs.jl on GitHub. The name is misleadingly ambitious (ambitiously misleading?), but if I do add real parallel RNGs to Julia, I’ll put them there too. The code is still buggy, so use it at your own risk and let me know if you run into problems. (Filing an issue on GitHub is the best way to report bugs.)

P.S. I should mention again that Julia is an absolute joy of a language. Package development isn’t quite as nice as in Clojure, where it’s straightforward to load and unload variables from the package namespace (again, there’s lots of code that still needs to be written). But the actual language is just spectacular and I’d probably want to use it for simulations even if it were slow. Seriously: seven lines of new code to get an acceptable parallel RNG.

Debugging MLE code

A grad student emailed me about problems she’s running into with MLE in R. In short, the parameter estimates are fairly sensitive to the optimization’s starting values, which is a problem. Since it might be useful for other people, this is the reply I sent (any other suggestions would be welcome!):

Those problems can show up when the likelihood function is relatively flat or when the likelihood has local maxima. You can diagnose these problems a few ways.

  1. Plot the likelihood function. If it only depends on a single parameter, this is really easy. If it depends on several parameters, you’ll want to fix all but one or two of the parameters and plot it as a function of the remaining one or two. (And then repeat this process for different values of the other parameters, and repeat it while fixing and varying other parameters.)
  2. Check it for simulated data with *lots* of observations (e.g. millions). When there are a lot of observations, the likelihood should be quadratic around the true parameter values, so there should be fewer problems with local maxima.
  3. Do them both: make plots with lots of observations and verify that the likelihood is approximately quadratic around the true parameter value.

If you’re supplying a gradient function it can cause problems as well. You can compare the analytical gradient to the output from numericDeriv to verify that your gradient is correct. (numericDeriv is for R code, which is what the student asked about specifically; other languages should have similar functions.)

An econometrician’s reply to “How do economists figure out how the world really works?”

Editorial note: this post originally appeared on Medium; I’m moving it here with some minor edits. —Gray

Mark Thoma has a new post/column on econometrics, “How do economists figure out how the world really works.” Usually this would be great, but he kind of overstates how awesome we are and I want to clear a few things up. All of the quotations below are from Thoma’s column.

Experimental vs. observational studies

Thoma: The essential difference between statistics and econometrics is the inability of economists to perform laboratory experiments where the effect of one variable on another can be examined while holding “all else equal…”

Thus, while data from laboratory experiments is usually confined to two variables and the experiments can be performed repeatedly to ensure a single observation is not a statistical fluke, economists must use historical data — a single observation — where all else is definitely not equal.

This is an oversimplification. We don’t have very much experimental data in economics; that part is true. (Although we do have some. One of the biggest research areas over the last 10 years has been on implementing medium-scale “field experiments.”) But lots of statisticians work primarily with nonexperimental data. Surveys and polls are not experiments. Most public health studies are not experiments and epidemiology is nonexperimental. Any time-series data is usually not coming from an experiment. I could go on and on.

There is an important distinction that I think Thoma’s trying to make here, though. Econometricians usually worry much more about how their subjects make decisions than statisticians do, and how those decisions will affect the outcomes of interest. Here’s an example of what I mean: suppose we want to know how getting a college degree affects labor market outcomes — to make it even more specific, suppose we want to know whether the starting salary of college graduates is higher on average than their salary would be if they had started to look for a job immediately after high-school. This is an example of the sort of thing labor economists research all the time.

Now, this would be easy to answer if we could do an experiment: match up pairs of comparable college applicants, admit one of them to college, and prevent the other from going to college. (Let’s ignore the issue of how representative this sample would be. In this imaginary experiment, pretend we could force high-school graduates to go to college too.) But that experiment is unethical and would never happen, so a naive alternative approach is to track high-school students who go to college and students who don’t go to college, and compare their salaries in four years. (Also pretend that everyone who goes to college graduates in exactly four years. I know that’s not true, and a real research study would have to account for different outcomes in college. But accounting for all of those details would detract from the main point I’m trying to make.)

Linear regression/multiple regression analysis

Thoma: For this reason, economists use a technique called multiple regression analysis. Essentially, what this means is that the effect of the treatment on the outcome is examined by including a (sometimes large) set of controls to account for all of the variables that cannot be held constant.

“Multiple regression analysis” is another way of saying “linear regression” so I’ll use that term. And every empirical researcher in every discipline learns linear regression. It’s usually the first method taught! So I don’t know why Thoma’s trying to claim that it’s a special approach used by economists alone.

Here’s how it might work in the college example. Instead of comparing all of the students who went to college against all of the students who didn’t, we might match them up based on SAT scores, GPAs, demographic information, and anything else we thought was relevant. So we look at all of the students who have a GPA on graduation between 3.8 and 4.0, an SAT score between 2200 and 2400, are white females who live in rural areas, etc., and compare the salaries in 4 years of the students in that category who went to college to those who did not. Then do the same thing for every other way of splitting up these variables. What I’ve described is a version of linear regression. If you have strong evidence of exactly how a variable like GPA affects salaries in 4 years, you can do a more sophisticated version of this approach, and that’s probably closer to what Thoma has in mind.

Again, this is a statistical approach that was invented by statisticians and is used by researchers in every empirical discipline for observational data and for experimental data alike. It is in no way unique to economics.

It also doesn’t work very well in economics, and ironically this is where econometrics starts to diverge from more conventional statistics. And the reason that it doesn’t work very well, of course, is that rural white females with high GPAs and high SAT scores who choose to go to college tend to be different from those who choose to immediately look for a job. And we know they’re different, first and foremost, because they choose to go to college.

All else equal, if someone expects to benefit from college, he she is more likely to attend. And if someone expects to benefit less from college, he or she is less likely to attend. Unless these students are completely unable to predict whether they’ll benefit from college — and “completely unable” isn’t just rhetoric here, “mostly unable” isn’t enough — the high-school graduates attending college are more likely to benefit than the high-school graduates who aren’t attending.

This biases any estimates of the effects of college attendance. Instead of getting the pure effect, we’d be estimating a mixture of the effect of attending college and the effects of all of the unobservable factors that make someone benefit from college. Even if college itself would offer no benefit to the high-school graduates who have chosen not to go, we would still (incorrectly!) measure a positive benefit.

So much for linear regression.

(Another editorial note from the future, in retrospect I’m projecting too much. Linear regression is used by economists all the time. But it shouldn’t be! —Gray)

Now, econometricians aren’t the only people concerned about these selection issues. Paul Rosenbaum, just as one example, has written a lot about these issues, and he’s a statistician. (He even has great books and papers with titles like “Observational Studies,” just to get to back to the experimental vs. observational issue from earlier.) But these selection issues are front and center in econometrics: they are the biggest and most important concern. So we don’t use linear regression very often (or, at least, we don’t trust linear regression very much. It still gets used.), and a large part of econometrics has been devoted to finding alternatives.

Empirical research in Macroeconomics

Thoma: But there is one important caveat, something that is particularly problematic for testing macroeconomic theories. Most macroeconomists know the data on GDP, employment prices, interest rates, productivity and so on fairly well. So it is not very useful to build a theoretical model to explain these data, and then test to see how well it fits. Of course the model would fit. After all, why build a model that is inconsistent with the data you already know about?

And that is the key — to use data researchers did not know about when the model was built. Testing models against data that is revealed only after the model is built is the best way to do this….

Most macroeconomic models do not fit well by any measure. The point that I suspect Thoma’s trying to make is that models look unrealistically good when you evaluate them with the same data you used to build the model. When you evaluate them with new data, they start to look much worse. And this is completely true. The go-to reference is Meese and Rogoff (1983, J. Intl. Econ.), a study that looked at whether the exchange rate models of the day could out-forecast a simple benchmark model: the number zero. (tl;dr: the models couldn’t, “no change” won. This was very surprising.)

Last thoughts

I’ll wrap up by summarizing my own thoughts. Econometrics has been very influential in macroeconomics, but not because of linear regression or out-of-sample evaluation, and it’s benefited enormously from the contributions of dyed-in-the-wool statisticians. And econometrics has been really useful in learning how to estimate and build macroeconomic models. (I should mention that three recent Nobel Prizes have been awarded at least in part for time-series econometrics: 2003 (Engle and Granger), 2011 (Sargent and Sims), and 2013 (Hansen). None of them do out-of-sample evaluation. If you go to the Nobel Prize website, you can find descriptions of their research!)

But we really learn a lot about the economy when the existing models fail: the Great Depression and recovery taught economists a lot about the business cycle; the sudden increase in inflation in the late 70s and the subsequent recession and disinflation in the early 80s taught us a lot about inflation; and the current financial crisis is teaching us a lot about the importance of the financial sector in the economy; etc. These are extremely expensive experiments to run and we badly need to figure out how to avoid them.

Unemployment graphs updated with latest data

US unemployment rate through April, 2014

We’ve updated our graphs with the BLS released the latest unemployment data. April’s unemployment rate was estimated at 6.3%, which is lower than March’s 6.7%. As I say every month, there’s a pretty clear downward trend, and trying to guess if it’s changed based on the last few months of data is pretty useless; 6.7% is consistent with the recovery we’ve seen for the last several years. (At some point, maybe this summer if I have time, I’ll need to quantify how little we learn from each revision so I can stop typing this disclaimer month after month.)

Some links from the usual suspects: