Statistical sampling in JavaScript

It’s 2016, and that’s an election year, which means we’ll be spending the rest of the summer (and half of the fall) watching America’s most ridiculous spectator sport. The pundits and panels and polls are all quite fun, but I find that the methodology is far more interesting than the results.

One of the greatest weapons in the pollster’s arsenal is sampling, and one of those footnotes you’ll see in opinion polls in the coming weeks is the margin of error. These are basic concepts of statistics, but most people might not know how they work. Worse, some programmers might not. So here’s my attempt at rectifying that.

Definitions

Sampling is, in effect, a way of drawing conclusions about a population (such as a state or country) based on surveying only a small fraction of its members. It’s not perfect, but it turns out that, say, 500 people are actually a pretty good indicator of the rest of the nation…as long as you pick the right 500 people. In terms relatable to current events, a presidential poll that only asks people from rural parts of the South is going to get very different results from one that surveys nothing but New York City. That’s selection bias, and it’s one of the hardest things for pollsters to avoid. They’ve got a few ways around it, such as cold-calling random phone numbers, but it’s always a battle.

That very randomness is why sampling works in the first place. If you truly choose your data points (i.e., the people you ask) randomly, then they will, when put together, approximate the “true” nature of things. The more you get, the closer your picture is to the real thing. Eventually, as you’d expect, your sample size approaches the total population; at that limit, the sample obviously represents the whole to perfection.

For smaller samples, however, things aren’t so perfect. Let’s say we have two candidates: H and T. (You get no points for guessing what those stand for.) In “reality”, they aren’t quite even. Say that H has 50% of the vote, T has 45%, and the remaining 5% are undecided, independent, or whatever. Now, take some sort of random number generator and set it to give numbers from 1 to 100. Everything up to 50 is a “vote” for candidate H, 51-95 are for T, and 96-100 are undecided.

After a single generated number, you’ve got a score of 1 for one of the three, 0 for the other two. Not very predictive, but keep going. With a sample size of 10, my results were H 6, T 3, and 1 undecided. Yours might be different, but notice that that’s already looking a lot closer to the true numbers. Give it 100 tries, and it’s probably even better. (Doing this three times with a different generator, my results were: 52 T, 44 H, 4 undecided; 48 T, 47 H, 5; and 57 T, 40 H, 3. Clearly, this RNG leans right.)

The larger the sample size, the more likely the sample will match the population. If you don’t mind a bit of math, then we can look at just how good a match we can get. The basic formula is e = 1 / sqrt(N), where N is the sample size and e is the margin of error. So, for our sample size of 100 above, the math says that our expected error is somewhere within 1/sqrt(100) = 1/10 = 0.1, or 10% either way. Or, as the polls put it, ±10%. Most samples like this are assumed to be conducted at a 95% confidence interval; this basically means that there’s a 95% chance that the true results lie within that margin. (Note, however, that our third poll in the last paragraph didn’t. It’s an outlier. They happen.)

As counter-intuitive as it may seem, this formula doesn’t really depend on the population size at all, as long as the sample is sufficiently small in relation. For national polls surveying a thousand or so people, that assumption holds, so they can safely tout a margin of error of ±3% from their sample of 1,016.

The code

Now we’ll look at how you can do your own sampling. This isn’t just for opinion polls, though. Any kind of analytics could make use of sampling.

The basic function, in JavaScript, would look something like this:

/*
 * Select `k` random choices from a population
 */
function sample(population, k) {
    var psize = population.length;
    var choices = new Set();
    var result = [];

    // Choose `k` elements from the population, without repeats
    for (var i = 0; i < k; i++) {
        var ch;

        do {
            ch = Math.trunc(Math.random() * psize);
        } while (choices.has(ch))

        choices.add(ch);
    }

    for (var c of choices) {
        result.push(population[c]);
    }

    return result;
}

As always, this isn’t the only way to do what we’re trying to do, but it’s very close to what Python’s random.sample function does, so the idea is sound. To get our sample, we generate a number of array indexes, and the Set guarantees we won’t get any repeats. Our result is a new array containing only those elements we chose. We can then do whatever we need.

But how do we determine what sample size we need? Well, one way is to work backwards from the margin of error we want. Remember that this usually won’t depend on the size of the population.

/*
 * Given a desired margin of error,
 * find an appropriate sample size.
 */
function sampleSize(margin) {
    return Math.round(1 / (margin*margin), 0);
}

This is nothing more than a rearrangement of our formula. Instead of saying e = 1 / sqrt(N), we move things around to solve for N: N = 1 / e^2. Rounding to the nearest integer is just for convenience.

In a nutshell, that’s all there is behind those opinion polls you’ll be hearing so much about over the coming months. Pick enough people at random, and you’ll get a picture of the general populace. It’ll be a blurry picture, but even that’s enough to make out some of the larger details.

Randomness and V8 (JS)

So I’ve seen this post linked in a few places recently, and I thought to myself, “This sounds interesting…and familiar.”

For those of you who don’t quite have the technical know-how to understand what it means, here’s a quick summary. Basically, the whole thing is a flaw in the V8 JavaScript engine’s random number generator. V8, if you don’t know, is what makes JavaScript possible for Chrome, Node, and plenty of others. (Firefox uses a different engine, and Microsoft’s browsers already have far bigger problems.) In JavaScript, the RNG is accessed through the function Math.random(). That function is far from perfect as it is. There’s no need to make it worse.

But V8, until one of the newest versions (4.9.40), actually did make it worse. An outline of their code is in the post above, but the main problems with it are easy to explain. First, Math.random() returns JavaScript numbers—i.e., 64-bit floating-point numbers—between 0 and 1. The way those numbers work leaves the algorithm 52 bits to play with, but V8’s code worked by converting a 32-bit integer into a floating-point number. That’s a pretty common operation, and there’s nothing really wrong on the face of it. Well, except for the part where you’re throwing away 20 out of your 52 random bits.

Because of the way V8’s RNG algorithm (MWC1616), this gets even better. MWC stands for “multiply with carry”, an apt description of what we’re dealing with. Internally, the code has two state variables, each a 32-bit unsigned integer, or uint32_t. These start off as seeded values (JavaScript programmers have no way of influencing this part, unfortunately), and each one undergoes a simple transformation: the low 16 bits are multiplied by one of two “magic” constants, then added to the high 16 bits. The function then creates its result in two parts, with the upper half of the result coming from one state variable’s lower half, while the lower 16 bits are taken from the other state’s upper half.

The whole thing, despite its shell-game shifting of bits, is not much more than a pair of linear congruential generators welded together. LCGs have a long history as random generators, because they’re easy to code, they’re fast, and they can give okay randomness for simple applications. But now that JavaScript is being used everywhere, the cracks are starting to appear.

Since V8’s Math.random() implementation uses 32-bit numbers and none of the “extra” state found in more involved RNGs, you’re never getting more than 2^32^ random numbers before they start repeating. And I do mean repeating, as linear congruential generators are periodic functions. Given the same state, they’ll produce the same result; generate enough random numbers, and you’ll repeat a state, which restarts the cycle. But that 2^32^ is a maximum, and you need some planning to get it. The magic numbers that make an LCG work have to be chosen carefully, or you can sabotage the whole thing. All the bit-shifting tricks are little more than a distraction.

So what can you do? Obviously, you, as a user of Chrome/Node/V8/whatever, can upgrade. The new algorithm they’re using is xorshift128+, which is highly regarded as a solid RNG for non-cryptographic work. (If you’re interested in how it works, but you don’t think you can read C++, check out the second link above, where I roll my own version in JavaScript.) Naturally, this doesn’t fix all the other problems with Math.random(), only the one that caused V8’s version of it to fail a bunch of the statistical tests used to quantify how “good” a specific RNG is. (The linked blog post has a great visualization to illustrate these.) Seeded, repeatable randomness, for example, you’ll still have to handle yourself. But what we’ve got is good enough for a lot of purposes, and it’s now a better foundation to build upon.

Random Number Distributions (JS)

Last week, I talked about a method for choosing random values within a set, one of the many uses for random numbers in a game. This time, I’ll go deeper into the notion of probability distributions, and how you can make use of them in game development and other kinds of programming. A word of warning: this post will have code (usually Javascript) and might use a lot of math!

Simply Random

The uniform distribution is the one you already know. It’s behind all the simple RNG functions we saw before, like C’s rand() or Javascript’s Math.random(), and it’s what you get when you roll a single die: every number has an equal chance of coming up. In math terms, if there are n possibilities, then the chance of getting any specific one is 1/n. Couldn’t be simpler, really. For most game uses, this is your main tool, and the “weighted set” from the other post is a handy add-on.

The Bell Curve

I mentioned the bell curve and the normal distribution last time, but here I’ll go into a little bit more detail. The bell curve, of course, is often an early introduction to the world of statistics and terms like “standard deviation”, and (as we saw in the other post) it’s what you start getting when you roll more and more dice at a time.

Obviously, that’s one way to get a normal distribution: roll a bunch of dice and add them together:

var dieRoll = function() { return Math.floor(Math.random() * 6) + 1; }
var bellCurveRoll = 0;
for (var i = 0; i < 10; i++) {
    bellCurveRoll += dieRoll();
}

This gives us something close (though not perfect): random numbers will range from 10 to 60, with 35 being the most common. If we need something better (i.e., more accurate), we’ll need to delve into probability theory. Don’t worry, it’s only a short dive.

Side Quest

The mean might be familiar to you, since it’s not much more than another name for “average”. For a normal distribution, the mean is the center point of the curve, the part where it’s at its highest.

Less familiar is the standard deviation, although you may remember that from high school or early college. It’s important in a lot of stats-based fields, because it measures how “clustered” a group of data points is.

Now, for a set of data, you can calculate the mean and standard deviation. That’s pretty much how things like grading curves work: you take the data and make a distribution that fits. For RNGs, however, we have to work backwards. Instead of calculating the mean and standard deviation for the data, we choose them at the start, and our RNG uses them to provide the proper distribution of values. For the normal distribution, this means that about 68% of the values will be within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3. (By the way, the standard deviation is usually identified in math by the Greek letter sigma: σ. “3-sigma” confidence, then, is 99.7% likelihood that something is true. “Six sigma” encompasses everything within 6 standard deviations of the mean, or about 99.9999998%; it’s not quite “one in a billion”, but it’s pretty close.)

Back to Normal

So, knowing all this boring math, you can start getting your random numbers. Here’s one way of doing it in Javascript:

function normalDistribution (mu, sigma) {
    var u1 = Math.random();
    var u2 = Math.random();

    var z0 = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(Math.PI*2 * u2);
    var z1 = Math.sqrt(-2.0 * Math.log(u1)) * Math.sin(Math.PI*2 * u2);

    return z0 * sigma + mu;
}

This uses a method called the Box-Muller transform to generate a random number on a bell curve. (The astute reader will notice that the function actually generates two random numbers, but throws one of them away. If you like, you can make a better function that stores the second value and returns it as the next random number.) The two parameters are our mean (mu, because it is often written as the Greek letter μ) and the standard deviation (sigma). The Wikipedia link above explains the theory behind this method, as well as giving links to other ways, some of which might be faster.

Exponential Randomness

Normal distributions have their uses, and they’re about as far as most people get in studying randomness. But we won’t stop there. We’ll move on.

Next up is the exponential distribution. This one isn’t that hard to make in code:

function expoDistribution (lambda) {
    return -Math.log(1.0 - Math.random()) / lambda;
}

But you might wonder if it’s really useful. Well, as it turns out, it does come in handy sometimes. Basically, the exponential distribution can be used anywhere you need a “time between events” type of randomness, like random events firing in a strategy game.

The lambda parameter is what’s called a “rate parameter”, and it’s intimately related to the mean; in fact, it’s the reciprocal: the exponential distribution’s mean is 1 / lambda. But what does it mean? Let’s take an example: say you’re making a game where special power-ups appear, on average, twice every minute. Using the above function with lambda as 2, the result would be the time (in minutes) until the next power-up. (The mean would then be 1/2, or 30 seconds, which makes intuitive sense.)

Pareto, Principle, and Power

If you’ve ever heard of the “Pareto Principle” or the “80-20 rule”, then you’ve got a head start on learning about the Pareto distribution. The general idea is thus: there are a few with a lot, and a lot with a little. “The top 20% control 80% of the wealth” is the most common way to state this particular idea, but a random distribution with these properties can be surprisingly useful in games. Here’s the code:

function paretoDistribution (minimum, alpha) {
    var u = 1.0 - Math.random();
    return minimum / Math.pow(u, 1.0 / alpha);
}

We have two parameters we can control this time. minimum is the lowest possible value that can be returned, while alpha controls the “shape” of the distribution. (Generally, higher values of alpha have a faster drop-off, meaning that lower values have higher probability. The “80-20” distribution has an alpha of log(5) / log(4), or about 1.161.)

The Pareto distribution isn’t just used for wealth, though. Anywhere you have a “long tail”, it might be what you’re looking for. Random city sizes (or star sizes, for an interstellar game) follow a Pareto distribution, and it might be a good way to model “loot drops” in an RPG; less powerful objects are far more common than the Ultimate Sword of Smiting, after all.

In Closing

These aren’t all that’s available, but the four distributions I’ve shown are probably the most useful for programmers, especially game developers. As before, I didn’t originally come up with any of this; it’s all been known since before I was born. Some code is converted from Python’s extensive random module, specifically the functions random.expovariate() and random.paretovariate(). The code for the normal distribution is my Javascript conversion of the Box-Muller transform example at the Wikipedia link above. (By the way, if you want me to post examples for a different language, just ask!)

A lot of people already know all this material, and there are plenty of other sites detailing them better than I can, but I hope that this is enough to get you interested.

Weighted Random Choices (JS)

In programming, we often need to generate random numbers, especially for games. Almost any game will use at least some kind of random number generator (RNG), and most need lots of them to drive the AI, make new levels, and so on.

Any programming language worth using (for games, anyway) will have some way of making an RNG. C has rand(), Javascript has Math.random(), and so on. Game engines usually add in their own ways, like Unity’s Random.value, constructed out of the base provided by whatever language they’re written in. All of these work in basically the same way: you ask the RNG for a value, and it gives you one. Usually, it’s a floating-point number between 0 and 1, which is good if that’s what you need.

Beginners starting out in game development quickly learn how to turn the float value from 0 to 1 (not so useful) into a number in the right range (more useful). The code for rolling a regular, six-sided die usually looks something like this (in Javascript):

var roll = Math.floor(Math.random() * 6) + 1;

It’s a pretty standard technique, and many languages and game engines now have functions that do this for you. And, again, if that’s what you need, then it’s perfect. Sometimes, though, it’s not what you need. This post is for one of those times. Specifically, the case where you need to choose one value from a list.

Simple Choices

When you simply need to pick a single value out of a list (array, vector, whatever), that’s easy. All you’re really doing is the same thing as rolling a die:

var index = Math.floor(Math.random() * array.length);
var choice = array[index];

In effect, we’re choosing a random array index. Easy. Like rolling a die, it gives you an equal chance for each possible value. (If your array contains the numbers from 1 to 6, then you’ve just created a less efficient method of rolling a die.)

Harder Choices

Not everything has an equal probability, though. (Statistics as a science wouldn’t exist if it did. Whether that’s a good or bad thing depends on the way you look at it, I guess.)

Some things can be approximated with the die-rolling method above. Lotteries, for example, work the same way, as do raffles. For some games, that may be all you really need. But other games require more from their RNGs than this.

Take the case of rolling multiple dice. Anybody who has played a role-playing game, or craps in a casino, or even Monopoly knows that, while every number on a die has an equal chance of popping up, things get more complicated when you add more dice. With two dice, for example, a total of 7 is more common than a 2 or 12, because there are more ways to roll numbers that add up to 7: 1+6, 6+1, 2+5, 5+2, 3+4, and 4+3. Add more dice, and the numbers get bigger, the range gets wider, but the idea stays the same: numbers “in the middle” are more likely than those on the outside. (Due to what’s called the central limit theorem, the more dice you roll, the more the graph of possible outcomes starts to resemble a bell curve.)

Rolling a lot of dice is impractical, even on a computer. The probabilities aren’t always so nice and neat that a bell curve works. Maybe you need to choose from a set where the ends are more common than the middle, or a list with a weird distribution of frequencies like the letters in English text. (One out of every eight letters, on average, is E, but Z is over a hundred times less common.) A word game, for instance, would certainly need to do something like this.

Now, there are plenty of different ways of generating random numbers based on frequencies. Here, I’m only going to describe what I think is the simplest. First, we need a problem. Let’s say you’re making a word game that, for whatever reason, uses the letter tiles from Scrabble. (You probably wouldn’t be making an actual Scrabble game, because some people or lawyers might not like that, but we’ll say you’re using the letter frequencies.) Looking at that link, you can probably see that just using random choices won’t help you. We need something different.

First things first, let’s define the set we’re choosing from (note the space at the end, which represents the blank tiles):

var letters = ['a','b','c','d','e','f','g','h',
    'i','j','k','l','m','n','o','p','q','r','s',
    't','u','v','w','x','y','z',' '];

Since this is an uneven distribution, we need some way of representing each probability. In this method, we do this by “weighting” each value. We’ll store these weights in their own array:

var weights = [9, 2, 2, 4, 12, 2, 3, 2,
    9, 1, 1, 4, 2, 6, 8, 2, 1, 6, 4,
    6, 4, 2, 2, 1, 2, 1, 2];

In this case, the sum of all the weights is 102 (100 letters, 2 blanks). Therefore, the ratio of each weight to that sum is the frequency of the letter. (For example, there are 12 E tiles, so E’s frequency is 12/102, or about 11.8%.) That’s the key to this method. Basically, we do something like this:

function randomLetter() {
    var totalWeight = 0;

    for (var w in weights) {
        totalWeight += w;
    }

    var random = Math.floor(Math.random() * totalWeight);

    for (var i = 0; i < letters.length; i++) {
        random -= weights[i];

        if (random < 0) {
            return letters[i];
        }
    }
}

(Obviously, in a real game, we’d need something much more robust, with better error handling, etc., but this is enough to illustrate the point. Also, this isn’t normally how I’d write this function, but I’ve simplified for the same reason.)

The function works by picking a random number from 0 up to the sum of the weights. We then use that as an “index”, but not directly into the array. Instead, we count down from our chosen number, each time subtracting successive weights, until we go below zero, where we produce the corresponding letter.

Let’s say that our random number is 15. We go through the weights array, starting with 9. Subtracting 9 from 15 leaves 6, so we keep going, down by 2 to 4, then by 2 again to 2, then by 4 down to -2. That’s below zero, so that’s where we stop, returning 'd'.

This method isn’t just limited to choosing letters. You can use it anywhere you need a biased sample. Think of a game that has five different types of enemies, each with different chances of spawning. Set up a list of enemy types, another holding their appearance frequencies, and the same method will give you waves of bad guys.

(Note: I’m not claiming credit for any of this. It was all figured out a long time ago, and certainly not by me. I’m not even the first to write about it online. But it’s definitely something beginners should learn, and I hope to post more little “tutorial” articles like this in the coming months, because we were all young, once.)