It’s 2016, and that’s an election year, which means we’ll be spending the rest of the summer (and half of the fall) watching America’s most ridiculous spectator sport. The pundits and panels and polls are all quite fun, but I find that the methodology is far more interesting than the results.
One of the greatest weapons in the pollster’s arsenal is sampling, and one of those footnotes you’ll see in opinion polls in the coming weeks is the margin of error. These are basic concepts of statistics, but most people might not know how they work. Worse, some programmers might not. So here’s my attempt at rectifying that.
Definitions
Sampling is, in effect, a way of drawing conclusions about a population (such as a state or country) based on surveying only a small fraction of its members. It’s not perfect, but it turns out that, say, 500 people are actually a pretty good indicator of the rest of the nation…as long as you pick the right 500 people. In terms relatable to current events, a presidential poll that only asks people from rural parts of the South is going to get very different results from one that surveys nothing but New York City. That’s selection bias, and it’s one of the hardest things for pollsters to avoid. They’ve got a few ways around it, such as cold-calling random phone numbers, but it’s always a battle.
That very randomness is why sampling works in the first place. If you truly choose your data points (i.e., the people you ask) randomly, then they will, when put together, approximate the “true” nature of things. The more you get, the closer your picture is to the real thing. Eventually, as you’d expect, your sample size approaches the total population; at that limit, the sample obviously represents the whole to perfection.
For smaller samples, however, things aren’t so perfect. Let’s say we have two candidates: H and T. (You get no points for guessing what those stand for.) In “reality”, they aren’t quite even. Say that H has 50% of the vote, T has 45%, and the remaining 5% are undecided, independent, or whatever. Now, take some sort of random number generator and set it to give numbers from 1 to 100. Everything up to 50 is a “vote” for candidate H, 51-95 are for T, and 96-100 are undecided.
After a single generated number, you’ve got a score of 1 for one of the three, 0 for the other two. Not very predictive, but keep going. With a sample size of 10, my results were H 6, T 3, and 1 undecided. Yours might be different, but notice that that’s already looking a lot closer to the true numbers. Give it 100 tries, and it’s probably even better. (Doing this three times with a different generator, my results were: 52 T, 44 H, 4 undecided; 48 T, 47 H, 5; and 57 T, 40 H, 3. Clearly, this RNG leans right.)
The larger the sample size, the more likely the sample will match the population. If you don’t mind a bit of math, then we can look at just how good a match we can get. The basic formula is e = 1 / sqrt(N)
, where N
is the sample size and e
is the margin of error. So, for our sample size of 100 above, the math says that our expected error is somewhere within 1/sqrt(100) = 1/10 = 0.1
, or 10% either way. Or, as the polls put it, ±10%. Most samples like this are assumed to be conducted at a 95% confidence interval; this basically means that there’s a 95% chance that the true results lie within that margin. (Note, however, that our third poll in the last paragraph didn’t. It’s an outlier. They happen.)
As counter-intuitive as it may seem, this formula doesn’t really depend on the population size at all, as long as the sample is sufficiently small in relation. For national polls surveying a thousand or so people, that assumption holds, so they can safely tout a margin of error of ±3% from their sample of 1,016.
The code
Now we’ll look at how you can do your own sampling. This isn’t just for opinion polls, though. Any kind of analytics could make use of sampling.
The basic function, in JavaScript, would look something like this:
/*
* Select `k` random choices from a population
*/
function sample(population, k) {
var psize = population.length;
var choices = new Set();
var result = [];
// Choose `k` elements from the population, without repeats
for (var i = 0; i < k; i++) {
var ch;
do {
ch = Math.trunc(Math.random() * psize);
} while (choices.has(ch))
choices.add(ch);
}
for (var c of choices) {
result.push(population[c]);
}
return result;
}
As always, this isn’t the only way to do what we’re trying to do, but it’s very close to what Python’s random.sample
function does, so the idea is sound. To get our sample, we generate a number of array indexes, and the Set
guarantees we won’t get any repeats. Our result is a new array containing only those elements we chose. We can then do whatever we need.
But how do we determine what sample size we need? Well, one way is to work backwards from the margin of error we want. Remember that this usually won’t depend on the size of the population.
/*
* Given a desired margin of error,
* find an appropriate sample size.
*/
function sampleSize(margin) {
return Math.round(1 / (margin*margin), 0);
}
This is nothing more than a rearrangement of our formula. Instead of saying e = 1 / sqrt(N)
, we move things around to solve for N
: N = 1 / e^2
. Rounding to the nearest integer is just for convenience.
In a nutshell, that’s all there is behind those opinion polls you’ll be hearing so much about over the coming months. Pick enough people at random, and you’ll get a picture of the general populace. It’ll be a blurry picture, but even that’s enough to make out some of the larger details.