Software internals: associative arrays and hashing

We’ve already seen a couple of options for storing sequences of data in this series. Early on, we looked at your basic array, for example, and then we later studied linked lists. Both of those are good, solid data structures. Linear arrays are fast and cheap; linked lists are great for storing an arbitrary amount of data without deciding on a size up front. There’s a reason these two form the backbone of programming.

They do have their downsides, however. Linear arrays and lists both suffer from a problem with indexing. See, a piece of data in an array is accessed by its index, a number you have to keep track of if you ever want to see that data again. With linked lists, thanks to their dynamic structure, you don’t even get that. Finding a value in a list means walking it node by node from the start, every single time, and that gets slow in a hurry.

By association

What we need is a structure that gives us the flexibility of a list but the indexing capabilities of an array. Oh, and can we have an array whose indexes are keys we decide, instead of just numbers? Sure, and I bet you want a pony, too. But seriously, we can do this. In fact, your programming language of choice most likely already does it, and you just never knew. JavaScript calls them “objects”, Python “dictionaries”, C++ “maps”, but computer science generally refers to this structure as the associative array. “Associative”, because it associates one value with another. “Array” tells you that it functions more like a linear array, in terms of accessing data, than a list. But instead of computer-assigned numbers, the indexes for an associative array can be just about anything: integers (of your choosing), strings, even whole objects.
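
Just to show how little ceremony that takes, here’s what it looks like with the tools JavaScript already gives you; a plain object covers string keys, and the ES6 Map accepts nearly anything as a key (a quick illustration, not part of what we’re building below):

let ages = {};
ages["alice"] = 30;            // plain object: string keys

let things = new Map();
things.set(42, "the answer");  // Map: numbers, strings, even whole objects
things.get(42);                // "the answer"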

The main component of an associative array is the “mapping” between keys (the indices used to access the data) and the values (the actual data you’re storing in the array). One simple way to create this mapping is by making a list of key-value pairs, like so:

class AssociativeArray {
    constructor() {
        // Pairs are objects with 2 attributes
        // k: key, v: value
        this.pairs = [];
    }

    get(key) {
        let value = null;
        for (let p of this.pairs) {
            if (p.k === key) {
                value = p.v;
                break;
            }
        }

        return value;
    }

    insert(key, value) {
        // Check by key, not by retrieved value: !this.get(key) would
        // also be true for stored values like 0 or false
        if (!this.pairs.some(p => p.k === key)) {
            // Add value if key not already present
            this.pairs.push({k: key, v: value});
        } else {
            // *Update* value if key present
            this.pairs = this.pairs.map(
                p => (p.k === key) ?
                    {k: key, v: value} :
                    p
            );
        }
    }

    remove(key) {
        this.pairs = this.pairs.filter(p => (p.k !== key));
    }
}

(Note that I’m being all fancy and using ES6, or ES2015 if you prefer. I’ve also cheated a bit and used built-in Array functions for some operations. That also has the dubious benefit of making the whole thing look more FP-friendly.)

This does work. We could create a new instance of our AssociativeArray class, call it a, and add values to it like this: a.insert("foo", 42). Retrieving them is as easy as a.get("foo"), with a null result meaning no key was found. (If you want to store null values, you’d have to change this to throw an exception or something, which is probably better in the long run, anyway.)
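
Spelled out as actual code, a quick session with the class looks like this:

let a = new AssociativeArray();
a.insert("foo", 42);
a.get("foo");        // 42
a.insert("foo", 43); // key already present, so this updates it
a.get("foo");        // 43
a.get("bar");        // null: no such key
a.remove("foo");     // and now "foo" is gone, too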

The problem is, it’s not very fast. Our simple associative array is nothing more than a glorified linked list that just so happens to store two values at each node instead of one. Every get and insert scans the pairs one by one, so lookup time grows in step with the number of entries. For small amounts of data, that’s perfectly fine. If we need to store a lot of information and look it up later, then this naive approach won’t work. We can do better, but it’ll take some thought.

Making a hash of things

What if, instead of directly using the key value, we computed another value from it and used that instead? That might sound silly at first, but think about it. The problem with arbitrary keys is that they’re, well, arbitrary. We have no control over what crazy ideas users will come up with. But if we could create an easy-to-use number from a key, that effectively transforms our associative array into something closer to the familiar linear array.

That’s the thinking behind the hash table, one of the more popular “back ends” for associative arrays. Using a special function (the hash function), the table computes a simple number (the hash value or just hash) from the key. The hash function is chosen so that its output isn’t entirely predictable—though, of course, the same input will always return the same hash—which distributes the possible hash values in a random-looking fashion.

Hashes are usually the size of a machine integer, 32 or 64 bits. Obviously, using them directly as array indexes is not an option for any realistic table (nobody wants to allocate billions of slots), so some sacrifices have to be made. Most hash tables take a hybrid array/list approach. The hash “space” is divided into a number of buckets, a count chosen to balance memory use against processor time. Each bucket is a list (possibly even something like what we wrote above), and which bucket a key-value pair lands in is determined by its hash. One easy option here is a simple modulus operation: a hash table with 256 buckets, for instance, would put a pair into the bucket corresponding to the last byte (hash % 256) of its hash.

In code, we might end up with something like this:

class HashTable {
    constructor() {
        this._numBuckets = 256;
        // Give every bucket its own array; fill([]) would make all
        // 256 slots share a single array object
        this._buckets = Array.from({length: this._numBuckets}, () => []);
    }

    _findInBucket(key, bucket) {
        // Returns element index of the pair with the specified key
        // within the given bucket, or -1 if it isn't there
        return this._buckets[bucket].findIndex(p => (p.k === key));
    }

    _update(key, value, bucket) {
        for (let i = 0; i < this._buckets[bucket].length; i++) {
            if (this._buckets[bucket][i].k === key) {
                this._buckets[bucket][i].v = value;
                break;
            }
        }
    }

    set(key, value) {
        let h = hashCode(key);
        let bucket = h % this._numBuckets;
        let posInBucket = this._findInBucket(key, bucket);

        if (posInBucket === -1) {
            // Not in table, insert
            this._buckets[bucket].push({k: key, v: value});
        } else {
            // Already in table, update
            this._update(key, value, bucket);
        }
    }

    get(key) {
        let h = hashCode(key);
        let bucket = h % this._numBuckets;
        let index = this._findInBucket(key, bucket);

        if (index > -1) {
            return this._buckets[bucket][index].v;
        } else {
            // Key not found
            return null;
        }
    }
}

The function

This code won’t run. That’s not because it’s badly written. (No, it’s totally because of that, but that’s not the point.) We’ve got an undeclared function stopping us: hashCode(). I’ve saved it for last, as it’s both the hardest and the most important part of a hash table.

A good hash function needs to give us a wide range of values, distributed with little correlation to its input values so as to reduce collisions, or inputs leading to the same hash. For a specific input, it also has to return the same value every time. That means no randomness, but the output needs to “look” random, in that it’s uniformly distributed. With a hash function that does this, the buckets will remain fairly empty; optimally, they’d each only have one entry. The worst-case scenario, on the other hand, puts everything into a single bucket, creating an overly complex linked list with lots of useless overhead.

There are a lot of pre-written hash functions out there, each with its own advantages and disadvantages. Some are general-purpose, while others are specialized for a particular type of data. Rather than walk you through making your own (which is probably a bad idea, for the same reason that making your own RNG is), I’ll let you find one that works for you. Your programming language may already have one: C++, for one, has std::hash.
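
That said, the HashTable above does need something to call, so here’s a bare-bones hashCode for string keys, a classic multiply-by-31 rolling hash. It’s only here to make the example runnable; quality-wise, it’s exactly the kind of homemade hash I just warned you about.

function hashCode(key) {
    let h = 0;
    for (let i = 0; i < key.length; i++) {
        // Math.imul keeps the multiplication within 32-bit integer range
        h = Math.imul(h, 31) + key.charCodeAt(i);
    }
    // Force an unsigned result so that (h % numBuckets) is never negative
    return h >>> 0;
}

With that in place, the table works as advertised:

let t = new HashTable();
t.set("foo", 42);
t.get("foo");   // 42
t.get("bar");   // null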

Incidentally, you may have already seen hash functions “in the wild”. They’re fairly common even outside the world of associative arrays. MD5 and SHA-256, among others, are used as quick checksums for file downloads, since the uniform-distribution principle of hashing means even a tiny change to the input radically alters the final hash value. It’s also big (and bad) news when collisions are found in a widely used hash algorithm; fairly recently (I can’t find the article, or I’d link it here) developers started warning against trusting the “short” hashes Git uses to abbreviate commits, tags, and other objects. Those are only a handful of hex digits long, 28 bits at the default abbreviation length, instead of the full 160-bit SHA-1 hash, so collisions are far more likely, and it’s not too hard to pad out a malicious file with enough garbage to give it the same short hash as, say, the latest Linux kernel.
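
If you have Node.js handy, its built-in crypto module makes the checksum idea easy to see for yourself: change one character of the input and the digest shares essentially nothing with the original.

const crypto = require('crypto');

crypto.createHash('sha256').update('hello').digest('hex');
// => a 64-character hex digest
crypto.createHash('sha256').update('hellp').digest('hex');
// => one character of input changed, and the digest is completely different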

Summing up

Hash tables aren’t the only way to implement associative arrays. For some cases, they’re nowhere near the best. But they do fill the general role of being good enough most of the time. Computing hash codes is the most time-consuming part, with handling collisions and overfull buckets following after that. Unfortunately for higher-level programmers, you don’t have a lot of control over any of these aspects, so optimizing for them is difficult. However, you rarely have to, as the library writers have done most of that work for you. But that’s what this series is about. You may never need to peer into the inner workings of an associative array, but now you’ve got an idea of what you’d find down there.
