Software internals: associative arrays and hashing

We’ve already seen a couple of options for storing sequences of data in this series. Early on, we looked at your basic array, for example, and then we later studied linked lists. Both of those are good, solid data structures. Linear arrays are fast and cheap; linked lists are great for storing an arbitrary amount of data for easy retrieval. There’s a reason these two form the backbone of programming.

They do have their downsides, however. Linear arrays and lists both suffer from a problem with indexing. See, a piece of data in an array is accessed by its index, a number you have to keep track of if you ever want to see that data again. With linked lists, thanks to their dynamic structure, you don’t even get that. Finding a value in a list, if you don’t already know where it is, means walking the whole chain node by node.

By association

What we need is a structure that gives us the flexibility of a list but the indexing capabilities of an array. Oh, and can we have an array whose indexes are keys we decide, instead of just numbers? Sure, and I bet you want a pony, too. But seriously, we can do this. In fact, your programming language of choice most likely already does it, and you just never knew. JavaScript calls them “objects”, Python “dictionaries”, C++ “maps”, but computer science generally refers to this structure as the associative array. “Associative”, because it associates one value with another. “Array” tells you that it functions more like a linear array, in terms of accessing data, than a list. But instead of computer-assigned numbers, the indexes for an associative array can be just about anything: integers (of your choosing), strings, even whole objects.

The main component of an associative array is the “mapping” between keys (the indices used to access the data) and the values (the actual data you’re storing in the array). One simple way to create this mapping is by making a list of key-value pairs, like so:

class AssociativeArray {
    constructor() {
        // Pairs are objects with 2 attributes
        // k: key, v: value
        this.pairs = []
    }

    get(key) {
        let value = null;
        for (let p of this.pairs) {
            if (p.k === key) {
                value = p.v;
                break;
            }
        }

        return value;
    }

    insert(key, value) {
        // Compare against null explicitly: a falsy stored value
        // like 0 or "" would fool a simple !this.get(key) check
        if (this.get(key) === null) {
            // Add value if key not already present
            this.pairs.push({k: key, v: value});
        } else {
            // *Update* value if key present
            this.pairs = this.pairs.map(
                p => (p.k === key) ?
                    {k: key, v: value} :
                    p
            );
        }
    }

    remove(key) {
        this.pairs = this.pairs.filter(p => (p.k !== key));
    }
}

(Note that I’m being all fancy and using ES6, or ES2015 if you prefer. I’ve also cheated a bit and used built-in Array functions for some operations. That also has the dubious benefit of making the whole thing look more FP-friendly.)

This does work. We could create a new instance of our AssociativeArray class, call it a, and add values to it like this: a.insert("foo", 42). Retrieving them is as easy as a.get("foo"), with a null result meaning no key was found. (If you want to store null values, you’d have to change this to throw an exception or something, which is probably better in the long run, anyway.)
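Incidentally, if you just want the behavior without the learning exercise, ES6 already ships a built-in Map with the same operations. A quick comparison:

```javascript
// ES6's built-in Map does the same job as our AssociativeArray sketch
const a = new Map();
a.set("foo", 42);
console.log(a.get("foo"));  // 42
a.set("foo", 99);           // setting an existing key updates it in place
console.log(a.get("foo"));  // 99
a.delete("foo");
console.log(a.get("foo"));  // undefined (not null) for a missing key
```

Note that Map signals a missing key with undefined rather than null, sidestepping the stored-null ambiguity entirely.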

The problem is, it’s not very fast. Our simple associative array is nothing more than a glorified linked list that just so happens to store two values at each node instead of one. For small amounts of data, that’s perfectly fine. If we need to store a lot of information and look it up later, then this naive approach won’t work. We can do better, but it’ll take some thought.

Making a hash of things

What if, instead of directly using the key value, we computed another value from it and used that instead? That might sound silly at first, but think about it. The problem with arbitrary keys is that they’re, well, arbitrary. We have no control over what crazy ideas users will come up with. But if we could create an easy-to-use number from a key, that effectively transforms our associative array into something closer to the familiar linear array.

That’s the thinking behind the hash table, one of the more popular “back ends” for associative arrays. Using a special function (the hash function), the table computes a simple number (the hash value or just hash) from the key. The hash function is chosen so that its output isn’t entirely predictable—though, of course, the same input will always return the same hash—which distributes the possible hash values in a random-looking fashion.

Hashes are usually the size of an integer, such as 32 or 64 bits. Obviously, using these directly is not an option for any realistic array, so some sacrifices have to be made. Most hash tables take a hybrid array/list approach. The hash “space” is divided into a number of buckets, chosen based on needs of space and processor time. Each bucket is a list (possibly even something like what we wrote above), and which bucket a value is placed in is determined by its hash. One easy option here is a simple modulus operation: a hash table with 256 buckets, for instance, would put a key-value pair into the bucket corresponding to the last byte (hash % 256) of its hash.
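That bucket selection can be sketched in a couple of lines. (bucketFor here is a hypothetical helper, not part of any library; the `>>> 0` is a JavaScript wrinkle, since bitwise results are signed and a negative hash would otherwise give a negative index.)

```javascript
// Picking a bucket by modulus: the low bits of the hash choose the bucket
const NUM_BUCKETS = 256;

function bucketFor(hash) {
    // >>> 0 coerces to an unsigned 32-bit integer, so a negative
    // hash can't produce a negative bucket index
    return (hash >>> 0) % NUM_BUCKETS;
}

console.log(bucketFor(0x12345678));  // 120, i.e. the last byte, 0x78
console.log(bucketFor(-1));          // 255, not -1
```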

In code, we might end up with something like this:

class HashTable {
    constructor() {
        this._numBuckets = 256;
        // Each bucket needs its own array: `new Array(n).fill([])`
        // would make every bucket share a single array object
        this._buckets = Array.from(
            {length: this._numBuckets}, () => []
        );
    }

    _findInBucket(key, bucket) {
        // Returns the index of the pair with the specified key
        // within the given bucket array, or -1 if not found
        return bucket.findIndex(p => (p.k === key));
    }

    _update(key, value, bucket) {
        for (let i = 0; i < this._buckets[bucket].length; i++) {
            if (this._buckets[bucket][i].k === key) {
                this._buckets[bucket][i].v = value;
                break;
            }
        }
    }

    set(key, value) {
        let h = hashCode(key);
        let bucket = h % this._numBuckets;
        let posInBucket =
            this._findInBucket(key, this._buckets[bucket]);

        if (posInBucket === -1) {
            // Not in table, insert
            this._buckets[bucket].push({k: key, v: value});
        } else {
            // Already in table, update
            this._update(key, value, bucket);
        }
    }

    get(key) {
        let h = hashCode(key);
        let bucket = this._buckets[h % this._numBuckets];
        let index = this._findInBucket(key, bucket);

        if (index > -1) {
            return bucket[index].v;
        } else {
            // Key not found
            return null;
        }
    }
}

The function

This code won’t run. That’s not because it’s badly written. (No, it’s totally because of that, but that’s not the point.) We’ve got an undeclared function stopping us: hashCode(). I’ve saved it for last, as it’s both the hardest and the most important part of a hash table.

A good hash function needs to give us a wide range of values, distributed with little correlation to its input values so as to reduce collisions, or inputs leading to the same hash. For a specific input, it also has to return the same value every time. That means no randomness, but the output needs to “look” random, in that it’s uniformly distributed. With a hash function that does this, the buckets will remain fairly empty; optimally, they’d each only have one entry. The worst-case scenario, on the other hand, puts everything into a single bucket, creating an overly complex linked list with lots of useless overhead.
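To see why that matters, here’s a quick sketch that counts bucket loads for two made-up hash functions: a degenerate constant hash (the worst case) and a simple multiply-and-add string hash. Both are hypothetical stand-ins for illustration, not production hashes.

```javascript
// Comparing bucket loads: worstHash collides on everything,
// simpleHash spreads keys across the buckets
function worstHash(key) {
    return 0;  // every key lands in bucket 0
}

function simpleHash(key) {
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
        hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
    }
    return hash;
}

function bucketLoads(hashFn, numBuckets, keys) {
    const loads = new Array(numBuckets).fill(0);
    for (const key of keys) {
        loads[hashFn(key) % numBuckets]++;
    }
    return loads;
}

const keys = Array.from({length: 1000}, (_, i) => "key" + i);
console.log(bucketLoads(worstHash, 8, keys));   // [1000, 0, 0, 0, 0, 0, 0, 0]
console.log(bucketLoads(simpleHash, 8, keys));  // roughly 125 per bucket
```

With worstHash, the table degrades into the glorified linked list we started with; with a decent spread, each bucket stays short and lookups stay fast.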

There are a lot of pre-written hash functions out there, each with its own advantages and disadvantages. Some are general-purpose, while others are specialized for a particular type of data. Rather than walk you through making your own (which is probably a bad idea, for the same reason that making your own RNG is), I’ll let you find one that works for you. Your programming language may already have one: C++, for one, has std::hash.
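Still, to make the HashTable sketch above actually runnable, here’s one possible stand-in: a simple string hash in the style of the well-known djb2 function. It’s fine for demonstration, but, per the warning above, don’t treat it as production-quality.

```javascript
// A demonstration-only string hash, loosely based on djb2:
// start from a prime-ish seed and mix in each character
function hashCode(key) {
    let hash = 5381;
    const s = String(key);  // coerce non-string keys
    for (let i = 0; i < s.length; i++) {
        // hash * 33, xor'd with the character code,
        // kept in unsigned 32-bit range
        hash = ((hash * 33) ^ s.charCodeAt(i)) >>> 0;
    }
    return hash;
}
```

It’s deterministic (the same key always hashes the same way) and returns a non-negative 32-bit integer, so the modulus trick from earlier works without surprises.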

Incidentally, you may have already seen hash functions “in the wild”. They’re fairly common even outside the world of associative arrays. MD5 and SHA-256, among others, are used as quick checksums for file downloads, as the uniform distribution principle of hashing causes small changes in a key to radically alter the final hash value. It’s also big (and bad) news when collisions are found in a hash algorithm; very recently—I can’t find the article, or I’d link it here—developers started warning about trusting the “short” hashes used by Git to mark commits, tags, and authors. These abbreviated forms are only a handful of hex digits—around 28 to 32 bits, instead of the full 160 bits of Git’s SHA-1 hashes—so collisions are a lot more likely, and it’s not too hard to pad out a malicious file with enough garbage to give it the same short hash as, say, the latest Linux kernel.

Summing up

Hash tables aren’t the only way to implement associative arrays. For some cases, they’re nowhere near the best. But they do fill the general role of being good enough most of the time. Computing hash codes is the most time-consuming part, followed by handling collisions and overfull buckets. Unfortunately for higher-level programmers, you don’t have a lot of control over any of these aspects, so optimizing for them is difficult. However, you rarely have to, as the library writers have done most of that work for you. But that’s what this series is about. You may never need to peer into the inner workings of an associative array, but now you’ve got an idea of what you’d find down there.

Summer Reading List 2016: Wrap-up

So it’s Labor Day, and we’ve reached the unofficial end of another summer. Last month, I posted my progress in my Summer Reading List challenge. I had read 2 out of 3 then, and I’ve since finished the third.

Literature

Title: New Atlantis
Author: Sir Francis Bacon
Genre: Fiction/literature
Year: 1627

Yes, you read that year right: this is a work almost four centuries old. You can find it at Project Gutenberg if you want to read it for yourself. That’s where I got it, and I’m glad I looked it up. Genre-wise, I’m not sure what to call it, so I went with the catchall of “literature”.

New Atlantis is what we’d call a short novella today, but that’s mainly because it was never truly finished. It’s also an incredibly interesting text for its vision. Written like many old stories purporting to be travelers’ tales, it describes the utopian land of Bensalem, supposedly located somewhere out in the Pacific. The inhabitants of that land are far advanced (compared to the 17th century) and living in a veritable paradise of wisdom and knowledge.

By my personal standards, however, it reads more like a dystopia: despite professing a very progressive separation of church and state, for example, Bensalem is hopelessly rooted in Christianity, to the point where even the Jews living there (the narrator meets one) lie somewhere in the “Jews For Jesus” range. The whole place seems to be governed in a very authoritarian manner, where societal norms are given force of law—or the other way around. Yes, Bacon describes a nation better than any he knew, but I would take modern America, with all its flaws, over the mythical New Atlantis every time.

But people today rarely look at those parts of the text. Instead, they’re more focused on what the scientists of Bensalem have done, and this is described in some detail at the very end of the work. Bacon’s goal here is to overwhelm us with the fantastic creations, but they read like a laundry list of the last hundred years. If you read it right, you can find airplanes, lasers, telephones, and all kinds of other things in there, all predicted centuries ago. And that is the real value of the book. It’s further proof that earlier ages did not lack for imagination; their relatively unadvanced state was through no fault of their minds. As an author myself, I find that information invaluable.

Next year?

I had fun with this whole thing. I read something I never would have otherwise, and I pushed myself outside my normal areas of interest. I’m not sure I’m ready to make this a regular, annual occurrence, but it seems like a good idea. I hope you feel the same way.

Alien grammars

When making an “alien” conlang (however you define that), it’s easy to take the phonology half, make it outrageous, and call it a day. But that’s only half the battle. There’s more to a language than its sounds, and if you’re designing a conlang for anything more than naming, you still need to look at the grammar, too.

So how can we make the grammar of a language “feel” otherworldly? As with the sounds, the easiest way is to violate the traditional expectations that we, as speakers of human languages, have developed. To do this, however, we need to know our preconceptions, and we also need to take a look at how grammar really works.

The foundation of grammar

I can’t claim to understand the mental underpinnings of language. I bought a book about the subject years ago, but I’ve never had the chance to read it. What follows comes from articles, other conlangers, and a healthy dose of critical thinking.

Language is a means of communication, but it’s also inextricably linked to cognition, to memory. The human brain is a wonderful memory device (if you discount those pesky problems of fuzzy recollection, dementia, etc.) that works in a fascinating way. At its core, it seems to be primarily an associative memory. In other words, we remember things by their association with what we already know. Our language reflects that. Nouns are things, verbs are actions, adjectives are states or conditions; not for nothing are they all taught by means of pictures and examples. Those build the associations we use to remember them.

Is it possible that an alien intelligence doesn’t work this way? Sure, but that would be so “out there” that I can’t begin to contemplate it. If you want to try, go ahead, but it won’t be easy. On the other hand, that’s one way to get something totally different from what we know. I just wouldn’t want to try and describe it, much less use it.

Moving on to “actual” linguistics, we’re used to the traditional parts of speech, the trinity of noun, verb, and adjective. On top of them, we often toss in adverbs, articles, prepositions, and the like, but those aren’t quite as “core” as the big three. Indeed, many languages get by just fine without articles, including Latin and Japanese. Adverbs are so nebulously defined that you can find words in any language that fit their category, but there are plenty of examples of languages using adjectives in their place. Prepositions (more generally, adpositions) aren’t entirely necessary, either; most of their function can be replaced by a large set of case markers.

But it seems awfully hard to ditch any of the big three. How would you make a language without verbs, for instance? As with the “pure functional” approach to computer programming, it would seem that nothing could be accomplished, since there’s no way to express change. Similarly, a “nounless” conlang couldn’t name anything. For adjectives, it’s not so bad, as state verbs can already take their place in a few natural languages, but it’s difficult to imagine a total lack of them.

That hasn’t stopped industrious conlangers from trying. Languages without verbs/nouns/adjectives are a perennial favorite in any circle. I can’t say I’ve attempted one myself, but I can see how they might work, and any of the three looks very alien to me.

  • Getting rid of adjectives is the easiest. As above, most can be replaced by state verbs. A phrase like “the red door”, for instance, might come out as something like “the door that is red” or “the door it is red”. The difference is that adjectives are often (but not always) marked as if they were nouns, while a state verb like this would instead be conjugated like any other verb in the language.

  • Dropping verbs is much harder. You can look into languages that lack copular verbs for examples here, though the same idea can be extended to most of the “predicating” verbs, like English “to have”, “to become”, etc. Pronouns, case markers, and liberal use of adjectives can take care of most of it, but it’ll definitely feel weird.

  • Throwing out nouns is next to impossible, in my opinion. Not to say you should give up your ambitions, but…I’m not sure I can help you here. A language without nouns may truly be beyond our comprehension. Perhaps it’s the language of some mystical or super-advanced race, or that of a hive mind which has no need for names. I honestly don’t know.

Alternate universal

Much simpler than tossing entire categories of words is just finding new ways to use them. Most (I emphasize this for a reason) languages of the world follow a set of linguistic universals, as laid out by linguist Joseph Greenberg. They don’t follow all of them, mind you, but it’s better than even odds. Some of the more interesting ones include:

  • #3: VSO languages are prepositional. This comes from their “head-first” word order, but it’s easy to invert.
  • #14: In conditional clauses, the conclusion (“then” part) normally follows the condition (“if” part). Even in English, it’s not hard to find counterexamples, if you know where to look. (See what I did there?) But it’s not the usual form. In an alien conlang, it could be.
  • #34: Languages with a dual number must also have a plural; those with a trial (three of something) have a dual and a plural. No real reason this has to be so, not for aliens. I’d like to know how you justify it, though.
  • #38: Any language with case can only have a zero marker for the case representing an intransitive subject—nominative, absolutive, etc. If you’ve got a different way of distinguishing cases, then there’s no reason you have to follow this rule, right?
  • #42: All languages have at least three person and two number distinctions for pronouns. Another one where it’s not too hard to see the “alien” point of view.

Conclusion

Grammar is a huge field, and we’ve barely taken the first step into it. Even if you don’t make something completely outlandish as in the above examples, you can still create an alien grammar out of more mundane building blocks. There are thousands of languages in the world, and many have rare or unique features that could find a home in your alien conlang. A number for “a few”? Sure, it works for the Motuna language of Papua New Guinea. Want a case for things you’re afraid of? A few of the Aboriginal languages of Australia can help you there…if there are any native speakers left alive when you start. The list goes on for just about anything you could think of. All you have to do is look, because, linguistically, aliens are among us.