Software internals: Floating-point

Integers are the basis for all computer calculation, but they’re not the only kind of numbers out there. Floating-point numbers are just as important when interfacing with the real world. They represent decimals, fractional quantities, anything other than simple whole numbers. But too many programmers use them without understanding them, and that tends to go horribly wrong.

First off, let’s get one link out of the way. What Every Computer Scientist Should Know About Floating-Point Arithmetic describes everything you need in more technical and precise terms than I ever could. It’s a good read, and it’s important stuff, so check it out if you want the gritty details.

Now, we’re fortunate to live in today’s world, because floating-point numbers are essentially standardized. IEEE 754 (along with some later revisions) defines a common floating-point format that’s used pretty much everywhere in modern tech. If it isn’t, then it’s still assumed to be. So we’ll base our discussion on that.

The theory

Floating-point numbers work a bit like scientific notation. In decimal, you can write something like 1.42 × 10^4^, and that’s understood to be the same as 14,200. But computers work in binary, so we need a binary form of scientific notation. In this case, for example, we can write 14,200 in binary as 11011101111000, or 1.1011101111 × 2^13^.

From there, we can create a way of packing this notation into a sequence of bits: floating-point numbers. What do we need? Well, each number can be either positive or negative, so we need some way of showing that. And we’ll have to store both the exponent (e.g., 13) and the mantissa (binary 1.1011101111). The base (2, for binary) can be implied, as we know we’re working with binary. Put those three parts—mantissa, exponent, and sign—together, and you’ve got floating-point.

The practice

But it’s not as easy as that, and that’s why we have standards. First, how many bits are you going to use? Too few, and you don’t have much range. Too many, and you waste space on inconsequential fractions. However, sometimes you need those less-significant bits, so you might want to have options. Luckily, the standard gives us two main options: 32 and 64 bits. Unluckily, a lot of programming languages (like JavaScript) limit you to the latter. Some, like C, give you the choice between float and double (the latter meaning “double precision”, because that’s about what you get with more bits), but high-level programmers often don’t have that luxury. Since the “big” high-level languages tend to use 64-bit floating-point, then, we’ll look at it first.

Given our 64 bits, we need to divide them up among our three parts. The sign bit obviously only needs one, so that’s that. Of the remaining 63, the standard devotes 53 to the mantissa and 11 to the exponent. That lets us store binary exponents over 1000 and the equivalent of about 15 decimal digits of precision. The range of normal numbers runs from roughly 10^-308^ to 10^308^, and a few special tricks (like denormalized numbers) stretch the small end down to about 10^-324^. Huge. (32-bit still nets you 7 digits in a range of 10^±38^, which isn’t too shabby.)

Now, those of you better at math may have noticed a calculation error above. That’s intentional. The way IEEE 754 works, it saves a bit by a clever ruse. In decimal scientific notation, as you may know, the number to the left of the decimal point can’t be zero, and it has to be less than 10. (Otherwise, you could shift the point left or right one more spot.) The same is true for binary, but with a binary 10, i.e., 2. But there’s only one binary number that fills that role: 1. With a few exceptions, you’re always going to have the 1, so why bother putting it in?

The problem with this “implied” 1 comes when you have the one number that has no 1 anywhere in it. That, of course, is 0. But it’s okay, because the standard simply makes 0, well, 0. Exponent zero, mantissa zero. Sign…well, that’s different. Standard floating-point representation has two zeroes: negative and positive. They’re treated as equal essentially everywhere, but they do differ in that one sign bit.

The IEEE standard also does an odd thing with its exponents. Except for the case of a literal 0, every exponent is biased. For 64-bit numbers, the number 1023 is added to the exponent, so a number like 2.5 (binary 10.1 or 1.01 × 2^1^) would be stored as if it were 1.01 × 2^1024^. Why? Because it makes sorting and comparison easier, or so they claim.
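
If you want to see those pieces for yourself, here’s a rough sketch in JavaScript (assuming a modern engine with typed arrays; dissect is just a made-up helper name) that pulls the sign, exponent, and top mantissa bits back out of a 64-bit float. For 2.5, it should report a biased exponent of 1024, just as described above.

function dissect(x) {
    // write the number into an 8-byte buffer, then read its raw bits back
    const view = new DataView(new ArrayBuffer(8));
    view.setFloat64(0, x);

    const high = view.getUint32(0);          // sign, exponent, top of mantissa
    const sign = high >>> 31;                // 1 bit
    const exponent = (high >>> 20) & 0x7ff;  // 11 bits, still biased by 1023
    const mantissaHigh = high & 0xfffff;     // top 20 of the 52 stored bits

    return { sign, exponent, mantissaHigh: mantissaHigh.toString(2) };
}

dissect(2.5);   // sign: 0, exponent: 1024 (that's 1 + the 1023 bias)
dissect(-0.0);  // sign: 1, exponent: 0 -- the "other" zero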

In the rare event that you go outside the range, you get to infinity. Like zero, we’ve got two forms of that, one for either sign, but they’re considered nearly the same.

And then there’s NaN. This is a special value used mainly to make programmers scream, but it also represents invalid results like dividing zero by zero or taking the square root of a negative number. (Dividing a nonzero number by zero, on the other hand, gets you infinity.) NaN is special in that it’s a whole class of values (anything with all the exponent bits set to 1 and a nonzero mantissa), and no two of them are ever considered equal. NaN equals nothing, not even another NaN. It’s a null value and an error code at the same time, which is where things inevitably go wrong.

Care and feeding

NaN, though, is only one of the pitfalls of using floating-point. You also have to watch out for infinities, since they don’t play nice with finite numbers. Also, unless you have a really good reason for doing so (such as being John Carmack), you probably don’t want to mess with the bits themselves.

More important than knowing how to use floating-point numbers is when to use them. Or, rather, when not to. They do give you precision, often more than you need, but sometimes that’s not enough. Take the classic example of 1/3. In decimal, it’s an endless string of 3s. Binary changes that to a repeating pattern of 01, but the principle is the same. No matter how many digits or bits you’ve got, you’re never getting to the end. So the simple code 1.0 / 3.0 will never give you exactly 1/3. It can’t. The same goes for any other fraction whose denominator isn’t exactly a power of two. So, if you need exact representation of an arbitrary rational number, floating-point won’t help you.

For 1/100, it’s no different, and that’s why floating-point isn’t a great idea for money, either. Sure, for most simple purposes, it’s close enough, but those tiny errors do add up, especially when multiplication and division get involved. If you’re serious about your money, you won’t be storing how much you have in a floating-point number. Instead, you’ll likely want a decimal type, something a lot of business-oriented languages offer.
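
A couple of lines of JavaScript make the point; the exact digits you see depend on how your environment prints floats, but the comparisons won’t lie.

console.log(0.1 + 0.2);          // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3);  // false

// "ten cents", added ten thousand times, drifts away from an even 1000
let total = 0;
for (let i = 0; i < 10000; i++) {
    total += 0.10;
}
console.log(total === 1000);     // false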

In the general case, however, floating-point is the solution. You just have to know its limitations.

Software internals: Associative arrays and hashing

We’ve already seen a couple of options for storing sequences of data in this series. Early on, we looked at your basic array, for example, and then we later studied linked lists. Both of those are good, solid data structures. Linear arrays are fast and cheap; linked lists are great for storing an arbitrary amount of data for easy retrieval. There’s a reason these two form the backbone of programming.

They do have their downsides, however. Linear arrays and lists both suffer from a problem with indexing. See, a piece of data in an array is accessed by its index, a number you have to keep track of if you ever want to see that data again. With linked lists, thanks to their dynamic structure, you don’t even get that. Finding a value in a list, if you don’t already know what you’re looking for, is practically impossible.

By association

What we need is a structure that gives us the flexibility of a list but the indexing capabilities of an array. Oh, and can we have an array whose indexes are keys we decide, instead of just numbers? Sure, and I bet you want a pony, too. But seriously, we can do this. In fact, your programming language of choice most likely already does it, and you just never knew. JavaScript calls them “objects”, Python “dictionaries”, C++ “maps”, but computer science generally refers to this structure as the associative array. “Associative”, because it associates one value with another. “Array” tells you that it functions more like a linear array, in terms of accessing data, than a list. But instead of computer-assigned numbers, the indexes for an associative array can be just about anything: integers (of your choosing), strings, even whole objects.

The main component of an associative array is the “mapping” between keys (the indices used to access the data) and the values (the actual data you’re storing in the array). One simple way to create this mapping is by making a list of key-value pairs, like so:

class AssociativeArray {
    constructor() {
        // Pairs are objects with 2 attributes
        // k: key, v: value
        this.pairs = []
    }

    get(key) {
        let value = null;
        for (let p of this.pairs) {
            if (p.k === key) {
                value = p.v;
                break;
            }
        }

        return value;
    }

    insert(key, value) {
        // findIndex avoids treating stored falsy values (0, "", false)
        // as if the key were missing
        const index = this.pairs.findIndex(p => (p.k === key));

        if (index === -1) {
            // Add value if key not already present
            this.pairs.push({k: key, v: value});
        } else {
            // *Update* value if key present
            this.pairs[index] = {k: key, v: value};
        }
    }

    remove(key) {
        this.pairs = this.pairs.filter(p => (p.k !== key));
    }
}

(Note that I’m being all fancy and using ES6, or ES2015 if you prefer. I’ve also cheated a bit and used builtin Array functions for some operations. That also has the dubious benefit of making the whole thing look more FP-friendly.)

This does work. We could create a new instance of our AssociativeArray class, call it a, and add values to it like this: a.insert("foo", 42). Retrieving them is as easy as a.get("foo"), with a null result meaning no key was found. (If you want to store null values, you’d have to change this to throw an exception or something, which is probably better in the long run, anyway.)

The problem is, it’s not very fast. Our simple associative array is nothing more than a glorified linked list that just so happens to store two values at each node instead of one. For small amounts of data, that’s perfectly fine. If we need to store a lot of information and look it up later, then this naive approach won’t work. We can do better, but it’ll take some thought.

Making a hash of things

What if, instead of directly using the key value, we computed another value from it and used that instead? That might sound silly at first, but think about it. The problem with arbitrary keys is that they’re, well, arbitrary. We have no control over what crazy ideas users will come up with. But if we could create an easy-to-use number from a key, that effectively transforms our associative array into something closer to the familiar linear array.

That’s the thinking behind the hash table, one of the more popular “back ends” for associative arrays. Using a special function (the hash function), the table computes a simple number (the hash value or just hash) from the key. The hash function is chosen so that its output isn’t entirely predictable—though, of course, the same input will always return the same hash—which distributes the possible hash values in a random-looking fashion.

Hashes are usually the size of an integer, such as 32 or 64 bits. Obviously, using these directly is not an option for any realistic array, so some sacrifices have to be made. Most hash tables take a hybrid array/list approach. The hash “space” is divided into a number of buckets, chosen based on needs of space and processor time. Each bucket is a list (possibly even something like what we wrote above), and which bucket a value is placed in is determined by its hash. One easy option here is a simple modulus operation: a hash table with 256 buckets, for instance, would put a key-value pair into the bucket corresponding to the last byte (hash % 256) of its hash.

In code, we might end up with something like this:

class HashTable {
    constructor() {
        this._numBuckets = 256;
        // each bucket needs its own array; a single fill([]) would make
        // every bucket share one array object
        this._buckets = Array.from({length: this._numBuckets}, () => []);
    }

    _findInBucket(key, bucket) {
        // Returns element index of pair with specified key (-1 if absent)
        return this._buckets[bucket].findIndex(p => (p.k === key));
    }

    _update(key, value, bucket) {
        for (let i = 0; i < this._buckets[bucket].length; i++) {
            if (this._buckets[bucket][i].k === key) {
                this._buckets[bucket][i].v = value;
                break;
            }
        }
    }

    set(key, value) {
        let h = hashCode(key);
        let bucket = h % this._numBuckets;
        let posInBucket = this._findInBucket(key, bucket);

        if (posInBucket === -1) {
            // Not in table, insert
            this._buckets[bucket].push({k: key, v: value});
        } else {
            // Already in table, update
            this._update(key, value, bucket);
        }
    }

    get(key) {
        let h = hashCode(key);
        let bucket = h % this._numBuckets;
        let index = this._findInBucket(key, bucket);

        if (index > -1) {
            return this._buckets[bucket][index].v;
        } else {
            // Key not found
            return null;
        }
    }
}

The function

This code won’t run. That’s not because it’s badly-written. (No, it’s totally because of that, but that’s not the point.) We’ve got an undeclared function stopping us: hashCode(). I’ve saved it for last, as it’s both the hardest and the most important part of a hash table.

A good hash function needs to give us a wide range of values, distributed with little correlation to its input values so as to reduce collisions, or inputs leading to the same hash. For a specific input, it also has to return the same value every time. That means no randomness, but the output needs to “look” random, in that it’s uniformly distributed. With a hash function that does this, the buckets will remain fairly empty; optimally, they’d each only have one entry. The worst-case scenario, on the other hand, puts everything into a single bucket, creating an overly complex linked list with lots of useless overhead.

There are a lot of pre-written hash functions out there, each with its own advantages and disadvantages. Some are general-purpose, while others are specialized for a particular type of data. Rather than walk you through making your own (which is probably a bad idea, for the same reason that making your own RNG is), I’ll let you find one that works for you. Your programming language may already have one: C++, for one, has std::hash.
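
That said, the HashTable sketch above calls an undefined hashCode(), so here’s one minimal possibility, a djb2-style string hash, just to make it run. Treat it as a placeholder, not a recommendation.

function hashCode(key) {
    const str = String(key);
    let hash = 5381;

    for (let i = 0; i < str.length; i++) {
        // multiply by 33, mix in the character code, and keep the result
        // as an unsigned 32-bit integer (so the bucket modulus stays positive)
        hash = ((hash * 33) ^ str.charCodeAt(i)) >>> 0;
    }

    return hash;
}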

Incidentally, you may have already seen hash functions “in the wild”. They’re fairly common even outside the world of associative arrays. MD5 and SHA-256, among others, are used as quick checksums for file downloads, as the uniform distribution principle of hashing causes small changes in a key to radically alter the final hash value. It’s also big (and bad) news when collisions are found in a hash algorithm; fairly recently—I can’t find the article, or I’d link it here—developers started warning about trusting the “short” hashes used by Git to mark commits, tags, and authors. These are only a handful of hex digits (around 28 to 32 bits), instead of the full 160-bit SHA-1, so collisions are a lot more likely, and it’s not too hard to pad out a malicious file with enough garbage to give it the same short hash as, say, the latest Linux kernel.

Summing up

Hash tables aren’t the only way to implement associative arrays. For some cases, they’re nowhere near the best. But they do fill the general role of being good enough most of the time. Computing hash codes is the most time-consuming part, with handling collisions and overfull buckets following after that. Unfortunately for higher-level programmers, you don’t have a lot of control over any of these aspects, so optimizing for them is difficult. However, you rarely have to, as the library writers have done most of that work for you. But that’s what this series is about. You may never need to peer into the inner workings of an associative array, but now you’ve got an idea of what you’d find down there.

Software internals: Trees

We’ve talked quite a bit about data structures in this series, although not as much in recent entries. We’ve seen lists and arrays, two of the most important structures around, but they both have a severe limitation: they’re linear. While that’s great for a lot of data that computer programs manipulate, there comes a time when you need to represent both the data and its organization, and that calls for something more substantial.

The tree is a data structure that represents a hierarchy. Like a real tree, it has a root. It has branches and leaves. And a collection of data trees is also called a “forest”. But those similarities are a bit artificial; the terms were chosen for the analogy. What these trees do isn’t much like the natural ones. To a programmer, however, they’re far more useful.

In the abstract

A tree starts with a root. This is a data node, like one in a linked list, and it can hold one value of any type. If you’ll recall, nodes in linked lists also carry along a pointer to the following node (and the preceding one, for a doubly-linked list). Well, trees do something like that, too. The root node, in addition to its data value, also contains a list of pointers to its children.

These children are root nodes, and thus trees in their own right. Each will have its own data and list of children, and those, in turn, will have the same. If the child list is empty, then that node is at the “end” of one branch; it’s a leaf node. (If there are children, you might be wondering, are there parents? Yes. A node having another node in its child list is that second node’s parent. But data trees are asexual: each child only has one parent.)

Trees are a perfect example of recursion. A tree’s children are also trees (leaf nodes are just trees with no children of their own), and they can be treated as such. The most common algorithms lend themselves naturally to a recursive style, even in imperative languages. So that’s how we’ll be looking at them here.

Growing a tree

Trees consist of nodes, and each node has two main parts: a value and a list of children. Thus, a simple way of representing a tree is as an object (or similar structure) with two fields, like so. (This example uses TypeScript for illustration purposes.)

class TreeNode {
    value: any;
    children: TreeNode[];
}

Not that hard, huh? It’s nothing more than putting our words into code. A TreeNode‘s value could be anything, though you’d probably restrict it in a real application, and its children are more TreeNodes. Adding, removing, and altering children works in almost the same way as with linked lists, at least with this “basic” tree setup.

Traversing the tree (sometimes called walking it) is a different matter. Here, we want to go through each of the root’s children, then their children, and so on, until we’ve visited every node in the tree. There are two main ways to do this, but the depth-first method is more common and easier to explain. In code, it looks about like this:

function traverse(tree, callback) {
    // visit everything below `tree`, depth-first; note that the root
    // node itself is never passed to the callback as written
    for (var child of tree.children) {
        traverse(child, callback);
        callback(child);
    }
}

This algorithm is “depth-first” because it visits each of a node’s children before moving to its siblings. This is where the recursion comes into play. We loop through each of the children of the root node. They, in turn, become the roots of a new depth-first traversal. Only when we’ve exhausted the “depth” of the tree—when we reach a leaf node—do we start doing anything. (We’re assuming here that our tree doesn’t have any cycles, child nodes that point to higher levels of the tree. Those make things much more difficult.)

Now, there are a lot of things you can do with this function, but I’ve limited it to a simple, nebulous callback that is run for each node. In the abstract, that’s all there is to it. That simple block of code effectively describes the operation of some code generators, AI systems, and many other complex actions.
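
For instance, feeding it a small, made-up tree shows the order of the visits (and the fact that, as written, the root’s own value never reaches the callback):

const tree = {
    value: "root",
    children: [
        { value: "a", children: [
            { value: "b", children: [] },
            { value: "c", children: [] }
        ]}
    ]
};

traverse(tree, node => console.log(node.value));  // logs "b", "c", "a"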

Special kinds of trees

The general tree described above suffices for a great many applications. Sometimes, though, it’s more efficient to be more restricted. Computer scientists have therefore developed a number of specializations of the tree for different uses.

The binary tree is probably the most well-known of these. It’s called “binary” because each node has at most two children: left and right. These children are themselves trees, of course, but empty trees are often represented as null values. Here’s what it looks like in code, once again using TypeScript:

class BinaryTree {
    data: any;
    left: BinaryTree;
    right: BinaryTree;
}

Binary trees add an extra wrinkle to the depth-first traversal we saw earlier. Now, we have three possible ways to do things: pre-order, in-order, and post-order traversal. The only thing that changes is when we process the “current” node.

function preOrder(node, callback) {
    if (node == null) return;

    callback(node.data);
    preOrder(node.left, callback);
    preOrder(node.right, callback);
}

function inOrder(node, callback) {
    if (node == null) return;

    inOrder(node.left, callback);
    callback(node.data);
    inOrder(node.right, callback);
}

function postOrder(node, callback) {
    if (node == null) return;

    postOrder(node.left, callback);
    postOrder(node.right, callback);
    callback(node.data);
}

Each approach has its pros and cons. Pre-order works for copying trees, and it’s essentially how expressions are represented in abstract syntax trees or ASTs, an important part of compilers and interpreters. In-order traversal is best for sorted trees, where an extra “key” value is added to each node, arranged so that it’s greater than all the keys in its left subtree but smaller than all those in the right; these binary search trees are used for set structures, lookup tables, and the like. (Maybe we’ll look at them in more detail later in the series.) Finally, post-order is the option of choice for deleting a node’s children and for making RPN expressions.
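
To see the difference, here’s a tiny, made-up binary search tree holding the keys 1 through 5, and the order each traversal visits them:

const tree = {
    data: 3,
    left:  { data: 1, left: null, right: { data: 2, left: null, right: null } },
    right: { data: 5, left: { data: 4, left: null, right: null }, right: null }
};

preOrder(tree, x => console.log(x));   // 3, 1, 2, 5, 4
inOrder(tree, x => console.log(x));    // 1, 2, 3, 4, 5 -- sorted, as promised
postOrder(tree, x => console.log(x));  // 2, 1, 4, 5, 3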

The binary tree also has a few of its own alterations. The red-black tree uses a “color bit” to keep the tree balanced. This is helpful because most algorithms working on binary trees work best when the total numbers of left and right nodes are about equal. (In the worst case, all of a tree’s descendants are on one “side”. Then, you’re left with just a more expensive linked list.) The AVL tree is a similar variation that has its own pluses and minuses.

B-trees are a special case that sits between binary trees and the general category of trees. They’re allowed to have more than two children, but only up to a certain number. The trick is that each node is given a number of sort keys (like the one in a binary search tree), one for every child past the first. These are calculated so they fit “between” the children, allowing them to serve as index values, speeding up traversal. B-trees are mostly used in databases and filesystems—the two are practically the same thing these days—including the Windows filesystem of choice, NTFS, Apple’s HFS+, and Ext4 and Btrfs on Linux.

Seeing the forest

Trees are everywhere in programming. Whenever you have a set of data with obvious parent-child relationships, it’s probably going to be represented in tree form. Programming languages often don’t offer direct access to the trees, but they’re in there. Sets, maps, and hash tables all use them under the hood. Any kind of parsing, such as XML or a game’s custom scripting language, is going to be tree-based. Even if you never see them, you’ll be using them somewhere.

Fortunately, if you’re not dealing with the innards of the structures themselves, there’s not much you need to worry about. All the hard work has been taken care of. No one needs to manually balance a binary search tree, for instance. But if you work with them directly, it helps to know how they work.

Software internals: Searching

Big Data is all the rage today, but what good is all that data if you can’t find what you’re looking for? Google, among others, has made billions of dollars by using and developing search algorithms, the backbone of analytics. While I couldn’t even begin to show you how their algorithms work (nor would you likely ever need to know), we can look at their precursors, the simple searches common to many programming languages.

Brute force

The simplest way to search for an item in a list (a value in an array, a substring in a larger body of text, or whatever) is to loop through the list and check each value to see if it’s the one you want. This is a linear search, and it’s not hard to implement, even in JavaScript:

function linear_search(array, value) {
    for (var i = 0; i < array.length; i++) {
        if (array[i] == value) {
            return i;
        }
    }

    return -1;
}

There’s more than one way to do it, but that’s the most straightforward. It’s also the “classic” form familiar to programmers of C and related languages. But it comes at a price: execution time. Linear searches are O(n); the time they take is dependent on the number of items you’re searching over. (Statistically speaking, a linear search of a list of size N will, on average, take N/2 iterations.) Surely we can do better, right?

Divide and conquer

If our list is sorted, we can take an approach that you might recognize from “higher or lower” number guessing games. The binary search starts at a point somewhere in the middle of a list and checks to see if it’s at the right spot. If it’s not, then—because the list is sorted—it knows which way to go for the next guess. It looks a bit like this:

function binary_search(array, value) {
    var idmin = 0;
    var idmax = array.length - 1;

    while (idmin <= idmax) {
        // Note: older versions of JS use Math.floor() here
        var midpoint = idmin + Math.trunc((idmax - idmin) / 2);

        if (array[midpoint] == value) {
            return midpoint;
        }
        else if (array[midpoint] < value) {
            idmin = midpoint + 1;
        }
        else {
            idmax = midpoint - 1;
        }
    }

    return -1;
}

Again, there are other ways of writing this function, but the one shown here works for our purposes.

The binary search requires a sorted array. That’s how it’s able to deduce where in the list, relative to its pointer, the correct value is. If you ignore the sorting step (probably because you were going to do it anyway), then it comes out to O(log n) time. That’s faster than a simple linear search, and good enough to put a version of the algorithm into the standard libraries of C (bsearch()), C++ (std::binary_search()), and Java (Arrays.binarySearch()), among others.
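
A quick, made-up check of both functions shows how they’re called (and why the binary version insists on sorted input):

const primes = [2, 3, 5, 7, 11, 13];  // already sorted

linear_search(primes, 11);  // 4
binary_search(primes, 7);   // 3
binary_search(primes, 4);   // -1 (not present)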

Substrings

Searching strings for a specific substring is its own problem with its own considerations. If the substring you’re looking for is only a single character, then the whole exercise degenerates into a case of the above, and you can use linear search (as C’s strchr() likely does behind the scenes). But searching for multi-character strings is a bit harder. A naive approach might look like this:

function substring_search(text, ss) {
    var initial = ss[0];
    var found = false;

    for (var i = 0; i < text.length - (ss.length - 1); i++) {
        if (text[i] == initial) {
            found = true;
            for (var j = 1; j < ss.length && found; j++) {
                if (text[i+j] != ss[j]) {
                    found = false;
                }
            }

            if (found) {
                return i;
            }
        }
    }

    return -1;
}

Basically, this is a linear search for the first character of the search string, but with a twist. Whenever that first character is found, an inner loop runs, checking the following characters in the text to see if they match as well. If they do, then we’ve got a perfect match. Otherwise, we keep going from where we left off.

This “dumb” approach takes O(n·m) time in the worst case, and it’s hard to improve on it without creating some more complicated data structures. Fortunately, that’s exactly where the series is going next, so stay tuned.

Software internals: Sorting

We’ve looked at quite a few data structures in this series, from simple arrays to objects. Now, let’s turn our focus to algorithms. One class of algorithms is very important to high-level programming: the sorting algorithms. For decades, computer scientists have been developing new ways to sort data, trying to balance the needs of size and speed, and there’s no silver bullet. Some sorting algorithms are good for general use, but they fall apart on certain degenerate cases. Others are bad all around, but they’re interesting from a teaching perspective.

In this post, we’ll look at three specific algorithms that illustrate the evolution of sorting and some of its trade-offs. Mostly, we’ll treat each one as working on a simple array of integers that we want sorted in increasing order, but it’s easy enough to generalize the sort function to whatever you need. Also, the code will all be in JavaScript, but I won’t be using any trickery or obfuscation, so it shouldn’t be too hard to convert to your favorite language. (Of course, most languages—JavaScript included—already have sorting functionality built into their standard libraries, so it’s better to use those than to write your own.)

Simple and stupid

Bubble sort is the first sorting algorithm many coders see. It’s stupidly simple: move through a list, comparing each element to the one before it. If they’re out of order, swap them. When you get to the end, start over, and repeat until the list is sorted. In code, it might look like this:

function bubblesort(arr) {
    var len = arr.length;

    do {
        var swapped = false;

        // run through the list
        // compare each element to the last
        // swap those that are out of order
        for (var i = 1; i < len; i++) {
            if (arr[i] < arr[i-1]) {
                // swap elements
                var temp = arr[i];
                arr[i] = arr[i-1];
                arr[i-1] = temp;
                swapped = true;
            }
        }

        // optimization (see below)
        len--;

        // repeat until everything's sorted
    } while (swapped);
}

Even this is already optimized a bit from the basic bubble sort method. Because of the way higher values “bubble up” to the end of the list (hence the name “bubble sort”), we know that, after n iterations, the last n values will be sorted. Thus, the line len-- tells the loop to ignore those values.

That optimization doesn’t help the overall performance of bubble sort. It remains O(n²), not very good. But bubble sort’s value is in its simplicity. You wouldn’t use it in the real world, but it’s a good way to show how sorting works. It’s easy to follow along with it. Take a handful of small numbers and try sorting them by hand, maybe even on paper. It’s not that hard.

Quick and dirty

For a few decades, the go-to choice for sorting has been quicksort. It’s the reason that C’s standard sorting function is called qsort, and it’s still used in many languages, both for performance and simplicity. (Haskell lovers will be quick to point out that that language can implement quicksort in two lines of code.)

Quicksort works by a recursive approach that boils down to “divide and conquer”. Imagine a list as a line of values. Now pick one of those values as the pivot. Take everything less than the pivot and sort it, then take everything greater, and sort that. (This is the recursive step.) Your sorted list is everything in the first sublist, then all the values equal to the pivot, then the second sublist. Or, in code:

function quicksort(arr) {
    if (arr.length <= 1) {
        // recursion base case
        return arr;
    }

    // find the pivot value
    var pivotIndex = Math.floor(arr.length / 2);
    var pivotValue = arr[pivotIndex];

    // these will hold our sublists
    var left = [], right = [], pivots = [];

    // partition the list into three:
    // 1. less than the pivot
    // 2. greater than the pivot
    // 3. equal to the pivot
    for (var i = 0; i < arr.length; i++) {
        if (arr[i] < pivotValue) {
            left.push(arr[i]);
        } else if (arr[i] > pivotValue) {
            right.push(arr[i]);
        } else {
            pivots.push(arr[i]);
        }
    }

    // the sorted list is (left + pivots + right)
    return quicksort(left).concat(pivots).concat(quicksort(right));
}

That’s essentially an expanded version of the Haskell two-liner. It’s not the best from a memory or speed standpoint, but it does show how the algorithm works. Another approach works in-place, directly operating on the array by swapping elements around so that all the values less than the pivot are placed before it, then putting those greater after it, and then recursing on the resulting partially-sorted list. That one is a lot faster, but it’s a bit harder to grasp. It also needs either a helper function or a bit of logic to allow both sorting of an entire list and of a portion of one.
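
Here’s a hedged sketch of that in-place variant, using the simple Lomuto partition scheme (it picks the last element as its pivot). The optional lo and hi parameters are the “bit of logic” that lets the same function sort either a whole array or just a slice of one.

function quicksortInPlace(arr, lo = 0, hi = arr.length - 1) {
    if (lo >= hi) {
        return arr;
    }

    // partition: everything before `boundary` is less than the pivot
    const pivot = arr[hi];
    let boundary = lo;

    for (let i = lo; i < hi; i++) {
        if (arr[i] < pivot) {
            [arr[i], arr[boundary]] = [arr[boundary], arr[i]];
            boundary++;
        }
    }

    // drop the pivot between the two partitions
    [arr[hi], arr[boundary]] = [arr[boundary], arr[hi]];

    // recurse on each side, skipping the pivot itself
    quicksortInPlace(arr, lo, boundary - 1);
    quicksortInPlace(arr, boundary + 1, hi);
    return arr;
}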

The gains from that added complexity are huge, though. With it, quicksort becomes one of the faster sorting methods around, and its space efficiency (with the in-place version) can’t be beat. That’s why quicksort remains so popular, even despite some well-known shortcomings. It’s good enough for most purposes.

Good all around

The last of the “simple” sorts we’ll look at is merge sort. This one is very much like quicksort in that it uses a strategy of repeatedly subdividing a list, but it works without a pivot element. At each step, it breaks the list in half and sorts each half separately. (A list with only one element is sorted by definition, so that’s the stopping point.) Then, it merges those halves an element at a time. Here’s the code:

function mergesort(arr) {
    // An array with 0 or 1 elements is always sorted
    if (arr.length < 2) {
        return arr;
    }

    // break the list into halves
    var middle = Math.floor(arr.length / 2);
    var left = arr.slice(0, middle);
    var right = arr.slice(middle);

    // sort each half separately
    left = mergesort(left);
    right = mergesort(right);

    // now merge the halves
    // take the 1st element of each array, and compare
    // the lower one moves into the "result" list
    // repeat until there's nothing left
    var result = [];
    while (left.length && right.length)
    {
        if (left[0] <= right[0])
        {
            result.push(left.shift());
        } else {
            result.push(right.shift());
        }
    }
    // add in anything we didn't get
    // (in case we had uneven lists)
    result = result.concat(left).concat(right);

    return result;
}

Merge sort uses more space than a good quicksort, but its performance is more predictable (its worst case stays at O(n log n), where quicksort can degrade to O(n²)), and it’s especially good for sorting linked lists. It isn’t the fastest overall, however, nor is it the best for tight memory situations. But it’s a happy medium, all else being equal. Not that it ever is.
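
Before moving on, here’s a quick sanity check with made-up numbers showing how the three functions above get used. Note that bubblesort works in place, while quicksort and mergesort hand back new arrays.

let nums = [5, 2, 9, 1, 5, 6];

bubblesort(nums);
console.log(nums);                           // [1, 2, 5, 5, 6, 9]

console.log(quicksort([5, 2, 9, 1, 5, 6]));  // [1, 2, 5, 5, 6, 9]
console.log(mergesort([5, 2, 9, 1, 5, 6]));  // [1, 2, 5, 5, 6, 9]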

Never enough

These aren’t the only sorting algorithms around. They’re merely the easiest to explain. Heapsort is another good one that gets used a lot. Radix sort works best with a lot of data that can be easily indexed. And there are many more than that. There’s even a site that has more details and graphical visualizations of many of the algorithms, including those we’ve discussed in this post.

Of course, it’s very likely that you’ll never need to implement a sorting algorithm yourself. Almost every language already has dozens of them implemented either in the standard library or in open 3rd-party code. Most are even fine-tuned for you. But this whole series is about peeling back those layers to see how things work on the inside. Calling array.sort() is easy, but there are times when it might not be the best option. Learning the algorithms—how they work, their advantages and disadvantages—can give you the knowledge to find those spots when the “standard” sort is a bad idea. That’s a rare case, but finding it just once pays for itself.

Next up, we’ll look at the other side of sorting: searching. With Google and “Big Data”, searching has become more important than ever, so stay tuned.

Software internals: Lists

Up to now, we’ve been looking at the simplest of data structures. Arrays are perfectly linear. Strings are mostly so, until you add in the complications of modern character sets. Objects need a little bit more help, but they can often be reduced to arrays of data. Not so this time around.

Lists are incredibly important to a lot of programming work. In today’s popular languages, they’ve essentially taken over the fundamental position once occupied by arrays. In Python, Ruby, and so many other languages, it’s the list that is the main sequence. In Lisp and Clojure, it’s the universal one, in the same way that NAND is the universal gate in electronics: everything else can be built from it.

But why are lists so useful? Unlike arrays, they lend themselves to dynamic allocation. Elements packed into an array don’t have anywhere to go. If you want to resize the structure, you often have to move everything around. In C++, for example, a vector (essentially a dynamically-sized array) can have elements added and removed, but that can trigger a full resizing, which can move the vector to a different address in memory, thus rendering all your pointers and iterators useless. Higher-level languages don’t have these problems with arrays, but lists never have them in the first place. They’re made to be changed.

Linked lists

The most common method for making a list is the linked list. At its core, a linked list is nothing but a sequence of nodes. A node is typically defined as a structure with a data member (one piece of the data you’re storing in the list) and a link member. This will usually be a pointer to the next node in the list, or, for the last item, a null pointer. In code, it might look like this:

// "It" is the item type
template <typename It>
struct Node
{
    It data;
    Node<It>* next;
};

A list, then, could be represented as nothing more than a pointer to the first Node. By “walking” through each next pointer, a function can visit each node in order. And that’s all there really is to it.

Working with this kind of linked list isn’t too hard. Finding its length, a common operation in code, doesn’t take much:

// This is assumed throughout, but we'll make it explicit here
template <typename It>
using List = Node<It>*;

template <typename It>
size_t length(List<It> li)
{
    size_t counter = 0;

    while (li)
    {
        ++counter;
        li = li->next;
    }

    return counter;
}

There are plenty of optimizations we can do to improve on this (it’s O(n)), but it should illustrate the basic idea. If you prefer a functional approach, you can do that, too:

// FP version
template <typename It>
size_t length(List<It> li)
{
    if (!li)
        return 0;
    else
        return 1 + length(li->next);
}

That one looks a lot better in a proper FP language, but I wanted to stick to a single language for this post.

Inserting new elements is just manipulating pointers, and changing their value can be done by altering the data members. In general, that’s all you need to know. Even higher level languages are largely based on this same foundation, no matter what interface their lists present to the programmer.

More links

But this singly-linked list can be a bit cumbersome. Almost every operation involves walking the list from the beginning, and there’s no real way to get to an earlier element. That very need naturally leads to the creation of the doubly-linked list.

Here, each element has two pointers: one to the next element, the other to the previous one. It’s anchored on both sides by null links, and it’s otherwise the same principle as the singly-linked version, with the only downside being a slight increase in memory use. In code, such a structure might look like this one:

template <typename It>
struct DNode
{
    It data;
    DNode<It>* prev;
    DNode<It>* next;
};

Code that doesn’t care about going “backwards” can ignore the prev pointer, meaning our length function from earlier works with doubly-linked lists, too. (We’d need to change the argument type, of course.) Now, though, we get to reap the rewards of having two links. We no longer need to worry about getting a pointer to the beginning of the list, for one; any pointer to a list element can now be used to find its start.

Doubly-linked lists are so much more useful than singly-linked ones that they’re really the default in modern times. The C++ standard library had only doubly-linked lists until 2011 brought the singly-linked std::forward_list, for instance. And high-level languages usually don’t give you a choice. If they use a linked list, it’ll have (at least) two links per node.

Another option

The drawback of linked lists is the time it takes to find anything in them. Most operations are going to require you to walk some, if not all, of the list. The bigger the list gets, the longer it takes to walk it. In other words, it’s O(n) time.

Different systems get around this in different ways. One possibility is to forgo the use of linked lists entirely, instead basing things around an array list. This is nothing more than a dynamically-sized array like the C++ vector or Java ArrayList, but it can be used like a list, except that it also has a random access interface.

Most of the time, an array list will have a reference to its own block of memory, enough to hold all its current elements plus a few extra. When it runs out of that space, it’ll allocate a new, larger buffer, and move everything to that. On the inside, it would look something like:

template <typename It>
struct AList
{
    size_t capacity;      // number of elements currently stored
    size_t max_capacity;  // number of elements the buffer can hold
    It* data;
};

Array lists work a lot better in an object-oriented system, because you can use methods to simplify the interface, but there’s no reason you need them. Here, for example, is a non-OOP access method for our AList above:

template <typename It>
It at(const AList<It>& li, size_t pos)
{
    if (pos >= li.capacity)
    {
        // Out-of-bounds error (throwing needs <stdexcept>)
        throw std::out_of_range("AList index out of range");
    }

    return li.data[pos];
}

Inserting is trickier, though, because of the possibility of reallocation:

// take the list by reference, since we're modifying it
template <typename It>
void insert(AList<It>& li, It item)
{
    if (li.capacity == li.max_capacity)
    {
        resize_array(li, li.max_capacity * 2);
    }

    li.data[li.capacity] = item;
    ++li.capacity;
}

Our hypothetical resize_array function would then fetch a new block of memory with double the space, copy all the elements over to it, update max_capacity, and change data to point to the new block. Not hard to do, but non-trivial, and the copy can be time-consuming with large array lists. (It runs in what’s called “amortized” O(n). If you don’t cause a reallocation, then inserts are constant-time. If you do, they’re linear, because of the copy.)

Array lists are probably used more often than linked lists as the backing for high-level languages, and they’re the go-to option even in C++, Java, and C#. If the system can allocate a big enough buffer at the start, then inserting at the end of an array list is no different, internally, from writing into an empty slot of a regular array. Deleting from the end is just as easy; it’s only insertions and deletions in the middle that force everything after them to shift over.

But the linked list approach will come in handy later on, when we look at trees, and it also has a few key advantages. As always, it’s a balance. If you need random access to elements, array lists are better. If you’re doing a lot of inserting and removing in the middle of the structure (or at the front), linked lists start to become a more attractive option. And if you’re using a high-level language, you’d use whatever is available. It still helps to know what you’re getting into, though. Knowing how the different types of list work can help you plan your own code’s structure better, even if you never have the choice of which kind to use.

Software internals: Classes

Whether you use object-oriented programming or not, you’ve most likely encountered the class and the object. You’re probably coding in a language that has or uses both of them. If you’re not (say you’re using C or something), you still might be using the idea of a class and an object. How they actually work depends on the language, library, or framework in question, but the basic implementation isn’t too different from one to the next.

Objects, no matter what kind you’re using, are data structures. Classes are data types. In many languages, both of these work together to make OOP possible. But you can have one without the other. JavaScript, for instance, has objects, but no classes. (ES6’s class is mere syntactic sugar over a prototype-based implementation.) Since that’s more common than the alternative (classes but no objects), we’ll look at objects first, then add in the bits that classes or their replacements provide.

Objects

Objects, from a language-neutral point of view, are made up of a number of different variables usually called fields. They’re a composite kind of data, like the C struct, except that higher-level languages have a lot more support for doing things with them. But that’s all they really are: a bunch of fields.

That already suggests one way of laying out an object in memory: allocate the fields consecutively. For a struct, you don’t need to do anything else except initialize the block of memory. If one of your fields is itself an object, then that’s okay. You can nest them. You have (rather, the compiler has) the knowledge of what goes where, so it’s not a problem.

The above option works fine in a system where objects are static, where their layouts don’t change, only the contents of their fields. Dynamic languages that let you add and remove object fields need a different approach. One that’s commonly used is a map. Basically, field names (as strings or a special, faster “symbol” type) are associated with chunks of data, and the object is nothing more than a collection of name-value pairs. Using hashes and other tricks (that we may visit in a future post), this can be very fast, though never as fast as direct memory access.

Methods

Methods are functions that just happen to have a special binding to a particular type of object. Different object systems define them in different ways, but that core is always the same. When code calls a method, that method “knows” which object it belongs to, even though it was (usually) defined generically.

Python and object-oriented C libraries like GTK+ make this explicit: every method takes an object as its first parameter. JavaScript takes a different tack, adding methods to the prototype object. In most other cases, methods become methods implicitly, by where they’re defined. C++, C#, and Java, for instance, simply let you define functions inside the class, and those are the methods for objects of that class. When they’re called, they receive a “hidden” parameter, a reference to the object they’re called on: this.

As functions can contain arbitrary code, we can’t exactly put them in memory with the fields. One, it kills caching, because you might have kilobytes of method code in between fields. Two, some operating systems have protection systems in place to prevent code and data from intermingling, for very good security reasons.

Instead of having the code and data together, we must separate them. But that’s fine. They don’t need to be mixed in together anyway. However methods are defined, they’ll always have some sort of connection to an object—a pointer or reference, in low-level terms—and they can use that to access its fields. Conversely, the structure of an object can contain function pointers that refer to its methods.
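
To make that concrete, here’s a small, entirely hypothetical sketch in JavaScript: the “object” is plain data, the methods live in a separate table of function references, and each method takes the object it works on as an explicit parameter, Python-style. (All the names here are made up for illustration.)

// free functions that receive the object explicitly
function accountDeposit(account, amount) {
    account.balance += amount;
}

function accountDescribe(account) {
    return account.owner + ": " + account.balance;
}

// the "class": a table of function references shared by every account
const accountMethods = {
    deposit: accountDeposit,
    describe: accountDescribe
};

// the "object": just fields, plus a reference to its method table
const myAccount = { owner: "alice", balance: 100, methods: accountMethods };

myAccount.methods.deposit(myAccount, 50);
myAccount.methods.describe(myAccount);  // "alice: 150"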

Inheritance

We don’t really start getting into object-oriented programming until we add in inheritance. Not coincidentally, here’s where the internals start to become more and more complex. Simple single inheritance lets an object take on parts of a parent object. It can use the parent’s methods as if they were its own, as well as some of the fields. Multiple inheritance is the same thing, but with more than one parent; it can get quite hairy, so most common languages don’t allow it.

The methods don’t really care whether they’re operating on a base or derived class, a parent or a child. For static languages, this is because of the way objects using inheritance are laid out in memory. Broadly speaking, the parent object’s fields come first, and those are followed by the child’s fields. As you go further down the inheritance chain, this means you can always backtrack to the root object. Just by doing that, we get a few things for free. A pointer to an object is the same as a pointer to its parent, for example. (GTK+ makes this explicit: objects are structs, and a child object simply lists its parent as its first field. Standard C memory access does the rest. Problem is, you have to use pointers for this to work, otherwise you get slicing, a mistake every C++ programmer knows all too well.)

Dynamic languages don’t get this benefit, but they all but emulate it. Objects might have a hidden field pointing to the parent object, or they may just copy the parent’s map of names and values into their own as their first act. The latter means extra metadata to keep track of which fields were defined where, but the former is slower for every access of an inherited field. It’s a classic size/speed tradeoff; most languages opt for the faster, but slightly more bloated, map-mixing approach.

For multiple inheritance, well, it’s a lot harder. In dynamic languages, it’s not quite as crazy, but the order of inheritance can make a difference. As an example, take a class C that inherits from classes D and E. If both of those have a field named foo, there’s a problem. C can’t have two foos, but the different base classes might use theirs in different ways. (The only modern static language I know that allows multiple inheritance is C++, and I don’t want to try to explain the memory scheme it uses. I’ll leave that to you to find out.)

Polymorphism

What makes object-oriented programming truly object-oriented is polymorphism. Child classes are allowed to effectively redefine the methods they inherit from their parents, customizing them, but the caller neither knows nor cares about this fact. This is used for abstraction, and it’s not immediately obvious how they do it.

Dynamic languages have a map for their methods, as they do for fields. For them, polymorphism is as easy as changing an entry in the map to refer to a different function…if that’s the way they choose to do it. Another option is only keeping track of the methods directly defined by this object, deferring lookup of all others to the parent class, which might pass it up another level, and so on. For a language using single inheritance, this is linear, and not too bad. With multiple inheritance, method resolution becomes a tree walk, and it can get quite intensive.

Static languages can take advantage of the fixed nature of their classes and reduce polymorphism to a table of function pointers. C++ programmers know this as the v-table (or vtbl), so called because polymorphic methods in that language are prefixed with the keyword virtual; hence, “virtual table”. This table is usually kept in memory somewhere close to the rest of the object, and it will contain, at the very least, a function pointer for each polymorphic method. Methods that were never declared polymorphic in the first place don’t have to be listed, but not every language lets you make that decision.

Construction and destruction

An object’s constructor isn’t necessarily a method. That’s because it also has to do the work of allocating memory for the object, setting up any inheritance-related framing (v-tables, prototypes, whatever), and general bookkeeping. Thus, the constructor doesn’t even have to be connected to the class. It could just as easily be a factory-like function. Destructors are the same way. They aren’t specifically methods, but they’re too bound to a class to be considered free functions. They have to deallocate memory, handle resource cleanup, call parent destructors, and so on.

On the lower levels, constructors and destructors aren’t really part of an object. The object can never call them directly in most languages. (In garbage-collected languages, nobody can call a destructor directly!) Therefore, they don’t need to know where they are. The same is generally true for the C++-specific notions of copy and move constructors. The only wrinkle comes in with inheritance in the case of destructors, and then only in C++: if a class has polymorphic methods but its destructor isn’t declared virtual, deleting a derived object through a base-class pointer skips the derived class’s cleanup. That’s a bad thing, and it’s a classic newbie mistake.

Next up

I’ll admit that this post felt a lot smoother than the last two. Next time, we’ll look at another data structure that shows up everywhere, the list. Linked lists, doubly-linked lists, list processing, we’ll see it all. As it turns out, there’s not too much to it. Maybe those Lisp guys were onto something…

Software internals: Strings

A string, as just about any programmer knows, is a bit of text, a sequence of characters. Most languages have some built-in notion of strings, usually as a fundamental data type on par with integers. A few older programming languages, including C, don’t have a separate “string” type, but they still have strings. Even many assemblers allow you to define strings in your assembly language code, though you’re left to deal with them yourself.

The early string

At its heart, a string really isn’t much more than a bunch of characters. It’s a sequence, like an array. Indeed, that’s one way of “making” strings: stuff some characters into an array that’s big enough to hold them. Very old code often did exactly that, especially with strings whose contents were known ahead of time. And there are plenty of places in modern C code where text is read into a buffer—nothing more than an array—before it is turned into a string. (This usually leads to buffer overflows, but that’s not the point.)

Once you actually need to start working with strings, you’ll want something better. Historically, there were two main schools of thought on a “better” way of representing strings. Pascal went with a “length-prefixed” data structure, where an integer representing the number of characters in the string was followed by the contents. For example, "Hi!" as a Pascal string might be listed in memory as the hexadecimal 03 48 69 21. Of course, this necessarily limits the length of a string to 255, the highest possible value of a byte. We could make the length field 16 bits (03 00 48 69 21 on a little-endian x86 system), bringing that to 65535, but at the cost of making every string a byte longer. Today, in the era of terabyte disks and gigs of memory, that’s a fair trade; not so in older times.

But Pascal was very much intended more for education and computer science than for run-of-the-mill software development. On the other side of the fence, C took a different approach: the null-terminated string. C’s strings aren’t their own type, but an array of characters ending with a null (00) byte. Thus, our example in C becomes 48 69 21 00.

Which style of string is better is still debated today, although modern languages typically don’t use a pure form of either of them. Pascal strings have the advantage of easily finding the length (it’s right there!), while C’s strlen has to count characters. C strings also can’t have embedded null bytes, because all the standard functions will assume that the null is only at the end. On the other hand, a few algorithms are easier with null-terminated strings, they can be as long as you like, and they’re faster if you don’t need the length.

In modern times

In today’s languages, the exact format of string doesn’t matter. What you see as the programmer is the interface. Most of the time, that interface is similar to the array, except with a few added functions for comparison and the like. In something like C#, you can’t really make your own string type, nor would you want to. But it’s helpful to know just how these things are implemented, so you’ll know their strengths and weaknesses.

Since everything ultimately has to communicate with something written in C, there’s probably a conversion to a C-style string somewhere in the bowels of any language. That doesn’t mean it’s what the language works with, though. A Pascal-like data structure is perfectly usable internally, and it’s possible to use a “hybrid” approach.

Small strings are a little special, too. As computers have gotten more powerful, and their buses and registers have grown wider, strings of a few characters can often be loaded in a single memory access. Some string libraries use this to their advantage, keeping a “small” string in an internal buffer inside the string object itself. Once the string outgrows that buffer (which is typically a bit bigger than a bare pointer, since it reuses the object’s own bookkeeping space), moving the characters into dynamic memory becomes the better deal, space-wise. (Cache concerns can push the threshold of this “small string optimization” up a bit.)
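Here’s one way the trick might be sketched in C; real libraries differ in the details, especially in where they hide the small/large flag, so treat this as a toy:

```c
#include <stdlib.h>
#include <string.h>

/* A toy small-string layout: short strings live in the object itself,
   long ones live in a separately allocated buffer. */
struct sso_string {
    size_t length;
    int    is_small;          /* real libraries hide this flag in spare bits */
    union {
        char  small[16];      /* inline buffer: no allocation needed */
        char *heap;           /* heap buffer for everything else */
    } storage;
};

static void sso_init(struct sso_string *s, const char *text)
{
    s->length = strlen(text);
    s->is_small = s->length < sizeof s->storage.small;
    if (s->is_small) {
        memcpy(s->storage.small, text, s->length + 1);
    } else {
        s->storage.heap = malloc(s->length + 1);   /* error handling omitted */
        memcpy(s->storage.heap, text, s->length + 1);
    }
}
```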

There are also a few algorithms and optimizations that string libraries can use internally to speed things up. “Copy-on-write” means just that: a new copy of a string isn’t created until there’s a change. Until then, two variables can point to the same memory location. The string’s contents are the same, so why bother taking up space with exact copies? This also works for “static” strings whose text is fixed; Java, for one, interns its string literals, so identical constants share a single object.
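A toy version of copy-on-write, with a shared reference count (creation, destruction, and error handling left out), might look like this:

```c
#include <stdlib.h>
#include <string.h>

/* A toy copy-on-write string: the characters carry a reference count,
   and "copying" just adds another reference. */
struct cow_buffer {
    int  refs;
    char data[];              /* flexible array member holding the text */
};

struct cow_string {
    struct cow_buffer *buf;
};

static struct cow_string cow_copy(struct cow_string s)
{
    s.buf->refs++;            /* no characters copied at all */
    return s;
}

static void cow_set_char(struct cow_string *s, size_t i, char c)
{
    if (s->buf->refs > 1) {   /* shared? make a private copy before writing */
        struct cow_buffer *fresh =
            malloc(sizeof *fresh + strlen(s->buf->data) + 1);
        fresh->refs = 1;
        strcpy(fresh->data, s->buf->data);
        s->buf->refs--;
        s->buf = fresh;
    }
    s->buf->data[i] = c;
}
```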

UTF?

Nowadays, there’s a big problem with treating strings as nothing more than an array of characters. That problem is Unicode. Of course, Unicode is a necessary evil, and it’s a whole lot better than the mess of mutually incompatible solutions for international text that we used to have. (“Used to”? Ha!) But Unicode makes string handling dramatically harder, particularly for C-style strings, because it breaks a fundamental assumption: one byte equals one character.

Since the world’s scripts together have far more than 256 characters (the most a single byte can distinguish), we have to do something. So we have two options. One is a fixed-size encoding, where each character (or code point) takes the same amount of space. Basically, it’s ASCII extended to more bits per character. UTF-32 does this, at the huge expense of making every code point 4 bytes. Under this scheme, any plain ASCII string is inflated to four times its original size.

The alternative is a variable-length encoding, as in UTF-8. Here, part of the “space” in each storage unit (a byte for UTF-8, 2 bytes for UTF-16) is reserved to mark a continuation. For example, the character ë has the Unicode code point U+00EB. In UTF-8, that becomes C3 AB. The first byte being greater than 7F (decimal 127) marks this as a non-ASCII character, and its high bits say how many continuation bytes follow. In UTF-32, by contrast, ë comes out as 000000EB, twice as big.
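To make the bit-fiddling concrete, here’s a small encoder covering code points up to U+FFFF (the full scheme goes to four bytes); feeding it U+00EB produces exactly the C3 AB above:

```c
#include <stdio.h>

/* Encode a Unicode code point (here, up to U+FFFF) as UTF-8.
   Returns the number of bytes written to `out`. */
static int utf8_encode(unsigned int cp, unsigned char out[3])
{
    if (cp < 0x80) {                      /* plain ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {              /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else {                              /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
}

int main(void)
{
    unsigned char bytes[3];
    int n = utf8_encode(0x00EB, bytes);   /* U+00EB, the character ë */
    for (int i = 0; i < n; i++)
        printf("%02X ", bytes[i]);        /* prints: C3 AB */
    printf("\n");
    return 0;
}
```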

The rules for handling Unicode strings are complex and unintuitive. Once you add in combining diacritics, the variety of spaces, and all the other esoterica, Unicode becomes far harder than you can imagine. And users of high-level, strings-are-black-boxes languages aren’t immune. JavaScript, for instance, exposes strings as sequences of 16-bit code units, essentially UCS-2, so anything outside the Basic Multilingual Plane doesn’t fit in a single unit. Until very recently, if you wanted to work with those “high plane” characters, including emoji, you had some tough times ahead. So there’s still the possibility, in 2016, that you might need to know the internals of how strings work.

Software internals: Arrays

I’m a programmer. I think that should be obvious. Even though most of my coding these days is done at a pretty high level, I still enjoy the intricacies of low-level programs. It’s great that we have ways to hide the complexities inherent in software, but sometimes it’s fun to peel back the layers and look at just how it works. That’s the idea behind this little series of posts. I want to go into that deeper level, closer to the metal. First up, we’ll take a peek at that simplest and dumbest of data structures: the array.

Array, array

At the basic level, an array is nothing more than a sequence of values. Different languages have different rules, especially regarding types, but the idea is always the same. The values in an array are its elements, and they can be identified by an index. In C, for example, a[0] has the index 0, and it refers to the first element of the array named a. (C-style languages start counting from 0. We’ll see why shortly.)

For the purposes of this post, we’ll start with the simplest kind of array, a one-dimensional array whose values are all of the same type—integers, specifically. Later, we can expand on this, but it’s best to start small. Namely, we’ll have an array a with four values: {1, 2, 3, 4}. Also, we’ll mainly be using lower-level languages like C and C++, since they give the best look at how the code really runs.
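In C, that setup is a one-liner, and sizeof confirms there’s nothing in memory beyond the four values themselves:

```c
#include <stdio.h>

int main(void)
{
    int a[] = {1, 2, 3, 4};

    /* The array is exactly four ints, nothing more. */
    printf("%zu bytes total, %zu per element\n",
           sizeof a, sizeof a[0]);        /* typically: 16 bytes total, 4 per element */

    for (size_t i = 0; i < sizeof a / sizeof a[0]; i++)
        printf("a[%zu] = %d\n", i, a[i]);
    return 0;
}
```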

In memory

One of the main reasons to use something like C++ is because of memory concerns, so let’s look at how such an array is set up in memory. On my PC, using 64-bit Debian Linux and GCC 5.3, it’s about as simple as can be. The compiler knows all the values beforehand, so all it does is put them in a “read-only data” section of the final executable. (In the assembly output, this shows up as .long statements in the .rodata section.) The elements of the array are in contiguous locations; that’s not just required by the C standard, but by the very definition of an array. It also makes them fast, especially when cache comes into play.

In C++, 4 integers in an array take up the space of, well, 4 integers. With the usual 4-byte int, that’s 16 bytes, whether you’re on a 32-bit or a 64-bit system. There’s no overhead, because an array at this level is literally nothing more than a sequence of memory locations.

That contiguous layout makes working with the array trivial. Given an array a of n-byte elements, the first element (index 0) is at the same address as the array itself; in C parlance, &a[0] and &a compare equal as addresses, even though their types differ. To find any other element, multiply its index by the element size and add that to the base address: a[i] lives at (char *)a + i * sizeof(a[0]). Addition is just about the fastest thing a processor does, and element sizes that are powers of 2 mean the multiplication is nothing more than a bit shift, so array indexing is hard to beat.
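You can watch that formula hold: the assertion below redoes the same arithmetic by hand on raw byte addresses.

```c
#include <assert.h>
#include <stdio.h>

int main(void)
{
    int a[] = {1, 2, 3, 4};
    char *base = (char *)a;               /* byte address of the array */

    for (size_t i = 0; i < 4; i++) {
        /* a[i] lives exactly i * sizeof(int) bytes past the start */
        assert((char *)&a[i] == base + i * sizeof(int));
    }
    printf("indexing is just base + i * element size\n");
    return 0;
}
```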

Copying these arrays is easy, too: copy each element, and you’re done. Want to compare them? Nothing more than going down the line, looking for differences. Sure, that takes linear—O(n)—time, but it’s a great start. Of course, there are downsides, too. Arrays like this are fixed in size, and they all have to be the same type.
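Spelled out, copying and comparing are exactly the loops you’d expect (in real code you’d reach for memcpy and memcmp, which do the same job faster):

```c
#include <stdbool.h>
#include <stddef.h>

/* Copy n ints from src to dst, one element at a time. */
void copy_ints(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Compare two int arrays element by element: O(n) time. */
bool equal_ints(const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return false;
    return true;
}
```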

Complicating matters

There’s not much more to be said for the humble array, so let’s add some kinks. To start, what do you do if you don’t know all the values to begin with? Then, you need an uninitialized array, or a buffer. Compilers typically use a trick called a BSS segment to make these, while higher-level languages tend to initialize everything to a null value. Either way, all you really get is a block of memory that you’re expected to fill in later.
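In C, any static or file-scope array without an initializer lands there: the executable records only its size, and the loader hands you zero-filled memory at startup.

```c
#include <stdio.h>

/* No initializer, so this 4 KB array goes in the BSS segment: the
   executable stores only its size, and it's zero-filled at load time. */
static int buffer[1024];

int main(void)
{
    printf("buffer[0] starts out as %d\n", buffer[0]);   /* prints 0 */
    return 0;
}
```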

Changing the other assumptions of the array (fixed size and type) means changing the whole structure. Dynamically-sized arrays, like C++’s vector, need a different way of doing things. Usually, this means something like having an internal array, with a bit of room to grow, plus some bookkeeping data. That gets into dynamic memory allocation, another topic for later, but from the programmer’s point of view, they work the same way. In fact, vector is required to store its elements contiguously, so a pointer to its first element can be passed to anything expecting a plain array. (If you want arrays where elements can be of different types, like in JavaScript, then you have to abandon the simple mathematical formula and its blazing speed. At that point, you’re better off ditching the “array” concept completely.)
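A bare-bones version of that bookkeeping might look like the sketch below (ints only, no error handling, nothing like the real vector’s interface):

```c
#include <stdlib.h>

/* A minimal growable array of ints: a heap buffer plus bookkeeping. */
struct int_vec {
    int    *data;
    size_t  size;       /* elements in use */
    size_t  capacity;   /* elements we have room for */
};

static void vec_push(struct int_vec *v, int value)
{
    if (v->size == v->capacity) {
        /* Out of room: grow geometrically so pushes stay cheap on average. */
        size_t new_cap = v->capacity ? v->capacity * 2 : 8;
        v->data = realloc(v->data, new_cap * sizeof *v->data);  /* error handling omitted */
        v->capacity = new_cap;
    }
    v->data[v->size++] = value;
}
```

Starting from struct int_vec v = {0};, repeated calls to vec_push fill it just as if it were a plain array that happens to grow on demand.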

Moving up to higher levels doesn’t really change how an array functions. At its core, it’s still a sequence of values. One of JavaScript’s newer features is the typed array, which is exactly that. It’s intended to be used where speed is of the essence, and it’s little more than a friendlier layer on top of C-style arrays.

Implementation details

Just about every usable language already has something like an array, so there’s almost never a need to make one yourself. Indeed, in most high-level languages it’s nearly impossible, since you have no direct access to memory. But maybe you’re working in assembly language. There, you don’t have the luxury.

Fixed-size arrays are nothing more than blocks of memory. If your array has n elements, and each one is size s, then you need to reserve n * s bytes of memory. That’s it. There’s your array. If you need it initialized, then fill it in as necessary.

Element access uses the formula from above. You need to know the address a of the array, the element size s, and the index i. Then, accessing an element is nothing more than loading the value at a + i * s. Note, though, that this means elements are numbered starting at 0, which is exactly why C-style languages count from 0 in the first place.

Since arrays are dumb, you can pass them around as blocks of memory, but you always need to know their size. If you’re not careful, you can easily get buffer overflows and other out-of-bounds conditions. That’s the reason why so many “safe” C functions like snprintf take an extra “maximum size” argument. The array-as-memory-block notion, incidentally, is why C lets you treat pointers and arrays as the same thing.
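That size argument is the whole safety net; a typical use, with the buffer size passed alongside the buffer, looks like this:

```c
#include <stdio.h>

int main(void)
{
    char out[32];

    /* snprintf never writes more than sizeof out bytes, terminator included,
       so a too-long result is truncated instead of trampling memory. */
    snprintf(out, sizeof out, "value = %d", 123456);
    puts(out);
    return 0;
}
```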

The end

The array, in whatever form, is the most basic of data structures, so it made for a good starting point. I hope it set the tone for this series. Later on, I’ll get into more complicated structures and algorithms, like linked lists, sorting, and so on. It’s all stuff that programmers in something like JavaScript never need to worry about, but it’s fun to peek under the hood, isn’t it?