Software internals: Trees

We’ve talked quite a bit about data structures in this series, although not as much in recent entries. We’ve seen lists and arrays, two of the most important structures around, but they both have a severe limitation: they’re linear. While that’s great for a lot of data that computer programs manipulate, there comes a time when you need to represent both the data and its organization, and that calls for something more substantial.

The tree is a data structure that represents a hierarchy. Like a real tree, it has a root. It has branches and leaves. And a collection of data trees is also called a “forest”. But those similarities are a bit artificial; the terms were chosen for the analogy. What these trees do isn’t much like the natural ones. To a programmer, however, they’re far more useful.

In the abstract

A tree starts with a root. This is a data node, like one in a linked list, and it can hold one value of any type. If you’ll recall, nodes in linked lists also carry along a pointer to the following node (and the preceding one, for a doubly-linked list). Well, trees do something like that, too. The root node, in addition to its data value, also contains a list of pointers to its children.

These children are root nodes, and thus trees in their own right. Each will have its own data and list of children, and those, in turn, will have the same. If the child list is empty, then that node is at the “end” of one branch; it’s a leaf node. (If there are children, you might be wondering, are there parents? Yes. A node having another node in its child list is that second node’s parent. But data trees are asexual: each child only has one parent.)

Trees are a perfect example of recursion. A tree’s children are also trees (leaf nodes are simply trees with an empty child list), and they can be treated as such. The most common algorithms lend themselves naturally to a recursive style even in imperative languages. So that’s how we’ll be looking at them here.

Growing a tree

Trees consist of nodes, and each node has two main parts: a value and a list of children. Thus, a simple way of representing a tree is as an object (or similar structure) with two fields, like so. (This example uses TypeScript for illustration purposes.)

class TreeNode {
    value: any;
    children: TreeNode[];
}

Not that hard, huh? It’s nothing more than putting our words into code. A TreeNode’s value could be anything, though you’d probably restrict it in a real application, and its children are more TreeNodes. Adding, removing, and altering children work in almost the same way as with linked lists, at least with this “basic” tree setup.
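
To make that concrete, here’s a minimal sketch of adding and removing children, assuming the TreeNode class above. The addChild and removeChild helpers are purely illustrative, not part of any standard API.

// Hypothetical helpers for manipulating a TreeNode's child list.
function addChild(parent: TreeNode, value: any): TreeNode {
    const child = new TreeNode();
    child.value = value;
    child.children = [];
    parent.children.push(child);    // append to the child list, much like a linked-list insert
    return child;
}

function removeChild(parent: TreeNode, child: TreeNode): void {
    const index = parent.children.indexOf(child);
    if (index !== -1) {
        parent.children.splice(index, 1);    // drop the entire subtree rooted at `child`
    }
}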

Traversing the tree (sometimes called walking it) is a different matter. Here, we want to go through each of the root’s children, then their children, and so on, until we’ve visited every node in the tree. There are two main ways to do this, depth-first and breadth-first, but the depth-first method is more common and easier to explain. In code, it looks about like this:

function traverse(tree, callback) {
    // Visit each child's entire subtree first, then the node itself (post-order).
    for (const child of tree.children) {
        traverse(child, callback);
    }
    callback(tree);
}

This algorithm is “depth-first” because it visits each of a node’s children before moving to its siblings. This is where the recursion comes into play. We loop through each of the children of the root node. They, in turn, become the roots of a new depth-first traversal. Only when we’ve exhausted the “depth” of the tree—when we reach a leaf node—do we start doing anything. (We’re assuming here that our tree doesn’t have any cycles, child nodes that point to higher levels of the tree. Those make things much more difficult.)
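
If you ever do need to handle cycles, one common guard (not part of the function above, just a sketch) is to carry along a set of nodes you’ve already visited:

// A sketch of a cycle-tolerant variant, assuming the TreeNode class above.
// The `visited` set is an illustrative addition; a proper tree never needs it.
function traverseSafe(tree: TreeNode, callback: (node: TreeNode) => void,
                      visited: Set<TreeNode> = new Set()): void {
    if (visited.has(tree)) return;    // already seen this node, so don't loop forever
    visited.add(tree);

    for (const child of tree.children) {
        traverseSafe(child, callback, visited);
    }
    callback(tree);
}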

Now, there are a lot of things you can do with the traverse function, but I’ve limited it to a simple, nebulous callback that is run for each node. In the abstract, that’s all there is to it. That simple block of code effectively describes the operation of some code generators, AI systems, and many other complex actions.
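
As a quick, made-up illustration (the values and the addChild helper come from the sketch above, not from any real program), here’s traverse collecting every value in a small tree:

// Build a tiny tree by hand, then gather its values with a callback.
const root = new TreeNode();
root.value = "root";
root.children = [];

const docs = addChild(root, "docs");
const src = addChild(root, "src");
addChild(docs, "readme");
addChild(src, "main");

const values: any[] = [];
traverse(root, node => values.push(node.value));

// Depth-first, post-order: children appear before their parents.
console.log(values);    // ["readme", "docs", "main", "src", "root"]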

Special kinds of trees

The general tree described above suffices for a great many applications. Sometimes, though, it’s more efficient to be more restricted. Computer scientists have therefore developed a number of specializations of the tree for different uses.

The binary tree is probably the most well-known of these. It’s called “binary” because each node has exactly two children: left and right. These children are themselves trees, of course, but empty trees are often represented as null values. Here’s what it looks like in code, once again using TypeScript:

class BinaryTree {
    data: any;
    left: BinaryTree | null;
    right: BinaryTree | null;
}

Binary trees add an extra wrinkle to the depth-first traversal we saw earlier. Now, we have three possible ways to do things: pre-order, in-order, and post-order traversal. The only thing that changes is when we process the “current” node.

function preOrder(node, callback) {
    if (node == null) return;

    callback(node.data);
    preOrder(node.left, callback);
    preOrder(node.right, callback);
}

function inOrder(node, callback) {
    if (node == null) return;

    inOrder(node.left, callback);
    callback(node.data);
    inOrder(node.right, callback);
}

function postOrder(node, callback) {
    if (node == null) return;

    postOrder(node.left, callback);
    postOrder(node.right, callback);
    callback(node.data);
}

Each approach has its pros and cons. Pre-order works for copying trees, and it’s essentially how compilers and interpreters walk the expressions in an abstract syntax tree, or AST. In-order traversal is best for sorted trees, where an extra “key” value is added to each node, arranged so that it’s greater than all the keys in its left subtree but smaller than all those in the right; these binary search trees are used for set structures, lookup tables, and the like. (Maybe we’ll look at them in more detail later in the series.) Finally, post-order is the option of choice for deleting a node’s children and for producing RPN expressions.
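
To make the in-order case concrete, here’s a rough sketch of a binary search tree built on the BinaryTree class above. The insert helper and the numeric keys are illustrative only:

// An illustrative binary search tree: smaller keys to the left, larger to the right.
function insert(node: BinaryTree | null, key: number): BinaryTree {
    if (node == null) {
        const leaf = new BinaryTree();
        leaf.data = key;
        leaf.left = null;
        leaf.right = null;
        return leaf;
    }

    if (key < node.data) {
        node.left = insert(node.left, key);      // smaller keys go into the left subtree
    } else {
        node.right = insert(node.right, key);    // larger (or equal) keys go to the right
    }
    return node;
}

let bst: BinaryTree | null = null;
for (const key of [5, 2, 8, 1, 3]) {
    bst = insert(bst, key);
}

// An in-order traversal visits the keys in sorted order: 1, 2, 3, 5, 8
inOrder(bst, key => console.log(key));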

The binary tree also has a few of its own alterations. The red-black tree uses a “color bit” to keep the tree balanced. This is helpful because most algorithms working on binary trees work best when the total numbers of left and right nodes are about equal. (In the worst case, all of a node’s descendants are on one “side”. Then, you’re left with just a more expensive linked list.) The AVL tree is a similar variation that has its own pluses and minuses.
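
To see that worst case in action, here’s a tiny sketch reusing the illustrative insert helper from above: feeding it keys that are already sorted chains every node down one side.

// Inserting sorted keys into an unbalanced BST puts every node on the right,
// so the "tree" degenerates into a linked list.
let skewed: BinaryTree | null = null;
for (const key of [1, 2, 3, 4, 5]) {
    skewed = insert(skewed, key);
}

// Follow the right pointers: the depth equals the number of keys, so a lookup
// takes linear time instead of the logarithmic time a balanced tree would give.
let depth = 0;
for (let node = skewed; node != null; node = node.right) {
    depth++;
}
console.log(depth);    // 5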

B-trees are a special case that sits between binary trees and the general category of trees. They’re allowed to have more than two children, but only up to a certain number. The trick is that each node is given a number of sort keys (like the one in a binary search tree), one for every child past the first. These are calculated so they fit “between” the children, allowing them to serve as index values, speeding up traversal. B-trees are mostly used in databases and filesystems—the two are practically the same thing these days—including the Windows filesystem of choice, NTFS, Apple’s HFS+, and Ext4 and Btrfs on Linux.
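
As a rough sketch of how those keys act as index values (a deliberately simplified node layout, nothing like an actual database page), searching a B-tree node looks something like this:

// A simplified, illustrative B-tree node: `keys` is kept sorted, and an internal
// node has exactly one more child than it has keys.
class BTreeNode {
    keys: number[] = [];
    children: BTreeNode[] = [];
}

function btreeSearch(node: BTreeNode, key: number): boolean {
    // Find the first key that is not smaller than the one we want.
    let i = 0;
    while (i < node.keys.length && key > node.keys[i]) {
        i++;
    }

    // Found it in this node.
    if (i < node.keys.length && node.keys[i] === key) {
        return true;
    }

    // A leaf with no match means the key isn't in the tree at all.
    if (node.children.length === 0) {
        return false;
    }

    // Otherwise, the keys tell us which child's range the key falls into.
    return btreeSearch(node.children[i], key);
}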

Seeing the forest

Trees are everywhere in programming. Whenever you have a set of data with obvious parent-child relationships, it’s probably going to be represented in tree form. Programming languages often don’t offer direct access to the trees, but they’re in there. Ordered sets and maps typically use them under the hood. Any kind of parsing, such as XML or a game’s custom scripting language, is going to be tree-based. Even if you never see them, you’ll be using them somewhere.

Fortunately, if you’re not dealing with the innards of the structures themselves, there’s not much you need to worry about. All the hard work has been taken care of. No one needs to manually balance a binary search tree, for instance. But if you work with them directly, it helps to know how they work.
