Software internals: Lists

Up to now, we’ve been looking at the simplest of data structures. Arrays are perfectly linear. Strings are mostly so, until you add in the complications of modern character sets. Objects need a little bit more help, but they can often be reduced to arrays of data. Not so this time around.

Lists are incredibly important to a lot of programming work. In today’s popular languages, they’ve essentially taken over the fundamental position once occupied by arrays. In Python, Ruby, and so many other languages, it’s the list that is the main sequence. In Lisp and Clojure, it’s the universal one, in the same way that NAND is the universal gate in electronics: everything else can be built from it.

But why are lists so useful? Unlike arrays, they lend themselves to dynamic allocation. Elements packed into an array don’t have anywhere to go. If you want to resize the structure, you often have to move everything around. In C++, for example, a vector (essentially a dynamically-sized array) can have elements added and removed, but that can trigger a full resizing, which can move the vector to a different address in memory, thus rendering all your pointers and iterators useless. Higher-level languages don’t have these problems with arrays, but lists never have them in the first place. They’re made to be changed.

Linked lists

The most common method for making a list is the linked list. At its core, a linked list is nothing but a sequence of nodes. A node is typically defined as a structure with a data member (one piece of the data you’re storing in the list) and a link member. This will usually be a pointer to the next node in the list, or, for the last item, a null pointer. In code, it might look like this:

// It is "item type"
struct Node<It>
{
    It data;
    Node<It>* next;
};

A list, then, could be represented as nothing more than a pointer to the first Node. By “walking” through each next pointer, a function can visit each node in order. And that’s all there really is to it.

Working with this kind of linked list isn’t too hard. Finding its length, a common operation in code, doesn’t take much:

// This is assumed throughout, but we'll make it explicit here
using List<It> = Node<It>*;

size_t length(List<It> li)
{
    size_t counter = 0;

    while (li)
    {
        ++counter;
        li = li->next;
    }

    return counter;
}

There are plenty of optimizations we can do to improve on this (it’s O(n)), but it should illustrate the basic idea. If you prefer a functional approach, you can do that, too:

// FP version
size_t length(List<It> li)
{
    if (!li)
        return 0;
    else
        return 1 + length(li->next);
}

That one looks a lot better in a proper FP language, but I wanted to stick to a single language for this post.

Inserting new elements is just manipulating pointers, and changing their value can be done by altering the data members. In general, that’s all you need to know. Even higher level languages are largely based on this same foundation, no matter what interface their lists present to the programmer.

More links

But this singly-linked list can be a bit cumbersome. Almost every operation involves walking the list from the beginning, and there’s no real way to get to an earlier element. That very need naturally leads to the creation of the doubly-linked list.

Here, each element has two pointers: one to the next element, the other to the previous one. It’s anchored on both sides by null links, and it’s otherwise the same principle as the singly-linked version, with the only downside being a slight increase in memory use. In code, such a structure might look like this one:

struct DNode<It>
{
    It data;
    DNode<It>* prev;
    DNode<It>* next;
}

Code that doesn’t care about going “backwards” can ignore the prev pointer, meaning our length function from earlier works with doubly-linked lists, too. (We’d need to change the argument type, of course.) Now, though, we get to reap the rewards of having two links. We no longer need to worry about getting a pointer to the beginning of the list, for one; any pointer to a list element can now be used to find its start.

Doubly-linked lists are so much more useful than singly-linked ones that they’re really the default in modern times. The C++ standard library had only doubly-linked lists until 2011 brought slist, for instance. And high-level languages usually don’t give you a choice. If they use a linked list, it’ll have (at least) two links per node.

Another option

The drawback of linked lists is the time it takes to find anything in them. Most operations are going to require you to walk some, if not all, of the list. The bigger the list gets, the longer it takes to walk it. In other words, it’s O(n) time.

Different systems get around this in different ways. One possibility is to forgo the use of linked lists entirely, instead basing things around an array list. This is nothing more than a dynamically-sized array like the C++ vector or Java ArrayList, but it can be used like a list, except that it also has a random access interface.

Most of the time, an array list will have a reference to its own block of memory, enough to hold all its current elements plus a few extra. When it runs out of that space, it’ll allocate a new, larger buffer, and move everything to that. On the inside, it would look something like:

struct AList<It>
{
    size_t capacity;
    size_t max_capacity;
    It* data;
}

Array lists work a lot better in an object-oriented system, because you can use methods to simplify the interface, but there’s no reason you need them. Here, for example, is a non-OOP access method for our AList above:

It at(AList<It> li, size_t pos)
{
    if (pos >= li.capacity)
        // Out-of-bounds error...

    return li.data[pos];
}

Inserting is trickier, though, because of the possibility of reallocation:

void insert(AList<It> li, It item)
{
    if (li.capacity == li.max_capacity)
    {
        resize_array(li, li.max_capacity * 2);
    }

    data[capacity] = item;
    ++capacity;
}

Our hypothetical resize_array function would then fetch a new block of memory with double the space, copy all the elements over to it, update max_capacity, and change data to point to the new block. Not hard to do, but non-trivial, and the copy can be time-consuming with large array lists. (It runs in what’s called “amortized” O(n). If you don’t cause a reallocation, then inserts are constant-time. If you do, they’re linear, because of the copy.)

Array lists are probably used more often than linked lists as the backing for high-level languages, and they’re the go-to option even for C++, Java, and C#. That’s because you tend to do a lot of insertions into their lists. If the system can allocate a big enough buffer at the start, then inserting into an array list is no different, internally, from inserting into an empty space in a regular array. Deletions are always that easy.

But the linked list approach will come in handy later on, when we look at trees, and they also have a few key advantages. As always, it’s a balance. If you need random access to elements, array lists are better. If you’re doing a lot of inserting at the ends of the structure, and not all at once, linked lists start to become a more attractive option. And if you’re using a high-level language, you’d use whatever is available. It still helps to know what you’re getting into, though. Knowing how the different types of list work can help you plan your own code’s structure better, even if you never have the choice of which kind to use.

Leave a Reply

Your email address will not be published. Required fields are marked *