Software internals: Sorting

We’ve looked at quite a few data structures in this series, from simple arrays to objects. Now, let’s turn our focus to algorithms. One class of algorithms is very important to high-level programming: the sorting algorithms. For decades, computer scientists have been developing new ways to sort data, trying to balance the needs of size and speed, and there’s no silver bullet. Some sorting algorithms are good for general use, but they fall apart on certain degenerate cases. Others are bad all around, but they’re interesting from a teaching perspective.

In this post, we’ll look at three specific algorithms that illustrate the evolution of sorting and some of its trade-offs. Mostly, we’ll treat each one as working on a simple array of integers that we want sorted in increasing order, but it’s easy enough to generalize the sort function to whatever you need. Also, the code will all be in JavaScript, but I won’t be using any trickery or obfuscation, so it shouldn’t be too hard to convert to your favorite language. (Of course, most languages—JavaScript included—already have sorting functionality built into their standard libraries, so it’s better to use those than to write your own.)

Simple and stupid

Bubble sort is the first sorting algorithm many coders see. It’s stupidly simple: move through a list, comparing each element to the one before it. If they’re out of order, swap them. When you get to the end, start over, and repeat until the list is sorted. In code, it might look like this:

function bubblesort(arr) {
    var len = arr.length;

    do {
        var swapped = false;

        // run through the list
        // compare each element to the last
        // swap those that are out of order
        for (var i = 1; i < len; i++) {
            if (arr[i] < arr[i-1]) {
                // swap elements
                var temp = arr[i];
                arr[i] = arr[i-1];
                arr[i-1] = temp;
                swapped = true;
            }
        }

        // optimization (see below)
        len--;

        // repeat until everything's sorted
    } while (swapped);
}

Even this is already optimized a bit from the basic bubble sort method. Because of the way higher values “bubble up” to the end of the list (hence the name “bubble sort”), we know that, after n iterations, the last n values will be sorted. Thus, the line len-- tells the loop to ignore those values.

That optimization doesn’t help the overall performance of bubble sort. It remains O(n²), not very good. But bubble sort’s value is in its simplicity. You wouldn’t use it in the real world, but it’s a good way to show how sorting works. It’s easy to follow along with it. Take a handful of small numbers and try sorting them by hand, maybe even on paper. It’s not that hard.

Quick and dirty

For a few decades, the go-to choice for sorting has been quicksort. It’s the reason that C’s standard sorting function is called qsort, and it’s still used in many languages, both for performance and simplicity. (Haskell lovers will be quick to point out that that language can implement quicksort in two lines of code.)

Quicksort works by a recursive approach that boils down to “divide and conquer”. Imagine a list as a line of values. Now pick one of those values as the pivot. Take everything less than the pivot and sort it, then take everything greater, and sort that. (This is the recursive step.) Your sorted list is everything in the first sublist, then all the values equal to the pivot, then the second sublist. Or, in code:

function quicksort(arr) {
    if (arr.length <= 1) {
        // recursion base case
        return arr;
    }

    // find the pivot value
    var pivotIndex = parseInt(arr.length / 2);
    var pivotValue = arr[pivotIndex];

    // these will hold our sublists
    var left = [], right = [], pivots = [];

    // partition the list into three:
    // 1. less than the pivot
    // 2. greater than the pivot
    // 3. equal to the pivot
    for (var i = 0; i < arr.length; i++) {
        if (arr[i] < pivotValue) {
            left.push(arr[i]);
        } else if (arr[i] > pivotValue) {
            right.push(arr[i]);
        } else {
            pivots.push(arr[i]);
        }
    }

    // the sorted list is (left + pivots + right)
    return quicksort(left).concat(pivots).concat(quicksort(right));
}

That’s essentially an expanded version of the Haskell two-liner. It’s not the best from a memory or speed standpoint, but it works to show you the way the algorithm works. Another way works in-place, directly operating on the array by swapping elements around so that all the values less than the pivot are placed before it, then putting those greater after it, and then recursing on the resulting partially-sorted list. That one is a lot faster, but it’s a bit harder to grasp. It also needs either a helper function or a bit of logic to allow both sorting of an entire list and of a portion of one.

The gains from that added complexity are huge, though. With it, quicksort becomes one of the faster sorting methods around, and its space efficiency (with the in-place version) can’t be beat. That’s why quicksort remains so popular, even despite some well-known shortcomings. It’s good enough for most purposes.

Good all around

The last of the “simple” sorts we’ll look at is merge sort. This one is very much like quicksort in that it uses a strategy of repeatedly subdividing a list, but it works without a pivot element. At each step, it breaks the list in half and sorts each half separately. (A list with only one element is sorted by definition, so that’s the stopping point.) Then, it merges those halves an element at a time. Here’s the code:

function mergesort(arr) {
    // An element with 0-1 elements is always sorted
    if (arr.length < 2) {
        return arr;
    }

    // break the list into halves
    var middle = arr.length / 2;
    var left = arr.slice(0, middle);
    var right = arr.slice(middle);

    // sort each half separately
    left = mergesort(left);
    right = mergesort(right);

    // now merge the halves
    // take the 1st element of each array, and compare
    // the lower one moves into the "result" list
    // repeat until there's nothing left
    var result = [];
    while (left.length && right.length)
    {
        if (left[0] <= right[0])
        {
            result.push(left.shift());
        } else {
            result.push(right.shift());
        }
    }
    // add in anything we didn't get
    // (in case we had uneven lists)
    result = result.concat(left).concat(right);

    return result;
}

Merge sort uses more space than a good quicksort, but it’s generally faster, and it’s especially good for sorting linked lists. It isn’t the fastest overall, however, nor is it the best for tight memory situations. But it’s a happy medium, all else being equal. Not that it ever is.

Never enough

These aren’t the only sorting algorithms around. They’re merely the easiest to explain. Heapsort is another good one that gets used a lot. Radix sort works best with a lot of data that can be easily indexed. And there are many more than that. There’s even a site that has more details and graphical visualizations of many of the algorithms, including those we’ve discussed in this post.

Of course, it’s very likely that you’ll never need to implement a sorting algorithm yourself. Almost every language already has dozens of them implemented either in the standard library or in open 3rd-party code. Most are even fine-tuned for you. But this whole series is about peeling back those layers to see how things work on the inside. Calling array.sort() is easy, but there are times when it might not be the best option. Learning the algorithms—how they work, their advantages and disadvantages—can give you the knowledge to find those spots when the “standard” sort is a bad idea. That’s a rare case, but finding it just once pays for itself.

Next up, we’ll look at the other side of sorting: searching. With Google and “Big Data”, searching has become more important than ever, so stay tuned.

One thought on “Software internals: Sorting”

Leave a Reply

Your email address will not be published. Required fields are marked *