Software internals: Searching

Big Data is all the rage today, but what good is all that data if you can’t find what you’re looking for? Google, among others, has made billions of dollars by using and developing search algorithms, the backbone of analytics. While I couldn’t even begin to show you how their algorithms work (nor would you likely ever need to know), we can look at their precursors, the simple searches common to many programming languages.

Brute force

The simplest way to search for an item in a list (a value in an array, a substring in a larger body of text, or whatever) is to loop through the list and check each value to see if it’s the one you want. This is a linear search, and it’s not hard to implement, even in JavaScript:

function linear_search(array, value) {
    for (var i = 0; i < array.length; i++) {
        if (array[i] == value) {
            return i;
        }
    }

    return -1;
}

There’s more than one way to do it, but that’s the most straightforward. It’s also the “classic” form familiar to programmers of C and related languages. But it comes at a price: execution time. Linear searches are O(n); the time they take is dependent on the number of items you’re searching over. (Statistically speaking, a linear search of a list of size N will, on average, take N/2 iterations.) Surely we can do better, right?

Divide and conquer

If our list is sorted, we can take an approach that you might recognize from “higher or lower” number guessing games. The binary search starts at a point somewhere in the middle of a list and checks to see if it’s at the right spot. If it’s not, then—because the list is sorted—it knows which way to go for the next guess. It looks a bit like this:

function binary_search(array, value) {
    var idmin = 0;
    var idmax = array.length - 1;

    while (idmin < idmax) {
        // Note: older versions of JS use Math.floor() here
        var midpoint = idmin + Math.trunc((idmax - idmin) / 2);

        if (array[midpoint] == value) {
            return midpoint;
        }
        else if (array[midpoint] < value) {
            idmin = midpoint + 1;
        }
        else {
            idmax = midpoint - 1;
        }
    }

    return -1;
}

Again, there are other ways of writing this function, but the one shown here works for our purposes.

The binary search requires a sorted array. That’s how it’s able to deduce where in the list, relative to its pointer, the correct value is. If you ignore the sorting step (probably because you were going to do it anyway), then it comes out to O(log n) time. That’s faster than a simple linear search, and good enough to put a version of the algorithm into the standard libraries of C (bsearch()), C++ (std::binary_search()), and Java (Arrays.binarySearch()), among others.

Substrings

Searching strings for a specific substring is its own problem with its own considerations. If the substring you’re looking for is only a single character, then the whole exercise degenerates into a case of the above, and you can use linear search (as C’s strchr() likely does behind the scenes). But searching for multi-character strings is a bit harder. A naive approach might look like this:

function substring_search(text, ss) {
    var initial = ss[0];
    var found = false;

    for (var i = 0; i < text.length - (ss.length - 1); i++) {
        if (text[i] == initial) {
            found = true;
            for (var j = 1; j < ss.length, found == true; j++) {
                if (text[i+j] != ss[j]) {
                    found = false;
                }
            }

            if (found) {
                return i;
            }
        }
    }

    return -1;
}

Basically, this is a linear search for the first character of the search string, but with a twist. Whenever that first character is found, an inner loop runs, checking the following characters in the text to see if they match as well. If they do, then we’ve got a perfect match. Otherwise, we keep going from where we left off.

This “dumb” approach takes O(n+m) time, but it’s hard to improve on it without creating some more complicated data structures. Fortunately, that’s exactly where the series is going next, so stay tuned.

Leave a Reply

Your email address will not be published. Required fields are marked *