Software internals: Strings

A string, as just about any programmer knows, is a bit of text, a sequence of characters. Most languages have some built-in notion of strings, usually as a fundamental data type on par with integers. A few older programming languages, including C, don’t have a separate “string” type, but they still have strings. Even many assemblers allow you to define strings in your assembly language code, though you’re left to deal with them yourself.

The early string

At its heart, a string really isn’t much more than a bunch of characters. It’s a sequence, like an array. Indeed, that’s one way of “making” strings: stuff some characters into an array that’s big enough to hold them. Very old code often did exactly that, especially with strings whose contents were known ahead of time. And there are plenty of places in modern C code where text is read into a buffer (nothing more than an array) before it is turned into a string. (This is a classic source of buffer overflows, but that’s not the point.)
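
To make that concrete, here’s a minimal C sketch of the idea: characters stuffed into a plain array by hand, which only becomes something the standard library can treat as a string once the terminator goes in. (The buffer size is just illustrative.)

    #include <stdio.h>

    int main(void) {
        char buf[4];          /* a raw array, nothing string-like about it yet */
        buf[0] = 'H';
        buf[1] = 'i';
        buf[2] = '!';
        buf[3] = '\0';        /* now it's a string C functions can use */
        printf("%s\n", buf);  /* prints Hi! */
        return 0;
    }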

Once you actually need to start working with strings, you’ll want something better. Historically, there were two main schools of thought on a “better” way of representing strings. Pascal went with a “length-prefixed” data structure, where an integer representing the number of characters in the string was followed by the contents. For example, "Hi!" as a Pascal string might be laid out in memory as the hexadecimal 03 48 69 21. Of course, this necessarily limits the length of a string to 255, the highest possible value of a byte. We could make the length field 16 bits (03 00 48 69 21 on a little-endian x86 system), bringing that to 65535, but at the cost of making every string a byte longer. Today, in the era of terabyte disks and gigs of memory, that’s a fair trade; not so in older times.
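
A Pascal-style string is easy to model in C, by the way. The struct and function names below are invented for this sketch, not taken from any real library:

    #include <stdio.h>
    #include <string.h>

    /* Length-prefixed, Pascal-style: one byte of length, then the text.
       The 8-bit length field is what caps the string at 255 characters. */
    struct pstring {
        unsigned char len;
        char data[255];
    };

    static void pstring_set(struct pstring *p, const char *text) {
        size_t n = strlen(text);
        if (n > 255)
            n = 255;                    /* truncate: the length byte can't go higher */
        p->len = (unsigned char)n;
        memcpy(p->data, text, n);
    }

    int main(void) {
        struct pstring s;
        pstring_set(&s, "Hi!");
        /* The first bytes of s are 03 48 69 21, exactly as above. */
        printf("length %u: %.*s\n", (unsigned)s.len, (int)s.len, s.data);
        return 0;
    }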

But Pascal was intended more for education and computer science than for run-of-the-mill software development. On the other side of the fence, C took a different approach: the null-terminated string. C’s strings aren’t their own type, but an array of characters ending with a null (00) byte. Thus, our example in C becomes 48 69 21 00.
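
You can watch that terminator directly. This tiny program dumps the bytes of a string literal, null included:

    #include <stdio.h>

    int main(void) {
        const char *s = "Hi!";   /* the compiler appends the 00 byte */
        for (int i = 0; ; i++) {
            printf("%02X ", (unsigned char)s[i]);
            if (s[i] == '\0')
                break;            /* stop once we've printed the terminator */
        }
        printf("\n");             /* output: 48 69 21 00 */
        return 0;
    }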

Which style of string is better is still debated today, although modern languages typically don’t use a pure form of either. Pascal strings have the advantage of easily finding the length (it’s right there!), while C’s strlen has to count characters one by one. C strings also can’t contain embedded null bytes, because every standard function assumes the null appears only at the end. On the other hand, a few algorithms are simpler with null-terminated strings, they can be as long as you like, and you never pay for a length you don’t need.
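
That counting is worth seeing. Stripped of real-world optimizations, strlen boils down to a linear scan, sketched here as my_strlen to avoid clashing with the standard library:

    #include <stddef.h>

    /* Walk the array until the null byte turns up; the distance
       traveled is the length. Real implementations vectorize this,
       but the cost is still proportional to the string's length. */
    size_t my_strlen(const char *s) {
        const char *p = s;
        while (*p != '\0')
            p++;
        return (size_t)(p - s);
    }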

In modern times

In today’s languages, the exact format of a string doesn’t matter. What you see as the programmer is the interface. Most of the time, that interface looks much like an array’s, with a few added functions for comparison and the like. In something like C#, you can’t really make your own string type, nor would you want to. But it’s helpful to know just how these things are implemented, so you’ll know their strengths and weaknesses.

Since everything ultimately has to communicate with something written in C, there’s probably a conversion to a C-style string somewhere in the bowels of any language. That doesn’t mean it’s what the language works with, though. A Pascal-like data structure is perfectly usable internally, and it’s possible to use a “hybrid” approach.
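
Here’s one plausible hybrid, sketched with invented names: an explicit length for fast queries, plus a null-terminated buffer that can be handed straight to C APIs. (Error handling is minimal; it’s a sketch, not a library.)

    #include <stdlib.h>
    #include <string.h>

    struct hstring {
        size_t len;    /* Pascal-style: the length is right there */
        char  *data;   /* C-style: len bytes of text plus a trailing '\0' */
    };

    static int hstring_init(struct hstring *h, const char *text) {
        h->len = strlen(text);
        h->data = malloc(h->len + 1);
        if (h->data == NULL)
            return -1;
        memcpy(h->data, text, h->len + 1);  /* copy the terminator too */
        return 0;
    }

    static void hstring_free(struct hstring *h) {
        free(h->data);
        h->data = NULL;
        h->len = 0;
    }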

Small strings are a little special, too. As computers have gotten more powerful, and their buses and registers have grown wider, strings of a few characters can now be loaded in a single memory access. Some string libraries use this to their advantage, keeping a “small” string in an internal buffer. As long as the text fits in the space a pointer would occupy (8 bytes on a 64-bit system), storing it inline costs nothing extra; once it outgrows that, dynamic memory becomes the better deal, space-wise. (Cache concerns can push the threshold of this “small string optimization” up a bit.)
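
A rough sketch of the trick, with illustrative field names and an illustrative threshold: a union lets short strings live inside the object itself, in the very bytes a heap pointer would otherwise occupy.

    #include <string.h>

    struct sso_string {
        size_t len;
        union {
            char  small[16];  /* inline buffer for short strings */
            char *heap;       /* used once the text outgrows the buffer */
        } u;
    };

    static int sso_is_small(const struct sso_string *s) {
        return s->len < sizeof s->u.small;   /* room for text plus '\0' */
    }

    static const char *sso_cstr(const struct sso_string *s) {
        return sso_is_small(s) ? s->u.small : s->u.heap;
    }

Real libraries pack the small/large discriminator more cleverly, but the shape is the same.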

There are also a few algorithms and optimizations that string libraries can use internally to speed things up. “Copy-on-write” means just that: a new copy of a string isn’t created until there’s a change. Otherwise, two variables can point to the same memory location. The string’s contents are the same, so why bother taking up space with exact copies? This also works for “static” strings whose text is fixed; Java, for one, aggressively interns string literals to eliminate duplicates.
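
A bare-bones version of copy-on-write might look like the following. The names are invented, and a real implementation would also need thread-safe reference counts and error handling, both ignored here:

    #include <stdlib.h>
    #include <string.h>

    struct cow_buf {
        int  refs;       /* how many strings share this buffer */
        char data[];     /* the text itself, null-terminated */
    };

    struct cow_string {
        struct cow_buf *buf;
    };

    static void cow_init(struct cow_string *s, const char *text) {
        size_t n = strlen(text) + 1;
        s->buf = malloc(sizeof(struct cow_buf) + n);  /* NULL check omitted */
        s->buf->refs = 1;
        memcpy(s->buf->data, text, n);
    }

    /* "Copying" is just bumping the reference count. */
    static void cow_copy(struct cow_string *dst, struct cow_string *src) {
        dst->buf = src->buf;
        dst->buf->refs++;
    }

    /* The "write" half: split off a private copy only when shared. */
    static void cow_make_unique(struct cow_string *s) {
        if (s->buf->refs > 1) {
            s->buf->refs--;
            struct cow_string fresh;
            cow_init(&fresh, s->buf->data);
            s->buf = fresh.buf;
        }
    }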

UTF?

Nowadays, there’s a big problem with treating strings as nothing more than an array of characters. That problem is Unicode. Of course, Unicode is a necessary evil, and it’s a whole lot better than the mess of mutually incompatible solutions for international text that we used to have. (“Used to”? Ha!) But Unicode makes string handling dramatically harder, particularly for C-style strings, because it breaks a fundamental assumption: one byte equals one character.

Since the world’s scripts together have far more than 256 characters (the most a byte can distinguish), we have to do something, and there are two main options. One is a fixed-size encoding, where each character (or code point) takes the same amount of space. Basically, it’s ASCII extended to more bits per character. UTF-32 does this, at the huge expense of making every code point 4 bytes. Under this scheme, any plain ASCII string is inflated to four times its original size.
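
The cost is easy to demonstrate: in this little program, three ASCII characters occupy twelve bytes once each code point is widened to 32 bits.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* "Hi!" in UTF-32: one 32-bit unit per code point. */
        uint32_t s[] = { 0x48, 0x69, 0x21 };               /* H, i, ! */
        printf("%zu bytes for 3 characters\n", sizeof s);  /* 12 bytes */
        return 0;
    }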

The alternative is a variable-length encoding, as in UTF-8. Here, part of the “space” in the storage unit (a byte for UTF-8, 2 bytes for UTF-16) is reserved to mark a “continuation”. For example, the character ë has the Unicode code point U+00EB. In UTF-8, that becomes C3 AB. The simple fact of the first byte being greater than 7F (decimal 127) marks this as a non-ASCII character, and its upper bits tell us how many continuation bytes follow. In UTF-32, by contrast, ë comes out as 000000EB, twice as big.
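
To see those bits at work, here’s a minimal encoder covering only the one- and two-byte cases (enough for ë); it’s a sketch, not a full UTF-8 implementation:

    #include <stdio.h>
    #include <stdint.h>

    /* Encode a code point up to U+07FF. Returns the byte count. */
    static int utf8_encode(uint32_t cp, unsigned char out[2]) {
        if (cp < 0x80) {                /* plain ASCII: one byte, as-is */
            out[0] = (unsigned char)cp;
            return 1;
        }
        /* Two-byte form: 110xxxxx 10xxxxxx. Five payload bits go in
           the lead byte, six in the continuation byte. */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }

    int main(void) {
        unsigned char buf[2];
        int n = utf8_encode(0x00EB, buf);   /* U+00EB, the letter ë */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);        /* prints C3 AB */
        printf("\n");
        return 0;
    }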

The rules for handling Unicode strings are complex and unintuitive. Once you add in combining diacritics, the variety of spaces, and all the other esoterica, Unicode becomes far harder than you can imagine. And users of high-level, strings-are-black-boxes languages aren’t immune. JavaScript, for instance, exposes strings as UCS-2, a 16-bit fixed-width encoding. Until very recently, if you wanted to work with “high plane” characters, emoji included, you had some tough times ahead. So there’s still the possibility, in 2016, that you might need to know the internals of how strings work.
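
The wrinkle that bit JavaScript is easy to show. In UTF-16, a “high plane” code point is split across two 16-bit “surrogate” units, which a UCS-2 view sees as two separate characters; this sketch does the arithmetic for one emoji:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t cp = 0x1F600;                            /* an emoji, U+1F600 */
        uint32_t v  = cp - 0x10000;                       /* offset past the BMP */
        uint16_t hi = (uint16_t)(0xD800 | (v >> 10));     /* high surrogate */
        uint16_t lo = (uint16_t)(0xDC00 | (v & 0x3FF));   /* low surrogate */
        printf("U+%04X -> %04X %04X\n",
               (unsigned)cp, (unsigned)hi, (unsigned)lo); /* D83D DE00 */
        return 0;
    }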
