Assembly: the standard machine

This crazy thing we call the PC has been around for over 30 years now, and it’s quite far from its original roots as IBM’s business-oriented machine. The numbers involved are mind-boggling in their differences. The first PC ran at 4.77 MHz, while today’s can top 4 GHz. (And, thanks to parallelization, cache, and all those other advances we’ve seen, that adds up to far more than a 1000x increase in computing power.) Intel’s 8086 processor could address 1 MB of memory; modern computers have multiple gigabytes inside them. Back then, the primary medium for storage was the floppy disk, and the PC’s disks weighed in at a hefty 360 KB; nowadays, you can find USB flash drives capable of holding more than a million times that.

And people still managed to get things done with machines that are easily outdone by our watches, our refrigerators, our thermostats. A lot of people today can’t even comprehend that, including many—I know quite a few—who used to get by without computers at all! But histories of the PC are everywhere, and we have a specific reason for our trek down memory lane. So let’s take a look at these ancient contraptions from the assembly programmer’s point of view.

The ideal PC

Since we’re using an emulator for our programming, we don’t need to go into a lot of hardware specifics. It doesn’t particularly matter for the purposes of this series that the 8088 had only an 8-bit bus, or that the 486SX lacked a built-in floating-point unit. (I had one of those on my first PC. It sucked.) From our level, those details are mostly irrelevant. What does matter is, for lack of a better term, the hardware interface.

For this series, I’m assuming an “ideal” PC loosely based on real ones. It’s not an original PC, nor is it an XT, AT, or something later. It’s really PC compatible, but nothing more. It roughly conforms to the specs, though some things might be different. That’s intentional, and it’s for ease of explanation. (It also means I can reuse knowledge from all those old DOS tutorials.)

Our idealized machine will use a 286-level processor running in real mode. That lets us focus on a simplified architecture (no protected mode, no 32-bit stuff) that isn’t too far off from what people were using in the late 80s. It does mean limitations, but we’ll work around them.

We’ll also assume that we have access to VGA graphics. That was not the case for the early 286-based PCs; they had to make do with EGA and its 16 colors, at best. Again, this makes things easier, and you’re free to look up the “real thing” if you wish.

Finally, we’re using FreeDOS in the v86 emulator until further notice. It’s not exactly Microsoft’s old DOS from the days of yore, but it’s close enough for this series. Basically, assume it has the features a DOS is supposed to have.

Now that that’s out of the way, let’s see what we have to work with.

Memory

We effectively have a 286 processor, which we looked at last week. Since we’re ignoring protected mode entirely, that means we have access to a total of 1 MB of memory. (Note that this is a real megabyte, not the million bytes that hard drive manufacturers would have you believe.) That memory, however, is divided up into various portions.

  • The first 1 KB is reserved by the CPU as the interrupt vector table. The x86 allows 256 different interrupts, whether caused by hardware or software. This table, then, holds 256 4-byte addresses, each one a pointer to an interrupt handler. For example, the core DOS interrupt, 21h, causes execution to jump to the interrupt routine whose address is stored at 0x21 * 4 = 0x84, or 0000:0084.

  • The next 256 bytes, starting at 0040:0000, are the BIOS data area. (I’ll explain the BIOS in a bit.) Much of this space has special meaning for the BIOS, so it’s a bit like the zero page on most 6502 computers.

  • DOS, the BIOS, and the PC all use bits of the next 256 bytes, starting at 0050:0000.

  • DOS effectively begins loading at 0060:0000, though how much memory it uses from here depends on a wide variety of factors.

Whatever is left after this is available for program use, up to the PC’s RAM limit. Everything else in the address space, starting at a000:0000, is given to the various types of adapters supported by the architecture. Even on later systems with expanded memory capabilities, this conventional memory limit of 640 KB remained. Yes, the original PC was limited to 640K of RAM—leading to the famous quote Bill Gates may or may not have uttered—but this wasn’t as stringent as it seems; the first IBM models shipped with 256K at most, and even that was four times what the 6502 could understand.

The remaining 384 KB of the old PC’s range was for the display, the BIOS, and any expansions that may have been around:

  • A 64K block starting at a000:0000 is the video buffer for color graphics modes, including the VGA-alike our idealized PC uses. (Modern graphics cards use high-speed direct memory accesses, but this area is still there for “fallback” mode.)

  • The next 32K, from b000:0000, is used by monochrome video adapters for text. Some systems that neither had nor supported monochrome used this space as extra memory.

  • Another 32K block, starting with b800:0000 or b000:8000 (they’re the same thing), is a very big deal. It’s where the color text-mode memory lies, so manipulating this area will directly affect the contents of the screen. It’s a linear array of words: the low byte holds the character, the high byte its foreground and background colors. We haven’t seen the last of this space. (There’s a sketch of the layout just after this list.)

  • Starting at c000:0000, things get murkier. Usually, the 32K starting here is where the video BIOS lives. After that is the hard disk BIOS, if there is any. Everything from d000:0000 to f000:0000 was intended for expansion; add-ons were expected to set themselves up at a particular place and define how much memory they were going to use. Of course, conflicts were all too easy.

  • The last big block of memory begins at f000:0000. It’s meant for the BIOS, though the original one only used the last 8K (f000:e000 and up). By the way, the processor starts at f000:fff0, which is why the first BIOS went at the end of the block.

  • Finally, the last segment, ffff:xxxx, wraps around to zero on an 8086, but it can access a little bit of higher memory on the 286 and later. That’s the high memory area that DOS users learned so much about back in the day. It won’t really concern us much, but it’s good to know it exists, especially if you’re reading older literature that refers to it.
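Since that color text buffer will matter so much later, here’s a rough sketch of its layout in C-style code. This is just the arithmetic, not something you can run against real PC memory from a modern OS, and the 80-column stride assumes the standard 80×25 text mode; the function names are mine.

#include <cstdint>

// The word for one text-mode cell: the low byte holds the character,
// the high byte the attribute (foreground and background colors).
uint16_t make_cell(char ch, uint8_t attribute)
{
    return (static_cast<uint16_t>(attribute) << 8) | static_cast<uint8_t>(ch);
}

// Byte offset of a cell within the b800:0000 buffer: 2 bytes per cell,
// 80 cells per row in the standard text mode.
uint32_t cell_offset(int row, int column)
{
    return (row * 80 + column) * 2;
}

// On the real machine, storing make_cell('A', 0x1f) at physical address
// 0xb8000 + cell_offset(0, 0) would put a white-on-blue 'A' in the
// top-left corner of the screen.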

The BIOS

BIOS (Basic Input/Output System) is the name given to the PC’s internal code, as well as the programmer-accessible interfaces that code provides. The BIOS is mainly responsible for taking a bare x86 processor and turning it into a usable system; in that, it’s like the system ROMs of older computers. Different BIOSes over the years have offered numerous features—some modern firmware can do nearly everything an operating system can, and Coreboot can even boot a Linux kernel directly as its payload—but they all still contain the same basic functions that you had 30 years ago.

Rather than using a vector table, like that of a Commodore 64, the PC BIOS provides access to its API through interrupts. By placing certain values in registers before invoking the interrupt, you can select different functions and pass arguments to them. Return values are likewise placed in registers, sometimes the same ones. As an example, BIOS interrupt 10h controls the video system, and one of its available functions is setting the cursor position. To do that, we need to load the number 2 into AH, among other things, as in this snippet:

curs_set:
    mov ah, 2   ; function number
    mov bh, 0   ; "page number": 0 is the "main" page

    ; these would probably be previously defined
    mov dh, row
    mov dl, column

    int 10h

We’ll be seeing the BIOS later on, but we definitely won’t meet more than a fraction of its interface.

Peripherals and I/O

The PC was designed with peripherals in mind. Floppy disks, hard disks, printers, and many other things came to be attached to these computers. Usually, they came with drivers, and the BIOS could talk to some of them, but programmers occasionally needed to access them directly. Nor were these methods mutually exclusive. After all, it was entirely common to bypass the BIOS and write directly to video memory, because that was far faster.

Under our assumptions, we’ve got a VGA graphics adapter. We’ve got our keyboard and mouse, which the emulator…well, emulates. The FreeDOS image we’ll be using includes one floppy disk, but that’s it. Other than that, we have the PC speaker and a few other odds and ends. Not much, but plenty to explore. But that’s for later.

The operating system

Today, we use Windows or Linux or OSX or whatever, and we only really think about the operating system when it won’t work. But almost every home PC of the late 80s ran DOS first, and that’s what people learned. FreeDOS is like an extension of that, an upgraded version that can run on modern machinery. It uses the same interfaces for “legacy” code, however, and that’s what we want. And v86 includes a preset FreeDOS disk image for us, so that’s a bonus.

DOS, as derided as it is today, actually packed a lot of functionality into its tiny innards. Sure, it didn’t have true multitasking, and you needed hacks to use more than 1 MB of memory. Some of its internal functions were slow enough that it was better to write to the hardware directly. And it even left some things up to the BIOS. But it was far better than nothing.

I could make this a “write a 16-bit OS” series, but I won’t. Because I don’t have to. That’s one of the few good things about DOS: it’s so simple that you can almost ignore it, but you can still use it when it’s what you need. So that’s what I’ll be doing. I mean, it worked for fifteen years, right?

And so on

There’s lots more I could say about the PC platform, whether the real thing or our nostalgic remembrance. But this post is already getting too long. Everything else can come up when it comes up, because now we have to move on to the whole reason for this series: the x86 assembly language. Next time, you’ll see what it looked like (not what it looks like now, though, because it’s totally different). Maybe we can even figure out why so many people hated it with such passion. Oh, and we’ll have some actual code, too.

Assembly: the architecture

The x86 gets a lot of hate for its architecture. These days, it’s really more out of habit than any technical reason, but the 16-bit variant really did deserve some of its bad reputation. Other parts, in my opinion, were overstated. Hyperbole then became accepted wisdom, and repetition only made it grow.

But I’m not here to apologize for the x86. No, this post has a different purpose. It’s an overview of the x86 architecture from the point of view of an assembly programmer circa 1991. Although the 386 had been out for a few years, older processors were still around, and not too much was making full use of the 32-bit extensions that the 386 brought. DOS, particularly, remained 16-bit, though there were extensions to address more memory. For the most part, however, we’ll stick to “real mode”, like it was in the olden days.

The CPU

An x86 processor of this vintage was a lot less complex than today’s Intel i7 or AMD A10, and not just because it didn’t include integrated graphics, megabytes of on-die cache, power management functions, and so on. Later lines have also added lots of assembly-level goodness, like MMX and SSE. They’re 64-bit now, and that required the addition of “long mode”.

But let’s ignore all that and look at the “core” of the x86. There’s not that much to it, really. In the original conception, you have about a dozen programmer-accessible registers, all of which started out 16 bits wide, but not one of them is truly general-purpose. The registers can be divided into a few categories, and we’ll take each of them in turn.

General registers

These are the “big four” of the x86: AX, BX, CX, and DX. As I said last time, all four can also be accessed as a pair of byte-sized registers. The high bytes are identified by the first letter of the register name followed by H, while the low bytes use L. So we have AL or BH or whatever. It doesn’t actually increase the number of registers we have, but sometimes the savings from loading only a single byte can add up. Remember, older computers had less memory, so they had to use it more wisely.

Each of the four 16-bit general registers is used in its own special way. AX is the accumulator (like the A register from the 6502), and it’s usually the best for arithmetic; some instructions, like the multiplication instruction MUL, require it. BX is used as a “base” for a few addressing-type instructions. CX is semi-reserved as a loop counter. And DX is sometimes taken as an “extension” of AX, creating a kind of 32-bit register referred to as DX:AX.

Of course, if you’re not using the instructions that work on specific registers, you can do what you like with these. Unlike the 6502, where almost everything involved a memory access, x86 does let you work register-to-register. (On the other hand, it doesn’t have cheap access to the zero page, so there.) You can add AX to BX, for instance, and no one will care.

Pointer registers

The other four main registers all have something to do with pointers and addressing. You can use them as scratch space for arithmetic, but a lot of instructions assume they hold addresses. Unlike the general registers, all these are only 16-bit. (Modern systems do give you special access to the low byte of them, however.)

SP is the big one out of this group: the stack pointer. Stacks are a lot more important on the x86 than the 6502, mainly because that’s where you put your “extra” data that won’t fit in registers. But programmers usually don’t manipulate SP directly. They instead pop and push (note the terminology change from 6502), and those instructions change SP as needed. BP is an extra pointer register mostly used by languages like C to access stack “frames”, but assembly programmers can turn it into a general pointer.

The other two come in a pair: SI and DI. These stand for “source index” and “destination index”, respectively, and the processor uses them for certain load and store operations. Quite a few of the DOS and BIOS APIs expect them to hold pointers to input and output parameters. And on an early x86, they were the best option for indirect addressing, a bit like the 6502’s X and Y registers.

System registers

The instruction pointer, IP, controls the execution of code. It’s not directly accessible by programmers; instead, you change it through branching (jumping, in x86 parlance) and subroutine calls. In other words, you can mostly act like it’s not there.

The register that holds the flags, usually called FLAGS when it needs a name, also can’t be directly read or written. You can push it onto the stack, however, and manipulate it from there; in addition, the main three flags (carry, direction, and interrupt) have special instructions to set and clear them, similar to the 6502.

While the x86 has quite a few more flags than the 6502, most of them aren’t too important unless you’re delving deep into an operating system’s internals. The main ones to know about are the carry, zero, sign, direction, overflow, and interrupt flags. Most of them should be self-explanatory, while “overflow” works in a similar fashion to its 6502 counterpart. The direction flag is mostly used for string-handling instructions, which we’ll see in a later post.

One more register deserves a brief mention here. On the 286, it’s called the MSW, or “machine status word”. After that, it gets the official designation CR0. It’s used to control internal aspects of the processor, such as switching between real and protected modes or emulating a floating-point unit. I can’t think of a case where this series would use it, but now you know it’s there.

Segment registers

And then we come to the bane of many an assembly programmer, at least those of a generation or two ago: the segment registers. We’ll look at the x86 memory model in a moment; for now, just think of segments as something like overlapping banks of memory.

We’ve got four segment registers, all 16 bits wide even in 32-bit mode, for the code, data, stack, and extra segments. Their mnemonic names, conveniently enough, are initialisms: CS, DS, SS, and ES. CS points to the segment where execution is occurring; you can’t change it except with a “far” call, but you can read from it. SS holds the segment address of the stack, but you probably figured that one out already. DS is the default for reading and writing memory, while ES, as its name suggests, is for whatever you like.

Segment registers are weird. You can move values to and from them (except into CS, as I said), but you can’t operate on them. What you can do, however, is use them to “override” an address. For example, loading a value from memory uses DS as its base, but you can make it use ES instead: mov ax, [es:di] loads the value pointed to by DI, but in the ES segment.

Memory model

And that leads us to the x86 memory model. It’s a bit convoluted, since the original 8086 was designed as a 16-bit system that could address 1 MB of memory. Something had to give, but Intel took a…nonstandard approach.

Every address on the x86 has two parts: segment and offset. (This is true even on today’s processors, but 64-bit mode is hardwired to treat all segments as starting at address 0.) In real mode, as with an older x86 running DOS, an actual memory address can be obtained by shifting the segment 4 bits to the left and adding the offset. Or, to put it in code: address = (segment << 4) + offset. Each segment, then, can address a 64K block of memory in a fashion much like banking in the 8-bit world.

The difference between one segment and the next is only 16 bytes, thanks to the 4-bit shift. That means that segments will overlap. The addresses b000:8123 and b800:0123, for example, refer to the same memory location: 0xb8123. In practice, this doesn’t matter too much; the segment portion is mostly used as a base address, while the offset is, well, an offset.

In protected mode, throw those last paragraphs out. Segments instead are indexes into a lookup table, creating a virtual memory system that essentially went unused until its mercy-killing by AMD when they brought out the x86-64. (The segment-based virtual memory scheme of protected mode, interesting though it seemed, was basically an exercise in the knapsack problem.) We won’t be worrying about protected mode much, though, so let’s move on.

Back to real mode, the world of DOS and early x86s. A little bit of arithmetic shows that the segment:offset addressing method allows access to 1 MB of memory, more or less. 0000:0000 is, of course, address 0, and it’s the lowest possible value. The highest possible is ffff:ffff, and that presents a problem. Add it up, and you get 0x10ffef. On old systems, this simply wrapped around the 1 MB “barrier”, to 0x0ffef. (Once memory sizes expanded to multiple megabytes, it no longer overflowed, but some programs relied on that behavior, so a hardware hack was put into place. It’s called the A20 gate, and it was originally put in the keyboard controller, of all places. But I digress.)
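Here’s that translation worked out as a minimal C-style sketch (the function name is mine); the asserts check the overlap pair from the last section and the wraparound case:

#include <cassert>
#include <cstdint>

// Real-mode address translation: shift the segment, add the offset.
uint32_t seg_to_phys(uint16_t segment, uint16_t offset)
{
    return (static_cast<uint32_t>(segment) << 4) + offset;
}

int main()
{
    // Overlapping segments: two names for the same byte.
    assert(seg_to_phys(0xb000, 0x8123) == 0xb8123);
    assert(seg_to_phys(0xb800, 0x0123) == 0xb8123);

    // The very top of segment ffff spills past the 1 MB mark...
    assert(seg_to_phys(0xffff, 0xffff) == 0x10ffef);

    // ...which an 8086, having only 20 address lines, wraps around.
    assert((seg_to_phys(0xffff, 0xffff) & 0xfffff) == 0x0ffef);
}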

Input/output

Also available to the x86 are the I/O ports. These are accessed using the IN and OUT instructions, a byte, word, or (in 32-bit mode) double-word at a time. They function like their own little address space, separate from main memory. The x86 architecture itself doesn’t really define which ports do what. That’s left to the PC platform—which will be the subject of the next post.

Modern operating systems also allow memory-mapped I/O access, but we’ll leave that alone for the time being. It’s far more useful when you go beyond the bounds of the 16-bit world.

Interrupts

Like the 6502, the x86 has support for interrupting the processor’s normal flow, but it goes about it in a different way. An interrupt can be caused by hardware; the old term IRQ, “interrupt request”, referred to this. But software can also directly invoke interrupts. As we saw last week, that’s how DOS implemented its API.

In real mode, the result is the same either way. The processor jumps to a specific memory location and executes a bit of code, then returns to what it was doing like nothing happened. We’ll see the details later on, but that’s the gist of it.

To be continued

So I’ve rambled on long enough for this post. Next up will be a look at the PC platform itself, at least as it stood a quarter century ago. That’ll be a look into deep history for many, but the choices made then still affect us today. Following that will be the dive into the deep end of “old-school” x86 assembly.

Assembly: the precursor

Before Windows was a thing, there was DOS. Before today’s 64-bit processors, or even their 32-bit predecessors, there was the 16-bit world. In a way, 16-bit DOS is a bridge between the olden days of 8-bit microprocessors like the 6502 and the modern behemoths of computing, like the A10 in the PC I’m writing this on, or the Core i5 in the next room.

A while back, I started writing a short series of posts about assembly language. I chose the 6502, and one of the reasons why was the availability of an online assembler and emulator. Very recently, a new challenger has come into this field, Virtual x86. Even better, it comes with a ready-to-go FreeDOS image with the NASM assembler pre-installed. Perfect.

So, I’m thinking about reviving the assembly language series, moving to the (slightly) more modern architecture of 16-bit x86 and DOS. It’s a little more relevant than the 6502, as well as much more forgiving, but the fundamentals of assembly are still there: speed, size, power. And, since 16-bit code doesn’t run at all on 64-bit x86 CPUs, we don’t have to worry as much about bad habits carrying over. The use of DOS (FreeDOS, specifically) helps, too, since essentially nothing uses it these days. Thus, we can focus on the code in the abstract, rather than getting bogged down in platform-specific details, as we’d have to do if we looked at “modern” x86.

A free sample

Assembly in this old-school fashion is fairly simple, though not quite as easy as the venerable 6502. Later on, we can delve into the intricacies of addressing modes and segment overrides and whatnot. For now, we’ll look at the 16-bit, real-mode, DOS version of everyone’s first program.

org 100h                ; COM programs load at offset 100h in their segment

    mov dx, data        ; DX: address of our string
    mov ah, 09h         ; function 09h: print string
    int 21h             ; call DOS
    mov ah, 04ch        ; function 4ch: exit program
    int 21h             ; call DOS again

data:
    db 'Hello, World!$' ; DOS strings end with a dollar sign

All this little snippet does is print the classic string to the screen and then exit, but it gives you a good idea of the structure of x86 assembly using what passes for the DOS API. (Oh, by the way, I copied the code from Virtual x86’s FreeDOS image, but I don’t see any other way you could write it.)

Here are the highlights:

  • org 100h defines the program origin. For the simplest DOS programs (COM files), this is always 100h, or 0x100. COM files are limited to a single 64K segment, and the first 256 bytes are reserved for the operating system, to hold command-line arguments and things like that.

  • The 16-bit variety of x86 has an even dozen programmer-accessible registers, but only four of these are anywhere close to general-purpose. These are AX, BX, CX, and DX, and they’re all 16 bits wide. However, you can also use them as byte-sized registers. AH is the high byte of AX, AL the low byte, and so on with BH, BL, CH, CL, DH, and DL. Sometimes, that’s easier than dealing with a full word at a time.

  • mov is the general load/store instruction for x86. It’s very versatile; in fact, it’s Turing-complete. Oh, and some would say it’s backwards: the first argument is the destination, the second the source. That’s just the way x86 does things (unless you’re one of those weirdos using the GNU assembler). You get used to it.

  • int, short for “interrupt”, is a programmatic way of invoking processor interrupts. The x86 architecture allows up to 256 of these, though the first 16 are reserved for the CPU itself, and the next several are taken up by the BIOS. DOS uses a few of its own for its API. Interrupt 0x21 (21h) is the main one.

  • Since there are only 256 possible interrupts and far more useful operations, the OS needs some way of subdividing them. For DOS, that’s what AH is for. A “function code” is stored in that register to specify which API function you’re calling. The other registers hold arguments or pointers to them.

  • Function 0x09 (09h) of interrupt 0x21 writes a string to the console. The string’s address is stored in DX (with some segment trickery we’ll discuss in a later post), and it must end with a dollar sign ($). Why? I don’t know.

  • Function 0x4c (04ch) exits the program. AL can hold a return code. Like on modern operating systems, 0 is “success”, while anything else indicates failure.

  • db isn’t an actual assembly instruction. Much like in 6502-land, it defines a sequence of literal bytes. In this case, that’s a string; the assembler knows how to convert this to an ASCII sequence. (“Where’s Unicode?” you might be wondering. Remember that DOS is halfway to retirement age. Unicode didn’t even exist until DOS was already on its way out.)

Going from here

If you like, you can run the FreeDOS image on Virtual x86. It comes with both source and executable for the above snippet, and the original source file even includes compilation directions. And, of course, you can play around with everything else the site offers. Meanwhile, I’ll be working on the next step in this journey.

Software internals: Lists

Up to now, we’ve been looking at the simplest of data structures. Arrays are perfectly linear. Strings are mostly so, until you add in the complications of modern character sets. Objects need a little bit more help, but they can often be reduced to arrays of data. Not so this time around.

Lists are incredibly important to a lot of programming work. In today’s popular languages, they’ve essentially taken over the fundamental position once occupied by arrays. In Python, Ruby, and so many other languages, it’s the list that is the main sequence. In Lisp and Clojure, it’s the universal one, in the same way that NAND is the universal gate in electronics: everything else can be built from it.

But why are lists so useful? Unlike arrays, they lend themselves to dynamic allocation. Elements packed into an array don’t have anywhere to go. If you want to resize the structure, you often have to move everything around. In C++, for example, a vector (essentially a dynamically-sized array) can have elements added and removed, but that can trigger a full resizing, which can move the vector to a different address in memory, thus rendering all your pointers and iterators useless. Higher-level languages don’t have these problems with arrays, but lists never have them in the first place. They’re made to be changed.

Linked lists

The most common method for making a list is the linked list. At its core, a linked list is nothing but a sequence of nodes. A node is typically defined as a structure with a data member (one piece of the data you’re storing in the list) and a link member. This will usually be a pointer to the next node in the list, or, for the last item, a null pointer. In code, it might look like this:

// It is "item type"
template <typename It>
struct Node
{
    It data;         // one element of the list
    Node<It>* next;  // link to the next node; null marks the end
};

A list, then, could be represented as nothing more than a pointer to the first Node. By “walking” through each next pointer, a function can visit each node in order. And that’s all there really is to it.

Working with this kind of linked list isn’t too hard. Finding its length, a common operation in code, doesn’t take much:

#include <cstddef>  // for size_t

// This is assumed throughout, but we'll make it explicit here
template <typename It>
using List = Node<It>*;

template <typename It>
size_t length(List<It> li)
{
    size_t counter = 0;

    while (li)           // a null pointer marks the end of the list
    {
        ++counter;
        li = li->next;   // step to the next node
    }

    return counter;
}

There are ways to improve on this (it’s O(n), which is why real implementations usually just cache the length), but it should illustrate the basic idea. If you prefer a functional approach, you can do that, too:

// FP version
template <typename It>
size_t length(List<It> li)
{
    if (!li)
        return 0;                     // an empty list has length 0
    else
        return 1 + length(li->next);  // one for this node, plus the rest
}

That one looks a lot better in a proper FP language, but I wanted to stick to a single language for this post.

Inserting new elements is just manipulating pointers, and changing their value can be done by altering the data members. In general, that’s all you need to know. Even higher level languages are largely based on this same foundation, no matter what interface their lists present to the programmer.
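For instance, a minimal sketch of insertion at the front (the function name is mine) is just one pointer assignment wrapped in an allocation:

// Put a new node at the front, returning the new head of the list.
template <typename It>
Node<It>* push_front(Node<It>* head, It value)
{
    return new Node<It>{value, head};  // the new node points at the old head
}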

More links

But this singly-linked list can be a bit cumbersome. Almost every operation involves walking the list from the beginning, and there’s no real way to get to an earlier element. That very need naturally leads to the creation of the doubly-linked list.

Here, each element has two pointers: one to the next element, the other to the previous one. It’s anchored on both sides by null links, and it’s otherwise the same principle as the singly-linked version, with the only downside being a slight increase in memory use. In code, such a structure might look like this one:

template <typename It>
struct DNode
{
    It data;
    DNode<It>* prev;  // link back; null at the front
    DNode<It>* next;  // link forward; null at the end
};

Code that doesn’t care about going “backwards” can ignore the prev pointer, meaning our length function from earlier works with doubly-linked lists, too. (We’d need to change the argument type, of course.) Now, though, we get to reap the rewards of having two links. We no longer need to worry about getting a pointer to the beginning of the list, for one; any pointer to a list element can now be used to find its start.
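As a quick sketch (name mine again), finding the head from an arbitrary node is just a walk in the other direction:

// Follow the prev links back to the start of a doubly-linked list.
template <typename It>
DNode<It>* head_of(DNode<It>* node)
{
    while (node && node->prev)
        node = node->prev;  // keep backing up until there's no previous node

    return node;
}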

Doubly-linked lists are so much more useful than singly-linked ones that they’re really the default in modern times. The C++ standard library had only doubly-linked lists until 2011 brought forward_list, for instance. And high-level languages usually don’t give you a choice. If they use a linked list, it’ll have (at least) two links per node.

Another option

The drawback of linked lists is the time it takes to find anything in them. Most operations are going to require you to walk some, if not all, of the list. The bigger the list gets, the longer it takes to walk it. In other words, it’s O(n) time.

Different systems get around this in different ways. One possibility is to forgo linked lists entirely, instead basing things around an array list. This is nothing more than a dynamically-sized array like the C++ vector or Java ArrayList: it can be used like a list, but it also offers a random-access interface.

Most of the time, an array list will have a reference to its own block of memory, enough to hold all its current elements plus a few extra. When it runs out of that space, it’ll allocate a new, larger buffer, and move everything to that. On the inside, it would look something like:

template <typename It>
struct AList
{
    size_t capacity;      // number of elements currently stored
    size_t max_capacity;  // number of slots in the allocated buffer
    It* data;             // the buffer itself
};

Array lists work a lot better in an object-oriented system, because you can use methods to simplify the interface, but there’s no reason you need them. Here, for example, is a non-OOP access method for our AList above:

template <typename It>
It at(const AList<It>& li, size_t pos)
{
    if (pos >= li.capacity)
        throw std::out_of_range("AList: out of bounds");  // needs <stdexcept>

    return li.data[pos];
}

Inserting is trickier, though, because of the possibility of reallocation:

template <typename It>
void insert(AList<It>& li, It item)  // by reference: we're modifying the list
{
    if (li.capacity == li.max_capacity)
    {
        resize_array(li, li.max_capacity * 2);  // out of room: grow the buffer
    }

    li.data[li.capacity] = item;
    ++li.capacity;
}

Our hypothetical resize_array function would then fetch a new block of memory with double the space, copy all the elements over to it, update max_capacity, and change data to point to the new block. Not hard to do, but non-trivial, and the copy can be time-consuming with large array lists. (Insertion runs in what’s called “amortized” constant time. If you don’t cause a reallocation, an insert is constant-time. If you do, that one insert is linear, because of the copy, but doubling the buffer makes reallocations rare enough that the cost averages out.)
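A minimal sketch of that helper, under the same assumptions as the structs above (and assuming It can be default-constructed and copied):

// Grow the backing buffer to new_max slots, copying the old elements.
template <typename It>
void resize_array(AList<It>& li, size_t new_max)
{
    It* bigger = new It[new_max];   // fetch a bigger block

    for (size_t i = 0; i < li.capacity; ++i)
        bigger[i] = li.data[i];     // copy every element over

    delete[] li.data;               // release the old block
    li.data = bigger;
    li.max_capacity = new_max;
}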

Array lists are probably used more often than linked lists as the backing for high-level languages, and they’re the go-to option even in C++, Java, and C#. That’s because typical code does a lot of inserting at the end of a list. If the system can allocate a big enough buffer at the start, then inserting into an array list is no different, internally, from writing into an empty slot of a regular array, and deleting from the end is easier still.

But the linked list approach will come in handy later on, when we look at trees, and it has a few key advantages of its own. As always, it’s a balance. If you need random access to elements, array lists are better. If you’re doing a lot of inserting at the ends of the structure, and not all at once, linked lists start to become a more attractive option. And if you’re using a high-level language, you’ll use whatever is available. It still helps to know what you’re getting into, though. Knowing how the different types of list work can help you plan your own code’s structure better, even if you never have the choice of which kind to use.

Languages I love

A while back, I talked about the four programming languages I hate. Now, you get the contrast: the languages I love. Just by comparing this list to the last one, you’ll probably get a good idea of how I think, but I don’t mind. It’s my opinion. No matter how much justification I give, it’ll never be anything more. Of course, that also means that I can’t possibly be wrong, so don’t bother telling me that I am.

C++

C++ is almost universally reviled, it seems. I’m in the minority on this one, but I’m not ashamed. C++, especially the “modern” C++ that we got in 2011, is one of my favorite languages. It’s fast (running, not compiling), versatile, and ubiquitous. It really is the everything language. To some, that’s its biggest flaw, but I find it to be its greatest strength.

I don’t like languages that force me to think a certain way. That’s part of the grudge I have against Java, and it’s most of the reason I put Haskell on the “hate” list. I’d rather have a language smart enough to say, “Hey, you know what you’re doing. Use this if you need it, but if you don’t want to, that’s cool.”

C++ is like that. Want to write an imperative program? No problem. Like your functional programming instead? Every release this decade has only added FP tools. Object-oriented? It’s still around. C++ is called a multi-paradigm language, and it truly deserves that title.

It’s not a perfect language by any means. Most of the problems stem from the necessary backwards-compatibility with C, but everyone knows (or should know) not to use those bits unless absolutely necessary. And they’re increasingly unnecessary. We don’t have to deal with raw pointers anymore. Nobody should ever be calling malloc or creating static arrays (of the int a[42] kind) in modern C++.

Every programming language is a compromise, but I think the good far outweighs the bad in C++. It was a close thing last decade, but now it’s no contest. Smart pointers, templates, RAII, constexpr: C++ just has too many good features. If we could drop the C stuff and get some better string handling, we might just have the perfect language.

Haxe

I’ve written about Haxe before, but I’ll reiterate some of the things I said.

Haxe basically started out as a better ActionScript, but it’s so much more, and it’s one of those languages I wish was better known. Pay attention, language makers, because this is how you do strongly-typed. It’s also got functional stuff if you want it, OOP if that’s your preference, and a little bit of generic programming. It’s not quite as multi-paradigm as C++, but it’s far better than most.

The main flaw with Haxe is probably its not-quite-what-you’d-expect idea of “cross-platform”. Haxe has a compiler, but no way to make native binaries by itself. You can get C++ output, and Haxe can start up GCC or Clang for you, but that’s the best you’re going to get. Beyond that, the library support is lacking, and the documentation could use some work. The language itself, though, is solid.

This is a language I want to use. It’s just hard to come up with a reason I should. Maybe, if Haxe gets more popular, people will stop seeing it as “ActionScript++” and see it for what it is: one of the most interesting programming languages around.

Scala

Put simply, Scala is what Java should be.

Internally, it’s a horrible mess, I’ll grant. I’ve seen some of the “internal” and “private” APIs, and they only serve to make my head hurt. But the outer layer, the one we coders actually use, is just fine. Scala gives you functional when you want it, OOP or imperative when you don’t. It lets you do the “your variable is really a constant” thing that FP guys love so much. (I personally don’t understand that, but whatever.) But it doesn’t force you into anything. It’s multi-paradigm and, more importantly, trusting. Exactly what I want from a programming language.

Scala is where I first learned about pattern matching and actors. That makes sense, as those are two of its biggest strengths, and they aren’t things that show up in too many other languages…at least not the ones people actually use. But they’re not all. Where it can, Scala gives you back the things Java took away, like operator overloading. Yet it’s still compatible with Java libraries, and it still runs on the JVM. (Supposedly, you can even use it on Android, but I’ve never managed to get that to work for anything more complex than “Hello, World!”)

If Java is old-school C, Scala is modern C++. That’s the way I see it. Given the choice, I’d rather use it than plain Java any day of the week. Except for speed, it’s better in almost every way. It may not be the best language out there—it does have its flaws—but it’s one of the best options if you’re in the Java “ecosystem”.

The field

I don’t love every language I come across, nor do I hate all the ones that aren’t listed above. Here are some of the ones that didn’t make either cut:

  • Python: I really like Python, but I can’t love it, for two reasons. One, it has many of the same problems as Ruby regarding speed and parallelism. Two, the version split. Python 3 has some good features, but it’s a total break with Python 2, and it has a couple of design decisions that I simply do not agree with. (Changing print from statement to function is only a symptom, not the whole problem.)

  • Perl: I find Perl fascinating…from a distance. It’s the ultimate in linguistic anarchy. But that comes with a hefty price in both execution speed and legibility. Modern Perl 5 doesn’t look as much like line noise as the older stuff, but it’s definitely not like much else. And the less said about Perl 6, the better.

  • JavaScript: You could say I have a love/hate relationship with JavaScript. It has a few severe flaws (this handling, for instance), but there’s a solid core in there. Remember how I said that C++’s good outweighs its bad? With JS, it’s almost a perfect balance.

  • Lisp: Specifically, that means Clojure here, as that’s the only one I’ve ever really tried. Lisp has some great ideas, but somebody forgot to add in the syntax. It’s too easy to get dizzy looking at all the parentheses, and editors can only do so much.

  • C: As much as I blame C for C++’s faults, I don’t hate it. Yeah, it’s horrible from a security and safety standpoint, but it’s fast and low-level. Sometimes, that’s all you need. And we wouldn’t have the great languages we have today if not for C. Can you imagine a world dominated by Pascal clones?

  • C#: I don’t hate C#. I hate the company that developed it, but the language itself is okay. It’s a better Java that isn’t anywhere near as compatible. Too much of it depends on Microsoft stuff for me to love it, but it fixes Java enough that I can’t hate it. So it falls in the middle, through no fault of its own.

The problem with emoji

Emoji are everywhere these days. Those little icons like 📱 and 😁 show up on our phones, in our browsers, even on TV. In a way, they’re great. They give us a concise way to express some fairly deep concepts. Emotions are hard to sum up in words. “I’m crying tears of joy” is so much longer than 😂, especially if you’re limited to 140 characters of text.

From the programmer’s point of view, however, emoji can rightfully be considered a pox on our house. This is for a few reasons, so let’s look at each of them in turn. In general, these are in order from the most important and problematic to the least.

  1. Emoji are Unicode characters. Yes, you can treat them as text if you’re using them, but we programmers have to make a special effort to properly support Unicode. Sure, some languages say they do it automatically, but deeper investigation shows the hollowness of such statements. Plain ASCII doesn’t even have room for all the accented letters used by the Latin alphabet, so we need Unicode, but that doesn’t mean it’s easy to work with.

  2. Emoji are on a higher plane. The Unicode character set is divided into planes. The first 65,536 code points are the Basic Multilingual Plane (BMP), running from 0x0000 to 0xFFFF. Each further plane is considered supplementary, and most emoji fall in the plane just above the BMP, the Supplementary Multilingual Plane, with code points around 0x1F000. At first glance, the only problem seems to be an additional byte required to represent each emoji, but…

  3. UCS-2 sucks. UCS-2 is the fixed-width predecessor to UTF-16. It’s obsolete precisely because it can’t handle higher planes, but we still haven’t rid ourselves of it. JavaScript, among others, essentially uses UCS-2 strings, and this is a very bad thing for emoji. They have to be encoded as a surrogate pair, using two otherwise-invalid code points in the BMP (there’s a sketch of the math after this list). It breaks finding the length of a string. It breaks string indexing. It even breaks simple parsing, because…

  4. Regular expressions can’t handle emoji. At least in present-day JavaScript, they can’t. And that’s the most used language on the web. It’s the front-end language of the here and now. But the JS regex works in UCS-2, which means it doesn’t understand higher-plane characters. (This is getting fixed, and there are libraries out there to help mitigate the problem, but we’re still not to the point where we can count on full support.)

  5. Emoji are hard to type. This applies mostly to desktops. Yeah, people still use those, myself included. For us, typing emoji is a complicated process. Worse, it doesn’t work everywhere. I’m on Linux, and my graphical applications are split between those using GTK+ and those using Qt. The GTK+ ones allow me to type any Unicode character by pressing Ctrl+Shift+U and then the hexadecimal code point. For example, 😂 has code point 0x1F602, so I typed Ctrl+Shift+U, then 1f602, then a space to actually insert the character. Qt-based apps, on the other hand, don’t let me do this; in an impressive display of finger-pointing, Qt, KDE, and X all put the responsibility for Unicode handling on each other.
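The surrogate-pair math mentioned in point 3 is mechanical enough to fit in a few lines; here’s a sketch in C-style code (the function name is mine):

#include <cstdint>
#include <cstdio>

// Split a supplementary-plane code point into its UTF-16 surrogate pair.
void to_surrogates(uint32_t cp, uint16_t& high, uint16_t& low)
{
    cp -= 0x10000;                                    // leaves a 20-bit number
    high = static_cast<uint16_t>(0xD800 + (cp >> 10));   // top ten bits
    low = static_cast<uint16_t>(0xDC00 + (cp & 0x3FF));  // bottom ten bits
}

int main()
{
    uint16_t high, low;
    to_surrogates(0x1F602, high, low);      // U+1F602, "tears of joy"
    std::printf("%04X %04X\n", high, low);  // prints D83D DE02
}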

So, yeah, emoji are a great invention for communication. But, speaking as a programmer, I can’t stand working with them. Maybe that’ll change one day. We’ll have to wait and see.

Software internals: Classes

Whether you use object-oriented programming or not, you’ve most likely encountered the class and the object. You’re probably coding in a language that has or uses both of them. If you’re not (say you’re using C or something), you still might be using the idea of a class and an object. How they actually work depends on the language, library, or framework in question, but the basic implementation isn’t too different from one to the next.

Objects, no matter what kind you’re using, are data structures. Classes are data types. In many languages, both of these work together to make OOP possible. But you can have one without the other. JavaScript, for instance, has objects, but no classes. (ES6’s class is mere syntactic sugar over a prototype-based implementation.) Since that’s more common than the alternative (classes but no objects), we’ll look at objects first, then add in the bits that classes or their replacements provide.

Objects

Objects, from a language-neutral point of view, are made up of a number of different variables usually called fields. They’re a composite kind of data, like the C struct, except that higher-level languages have a lot more support for doing things with them. But that’s all they really are: a bunch of fields.

That already suggests one way of laying out an object in memory: allocate the fields consecutively. For a struct, you don’t need to do anything else except initialize the block of memory. If one of your fields is itself an object, then that’s okay. You can nest them. You have (rather, the compiler has) the knowledge of what goes where, so it’s not a problem.

The above option works fine in a system where objects are static, where their layouts don’t change, only the contents of their fields. Dynamic languages that let you add and remove object fields need a different approach. One that’s commonly used is a map. Basically, field names (as strings or a special, faster “symbol” type) are associated with chunks of data, and the object is nothing more than a collection of name-value pairs. Using hashes and other tricks (that we may visit in a future post), this can be very fast, though never as fast as direct memory access.
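As a sketch of the map approach (names mine, and with a single value type to keep it short—real engines store tagged values and intern the field names):

#include <string>
#include <unordered_map>

// A dynamic "object": nothing but field names mapped to values.
struct DynamicObject
{
    std::unordered_map<std::string, double> fields;
};

int main()
{
    DynamicObject point;
    point.fields["x"] = 3.0;  // add a field at runtime
    point.fields["y"] = 4.0;
    point.fields.erase("y");  // ...and remove it just as easily
}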

Methods

Methods are functions that just happen to have a special binding to a particular type of object. Different object systems define them in different ways, but that core is always the same. When code calls a method, that method “knows” which object it belongs to, even though it was (usually) defined generically.

Python and object-oriented C libraries like GTK+ make this explicit: every method takes an object as its first parameter. JavaScript takes a different tack, adding methods to the prototype object. In most other languages with methods, functions become methods implicitly, by where they’re defined. C++, C#, and Java, for instance, simply let you define functions inside the class, and those are the methods for objects of that class. When they’re called, they receive a “hidden” parameter, a reference to the object they’re called on: this.

As functions can contain arbitrary code, we can’t exactly put them in memory with the fields. One, it kills caching, because you might have kilobytes of method code in between fields. Two, some operating systems have protection systems in place to prevent code and data from intermingling, for very good security reasons.

Instead of having the code and data together, we must separate them. But that’s fine. They don’t need to be mixed in together anyway. However methods are defined, they’ll always have some sort of connection to an object—a pointer or reference, in low-level terms—and they can use that to access its fields. Conversely, the structure of an object can contain function pointers that refer to its methods.
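Here’s a sketch of both halves of that connection in C-style code (all names mine): a “method” is an ordinary function taking its object explicitly, and the object stores a pointer to it.

struct Counter;                          // forward declaration
using Method = void (*)(Counter* self);  // a "method" gets its object explicitly

struct Counter
{
    int value;         // a field
    Method increment;  // a function pointer standing in for a method
};

void counter_increment(Counter* self)    // the explicit 'this', GTK+-style
{
    ++self->value;
}

int main()
{
    Counter c{0, counter_increment};
    c.increment(&c);   // a "method call": the object is passed by hand
}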

Inheritance

We don’t really start getting into object-oriented programming until we add in inheritance. Not coincidentally, this is also where the internals start to become more and more complex. Simple single inheritance lets an object take on parts of a parent object. It can use the parent’s methods as if they were its own, as well as some of the fields. Multiple inheritance is the same thing, but with more than one parent; it can get quite hairy, so most common languages don’t allow it.

The methods don’t really care whether they’re operating on a base or derived class, a parent or a child. For static languages, this is because of the way objects using inheritance are laid out in memory. Broadly speaking, the parent object’s fields come first, and those are followed by the child’s fields. As you go further down the inheritance chain, this means you can always backtrack to the root object. Just by doing that, we get a few things for free. A pointer to an object is the same as a pointer to its parent, for example. (GTK+ makes this explicit: objects are structs, and a child object simply lists its parent as its first field. Standard C memory access does the rest. Problem is, you have to use pointers for this to work, otherwise you get slicing, a mistake every C++ programmer knows all too well.)
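A minimal sketch of that parent-first layout, GTK+-style (type names mine):

struct Animal             // the "parent"
{
    int age;
};

struct Dog                // the "child"
{
    Animal base;          // parent fields come first...
    int bones_buried;     // ...followed by the child's own
};

// Because the parent sits at offset 0, a pointer to a Dog is also, in
// effect, a pointer to an Animal; C-based object systems lean on
// exactly this property.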

Dynamic languages don’t get this benefit, but they can emulate it. Objects might have a hidden field pointing to the parent object, or they may just copy the parent’s map of names and values into their own as their first act. The latter means extra metadata to keep track of which fields were defined where, while the former is slower for every access of an inherited field. It’s a classic size/speed tradeoff; most languages opt for the faster, but slightly more bloated, map-mixing approach.

For multiple inheritance, well, it’s a lot harder. In dynamic languages, it’s not quite as crazy, but the order of inheritance can make a difference. As an example, take a class C that inherits from classes D and E. If both of those have a field named foo, there’s a problem. C can’t have two foos, but the different base classes might use theirs in different ways. (The only modern static language I know that allows multiple inheritance is C++, and I don’t want to try to explain the memory scheme it uses. I’ll leave that to you to find out.)

Polymorphism

What makes object-oriented programming truly object-oriented is polymorphism. Child classes are allowed to effectively redefine the methods they inherit from their parents, customizing them, but the caller neither knows nor cares about this fact. This is used for abstraction, and it’s not immediately obvious how they do it.

Dynamic languages have a map for their methods, as they do for fields. For them, polymorphism is as easy as changing an entry in the map to refer to a different function…if that’s the way they choose to do it. Another option is only keeping track of the methods directly defined by this object, deferring all others to the parent class, which might pass the request up another level, and so on. For a language using single inheritance, this is linear, and not too bad. With multiple inheritance, method resolution becomes a tree walk, and it can get quite intensive.

Static languages can take advantage of the fixed nature of their classes and reduce polymorphism to a table of function pointers. C++ programmers know this as the v-table (or vtbl), so called because polymorphic methods in that language are prefixed with the keyword virtual; hence, “virtual table”. This table is usually kept in memory somewhere close to the rest of the object, and it will contain, at the very least, a function pointer for each polymorphic method. Those that aren’t overriding a method from a parent class don’t have to be listed, but not every language lets you make that decision.
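To make the idea concrete, here’s a hand-rolled version of what such a compiler generates (all names mine): each object carries a pointer to a shared table of function pointers.

struct ShapeVTable
{
    double (*area)(const void* self);  // one slot per polymorphic method
};

struct Circle
{
    const ShapeVTable* vtable;  // the hidden pointer a compiler would add
    double radius;
};

double circle_area(const void* self)
{
    const Circle* c = static_cast<const Circle*>(self);
    return 3.14159 * c->radius * c->radius;
}

const ShapeVTable circle_vtable = { circle_area };

// A "virtual call": look the function up in the object's own table.
double area_of(const Circle& c)
{
    return c.vtable->area(&c);
}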

Construction and destruction

An object’s constructor isn’t necessarily a method. That’s because it also has to do the work of allocating memory for the object, setting up any inheritance-related framing (v-tables, prototypes, whatever), and general bookkeeping. Thus, the constructor doesn’t even have to be connected to the class. It could just as easily be a factory-like function. Destructors are the same way. They aren’t specifically methods, but they’re too bound to a class to be considered free functions. They have to deallocate memory, handle resource cleanup, call parent destructors, and so on.

On the lower levels, constructors and destructors aren’t really part of an object. The object can never call them directly in most languages. (In garbage-collected languages, nobody can call a destructor directly!) Therefore, they don’t need to know where they are. The same is generally true for the C++-specific notions of copy and move constructors. The only wrinkle comes with inheritance, in the case of destructors, and then only in C++, where a class can have polymorphic methods but a non-polymorphic destructor; deleting a derived object through a base-class pointer then does the wrong thing, a classic newbie mistake.

Next up

I’ll admit that this post felt a lot smoother than the last two. Next time, we’ll look at another data structure that shows up everywhere, the list. Linked lists, doubly-linked lists, list processing, we’ll see it all. As it turns out, there’s not too much to it. Maybe those Lisp guys were onto something…

Languages I hate

Anyone who has been writing code for any length of time—anyone who isn’t limited to a single programming language—will have opinions on languages. Some are to be liked, some to be loved, and a few to be hated. Naturally, which category a specific language falls into depends on who you’re talking to. In the years I’ve been coding, I’ve seriously considered probably a dozen different languages, and I’ve glanced at half again as many. Along the way, I have seen the good and the bad. In this post, I’ll give you the bad, and why I think they belong there. (Later on, I’ll do the same for my favorite languages, of course.)

Java

Let’s get this one out of the way first. I use Java. I know it. I’ve even made money writing something in it, which is more than I can say for any other programming language. But I will say this right now: I have never met anyone who likes Java.

The original intent of Java was a language that could run “everywhere” with minimal hassle. Also, it had to be enough like C++ to get the object-oriented goodness that was in vogue in the nineties, but without all that extraneous C crap that only led to buffer overflows. So everything is an object—except for “primitive” types like, say, integers. You don’t get to play with pointers—but you can get null-pointer exceptions. In early versions of the language, there was no way to make an algorithm that worked with any type; the solution was to cast everything to Object, the root class underlying the whole system. But then they bolted on generics, in a mockery of C++ templates. They do work, except for the niggling bit called type erasure.

And those are just some of the design decisions that make Java unbearable. There’s also the sheer verbosity of the language, a problem compounded by the tendency of new Java coders to overuse object-oriented design patterns. Factories and abstract classes have their place, but that place is not “everywhere I can put them”. Yes, that’s the fault of inexperienced programmers, but the language and its libraries (standard and 3rd-party) only reinforce the notion.

Unlike most of the other languages I hate, I have to grin and bear it with Java. It’s too widespread to ignore. Android uses it, and that’s the biggest mobile platform out there. Like it or not, Java won’t go away anytime soon. But if it’s possible, I’d rather use something like Scala.

Ruby

A few years ago, Ruby was the hipster language of choice, mostly thanks to the Rails framework. Rails was my first introduction to Ruby, and it left such a bad taste in my mouth that I went searching for something better. (Haven’t found it yet, but hope springs eternal…) Every bit of Ruby I see on the Internet only makes me that much more secure in my decision.

This one is far more subjective. Ruby just looks wrong to me, and it’s hard to explain why. Most of it is the cleverness it tries to espouse. Blocks and symbols are useful things, but the syntax rubs me the wrong way. The standard library methods let you write things like 3.times, which seems like it’s trying to be cute. I find it ugly, but that might be my C-style background. And then there’s the Unicode support. Ruby had to be dragged, kicking and screaming, into the modern world of string handling, and few of the reasons why had anything to do with the language itself.

Oh, and Ruby’s pitifully slow. That’s essentially by design. If any part of the code can add methods to core types like integers and strings, optimization becomes…we’ll just say non-trivial. Add in the Global Interpreter Lock (a problem Python also has), and you don’t even get to use multithreading to get some of that speed back. No wonder every single Ruby app out there needs such massive servers for so little gain.

And even though most of the hipsters have moved on, the community doesn’t seem likely to shed the cult-like image that they brought. Ruby fans, like those of every other mildly popular language, are zealous when it comes to defending their language. Like the “true” Pythonistas and those poor, deluded fools who hold up PHP as a model of simplicity, Ruby fanboys spin their language’s weaknesses into strengths.

Java is everywhere, and that helps spread out the hate. Ruby, on the other hand, is concentrated. Fortunately, that makes it easy to ignore.

Haskell

This one is like stepping into quicksand—I don’t know how far I’m going to sink, and there’s no one around to help me.

Haskell gets a lot of praise for its mathematical beauty, its almost-pure functional goodness, its concision (quicksort is only two lines!), and plenty of other things. I’ll gladly say that one Haskell application I use, Pandoc, is very good. But I wouldn’t want to be the one developing it.

The Haskell fans will be quick to point out that I started with imperative programming, and thus I don’t understand the functional mindset. Some would even go as far as Dijkstra, saying that I could never truly appreciate the sheer beauty of the language. To them, I would say: then who can? The vast majority of programmers didn’t start with a functional programming language (unless you count JavaScript, but it still has C-like syntax, and that’s how most people are going to learn it). A language that no one can understand is a language no one can use. Isn’t that what we’re always hearing about C++?

But Haskell’s main problem, in my opinion, is its poor fit to real-world problems. Most things that programs need to do simply don’t fit the functional mold. Sure, some parts of them do, but the whole doesn’t. Input/output, random numbers, the list goes on. Real programs have state, and functional programming abhors state. Haskell’s answer to this is monads, but the only decent description of a monad I’ve ever seen had to convert it to JavaScript to make sense!

I don’t mind functional programming in itself. I think it can be useful in some cases, but it doesn’t work everywhere. Instead of a “pure” functional language, why can’t I have one that lets me use FP when I can, but switch back to something closer to how the system works when I need it? Oh, wait…

PHP

I’ll just leave this here.

Software internals: Strings

A string, as just about any programmer knows, is a bit of text, a sequence of characters. Most languages have some built-in notion of strings, usually as a fundamental data type on par with integers. A few older programming languages, including C, don’t have a separate “string” type, but they still have strings. Even many assemblers allow you to define strings in your assembly language code, though you’re left to deal with them yourself.

The early string

At its heart, a string really isn’t much more than a bunch of characters. It’s a sequence, like an array. Indeed, that’s one way of “making” strings: stuff some characters into an array that’s big enough to hold them. Very old code often did exactly that, especially with strings whose contents were known ahead of time. And there are plenty of places in modern C code where text is read into a buffer—nothing more than an array—before it is turned into a string. (This usually leads to buffer overflows, but that’s not the point.)

Once you actually need to start working with strings, you’ll want something better. Historically, there were two main schools of thought on a “better” way of representing strings. Pascal went with a “length-prefixed” data structure, where an integer holding the number of characters in the string was followed by the contents. For example, "Hi!" as a Pascal string might be laid out in memory as the hexadecimal 03 48 69 21. Of course, this necessarily limits the length of a string to 255, the highest possible value of a byte. We could make the length field 16 bits (03 00 48 69 21 on a little-endian x86 system), bringing that to 65535, but at the cost of making every string a byte longer. Today, in the era of terabyte disks and gigs of memory, that’s a fair trade; not so in older times.
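In C terms, a length-prefixed string might look something like the sketch below. (This is purely illustrative; the PStr name and the fixed-size array are inventions for the example.)

    #include <stdio.h>

    /* A sketch of a Pascal-style string: one length byte up front,
       then the characters. No null terminator required. */
    struct PStr {
        unsigned char len;   /* 0..255 -- the hard length limit */
        char data[255];      /* the contents themselves */
    };

    int main(void) {
        struct PStr s = { 3, "Hi!" };             /* memory: 03 48 69 21 ... */
        printf("length: %u\n", (unsigned)s.len);  /* one field read, no scan */
        printf("text: %.*s\n", (int)s.len, s.data);
        return 0;
    }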

But Pascal was always intended more for education and computer science than for run-of-the-mill software development. On the other side of the fence, C took a different approach: the null-terminated string. C’s strings aren’t their own type, but an array of characters ending with a null (00) byte. Thus, our example in C becomes 48 69 21 00.

Which style of string is better is still debated today, although modern languages typically don’t use a pure form of either of them. Pascal strings have the advantage of easily finding the length (it’s right there!), while C’s strlen has to count characters. C strings also can’t have embedded null bytes, because all the standard functions will assume that the null is only at the end. On the other hand, a few algorithms are easier with null-terminated strings, they can be as long as you like, and they’re faster if you don’t need the length.
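To make that tradeoff concrete, here’s roughly what strlen has to do for a null-terminated string. This is a naive sketch, not the heavily optimized routine a real C library ships:

    #include <stddef.h>

    /* A naive strlen: walk the array until the 00 byte. This is O(n),
       and it's also why embedded nulls can't work -- the first null
       found is taken as the end of the string. */
    size_t my_strlen(const char *s) {
        const char *p = s;
        while (*p != '\0')
            p++;
        return (size_t)(p - s);
    }

A Pascal-style string answers the same question with a single field read.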

In modern times

In today’s languages, the exact format of a string doesn’t matter. What you see as the programmer is the interface. Most of the time, that interface is similar to an array’s, except with a few added functions for comparison and the like. In something like C#, you can’t really make your own string type, nor would you want to. But it’s helpful to know just how these things are implemented, so you’ll know their strengths and weaknesses.

Since everything ultimately has to communicate with something written in C, there’s probably a conversion to a C-style string somewhere in the bowels of any language. That doesn’t mean it’s what the language works with, though. A Pascal-like data structure is perfectly usable internally, and it’s possible to use a “hybrid” approach.

Small strings are a little special, too. As computers have gotten more powerful, and their buses and registers have grown wider, there’s now the possibility that strings of a few characters can be loaded in a single memory access. Some string libraries use this to their advantage, keeping a “small” string in an internal buffer. Once the string becomes bigger than a pointer (8 bytes on a 64-bit system), putting it in dynamic memory is a better deal, space-wise. (Cache concerns can push the threshold of this “small string optimization” up a bit.)
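As a rough sketch of the idea (the SsoStr name, the 15-byte threshold, and the layout are all invented for illustration, and error handling is omitted; real libraries are considerably more clever):

    #include <stdlib.h>
    #include <string.h>

    /* Small string optimization, sketched: short strings live in the
       inline buffer, so no heap allocation happens at all. */
    #define SSO_MAX 15

    struct SsoStr {
        unsigned char is_small;      /* 1 = text in buf, 0 = on the heap */
        union {
            char buf[SSO_MAX + 1];   /* inline storage, null-terminated */
            char *ptr;               /* heap storage for longer strings */
        } u;
    };

    void sso_init(struct SsoStr *s, const char *text) {
        size_t n = strlen(text);
        if (n <= SSO_MAX) {
            s->is_small = 1;
            memcpy(s->u.buf, text, n + 1);   /* no malloc for short text */
        } else {
            s->is_small = 0;
            s->u.ptr = malloc(n + 1);
            memcpy(s->u.ptr, text, n + 1);
        }
    }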

There are also a few algorithms and optimizations that string libraries can use internally to speed things up. “Copy-on-write” means just that: a new copy of a string isn’t created until there’s a change. Otherwise, two variables can point to the same memory location. The string’s contents are the same, so why bother taking up space with exact copies? This also works for “static” strings whose text is fixed; Java, for one, is very aggressive in interning strings to eliminate duplicates.
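Reference counting is one common way to implement the copy-on-write bookkeeping. A bare-bones sketch, with invented names, no error handling, and no thread safety:

    #include <stdlib.h>
    #include <string.h>

    /* An illustrative copy-on-write buffer: several handles share one
       buffer, counted by refs, until somebody writes to it. */
    struct CowBuf {
        int refs;
        char *data;
    };

    /* "Copying" is just bumping the count -- no new buffer yet. */
    struct CowBuf *cow_share(struct CowBuf *b) {
        b->refs++;
        return b;
    }

    /* The real copy is deferred until a write to shared data. */
    void cow_set_char(struct CowBuf **bp, size_t i, char c) {
        struct CowBuf *b = *bp;
        if (b->refs > 1) {                 /* someone else sees this data */
            struct CowBuf *copy = malloc(sizeof *copy);
            size_t n = strlen(b->data);
            copy->refs = 1;
            copy->data = malloc(n + 1);
            memcpy(copy->data, b->data, n + 1);   /* pay for the copy now */
            b->refs--;
            *bp = b = copy;
        }
        b->data[i] = c;
    }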

UTF?

Nowadays, there’s a big problem treating strings as nothing more than an array of characters. That problem is Unicode. Of course, Unicode is a necessary evil, and it’s a whole lot better than the mess of mutually incompatible solutions for international text that we used to have. (“Used to”? Ha!) But Unicode makes string handling exponentially harder, particularly for C-style strings, because it breaks a fundamental assumption: one byte equals one character.

Since the world’s scripts together have far more than 256 characters (the most a byte can distinguish), we have to do something. So we have two options. One is a fixed-size encoding, where each character—or code point—takes the same amount of space. Basically, it’s ASCII extended to more bits per character. UTF-32 does this, at the huge expense of making every code point 4 bytes. Under this scheme, any plain ASCII string is inflated to four times its original size.

The alternative is a variable-length encoding, as in UTF-8. Here, part of the “space” in the storage unit (a byte for UTF-8, 2 bytes for UTF-16) is reserved to mark a “continuation”. For example, the character ë has the Unicode code point U+00EB. In UTF-8, that becomes C3 AB. The simple fact of the first byte being greater than 7F (decimal 127) marks this as a non-ASCII character, and the high bits of that first byte say how many “extended” bytes follow. In UTF-32, by contrast, ë comes out as 000000EB, twice as big.
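The packing itself is mostly bit-shifting. Here’s a sketch of a UTF-8 encoder, trimmed to code points up to U+FFFF (the 4-byte case and the checks for invalid code points like surrogates are left out):

    #include <stdio.h>

    /* Encode one code point as UTF-8; returns the byte count,
       or 0 for anything this simplified version doesn't handle. */
    int utf8_encode(unsigned int cp, unsigned char out[4]) {
        if (cp < 0x80) {                    /* plain ASCII: one byte */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {            /* two bytes: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {          /* three bytes */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        return 0;
    }

    int main(void) {
        unsigned char buf[4];
        int n = utf8_encode(0x00EB, buf);   /* the ë from above */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);        /* prints: C3 AB */
        printf("\n");
        return 0;
    }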

The rules for handling Unicode strings are complex and unintuitive. Once you add in combining diacritics, the variety of spaces, and all the other esoterica, Unicode becomes far harder than you can imagine. And users of high-level, strings-are-black-boxes languages aren’t immune. JavaScript, for instance, effectively exposes its strings as UCS-2, a 16-bit fixed-width encoding. Until very recently, if you wanted to work with “high plane” characters—including emoji—you had some tough times ahead. So there’s still the possibility, in 2016, that you might need to know the internals of how strings work.

Leap Day special

Do you know what today is? Yesterday was February 28, so some badly-written programs might think it’s the 1st of March, but it’s not. Yep, it’s every programmer’s worst nightmare: Leap Day. You’re not getting a “code” post on Monday because I can’t read a calendar. Today’s special, so this is a special post.

Dates are hard. There’s no getting around that. Every part of our calendar seems like it was made specifically to drive programmers insane. Most years have 365 days, but every fourth one has 366. Well, unless it’s a century year, in which case it’s back to 365. Except every 400 years, like in the year 2000—those are leap years again. Y2K wasn’t as bad as it could’ve been, true, but there were quite a few hiccups. (Thanks to JavaScript weirdness, those never really stopped. Long after millennial fever died down, I saw a website reporting the current year as “19108”!) But 2000 was a leap year, and that surprised older software almost as much as actually being in the year 2000.
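For what it’s worth, the whole Gregorian rule fits in a few lines, which makes it all the more impressive how often it gets botched. A quick sketch in C; the 400-year case at the top is the one older software forgot:

    #include <stdbool.h>

    /* Gregorian leap year rule: every 4th year, except century years,
       except every 400th year. */
    bool is_leap_year(int year) {
        if (year % 400 == 0) return true;   /* 2000: a leap year after all */
        if (year % 100 == 0) return false;  /* 1900, 2100: not leap years */
        return year % 4 == 0;               /* 1996, 2016: ordinary leaps */
    }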

It gets worse. How long is a month? The answer: it depends. Weeks are always 7 days, thankfully, but you can’t divide 365 or 366 by 7 without a remainder. You’ll always have one or two days left over. And a day isn’t necessarily 24 hours, thanks to DST. Topping it all off, you can’t even assume that there are 60 seconds in a minute, because leap seconds. (That one is subject to change, supposedly. I’ll believe it when I see it.)

That’s just for the calendar we use today in the Western world. Add in everything else involving dates, and you have a recipe for disaster. The days in a month are numbered consecutively, right? Wrong! If you’re using, say, the Jewish calendar, you can’t even guarantee that two years have the same number of months. The Islamic calendar once depended on the sighting of the moon. The Maya had a calendar with 5 or 6 days that weren’t part of any month. Even our own Gregorian calendar doesn’t have a year zero. At some point, you’re left wondering how society itself has made it this far.

I’ll leave it to sites like The Daily WTF for illustrated examples of date handling gone wrong. God knows there are enough to choose from. (Dates in Java are especially horrendous, I can say from experience.) Honestly, I’d have to say that it’s more amazing to see date handling done right. By “right”, I mean handling all the corner cases: leap seconds, leap years, week numbers, time zones, DST adjustments, etc. ISO has a standard for dates, but…good luck with that.

So don’t be surprised to see a few things break today. And if you’re not a programmer, don’t feel bad. Sometimes, it feels like we can’t figure this out any better than you. That’s never more true than on this day.