Assembly: optimization in the past and present

In this post, I won’t be discussing assembly language in any depth. Rather, I want to focus on one of the main reasons to use assembly: optimization. Actually, it might be the main reason today, because there’s not much need for assembly coding these days; it’s only when we want something to be as fast as possible that it comes into play.

Also, I’m moving away from the 6502 for this one, instead using the x86 architecture for my examples. Why? Because x86 is still the leading processor family for desktops, and so much has been written about it over the decades. There’s a lot of optimization info out there, far more than for just about any other architecture. Yes, ARM is growing, especially in the lower-end part of the market where assembly can still be very useful, but ARM—due to its very nature—is so fragmented that it’s hard to give concrete examples. Also, because x86 is so long-lived, we can trace the development of various processor features through its evolution. For that, though, we’ll need a bit of a history lesson.

Starting gates

The first microprocessors, from a bird’s-eye view, weren’t exactly complicated in function. They took an instruction from memory, decoded it, executed it, then moved on, sometimes jumping around the executable code space. Decoding each instruction and performing it were the only real hard parts. That’s one reason why RISC was so highly touted, as the smaller, more fundamental instruction set required less chip space for decoding and execution. (Look up something like VAX assembly for the opposite—CISC—end of the spectrum.)

Fetching the instruction was a simple load from memory, something every processor does as a matter of course. Decoding required a major portion of the circuit (the 6502 used a programmable array a bit like a modern FPGA, except that its function was fixed in the factory) but a comparatively small portion of processor time. Executing could require more memory accesses for the instruction’s operands, and it could take a lot of time, especially for something complex like multiplication—an ability the 6502, among others, lacks.

The birth of parallelism

But memory has always been slower than the processor itself. On all but the most complicated instructions, memory access takes the most time of any phase of execution. Thus, the prefetch queue was born. In a sense, this was the forerunner of today’s cache. Basically, it tried to predict the future by fetching the next few bytes from memory ahead of time. That way, the long latency of each RAM access could be amortized over several instructions.

The problem with the prefetch queue, as with any cache, comes with branching. Branches, as we saw in earlier posts, are the key to making decisions in assembly language. But they force the processor to jump around, instead of following a linear path through memory. A taken branch, then, negates the advantage of the prefetch queue: everything it has gathered gets thrown away.

Processor designers (and assembly programmers) tried a few methods of working around the cost of branching. That’s why, for a long stretch of time, loop unrolling was considered a very important optimization technique. If you needed to run a particular group of instructions, say, ten times, it was often a bit faster to “copy and paste” the assembly instructions ten times than it was to set up a branching loop. It used more space, but the speed gains made up for that.
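
Here’s a minimal sketch of the difference, in 16-bit Intel syntax (summing ten bytes at a label called buffer; the names are just for illustration):

        mov  cx, 10
        mov  si, buffer
        xor  al, al
add_loop:
        add  al, [si]        ; the work we actually care about
        inc  si
        dec  cx
        jnz  add_loop        ; loop overhead, paid on (nearly) every pass

; The same thing, unrolled by hand: no bookkeeping, no branches,
; but roughly ten times the code bytes.
        mov  si, buffer
        xor  al, al
        add  al, [si]
        add  al, [si+1]
        add  al, [si+2]
        add  al, [si+3]
        add  al, [si+4]
        add  al, [si+5]
        add  al, [si+6]
        add  al, [si+7]
        add  al, [si+8]
        add  al, [si+9]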

Another optimization trick was rearranging branch instructions so that they would fall through (that is, not be taken) more often than not, since only a taken branch empties the prefetch queue. For example, the x86 has a pair of instructions, JZ and JNZ, that branch if the zero flag is set or clear, respectively. (These are equivalent to the 6502’s BEQ and BNE, except that the x86 has more registers and far more instructions that can affect the zero flag.) If you have a section of code that is run only when an argument is 0, and 0 rarely shows up, the naive way of writing it would be to skip over that section with a JNZ. But it might be faster (on these earlier processors, at least) to put the “only if 0” code at the end of the subroutine (or some other place that’s out of the way) and use JZ to branch to it when you need it.
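
Something like this sketch, say (the TEST instruction, the zero_count and normal_work labels, and the argument arriving in AX are all my own inventions here):

; Naive layout: the zero-only code sits inline, so the common case
; has to take the JNZ, flushing the prefetch queue on nearly every call.
naive:
        test ax, ax
        jnz  not_zero
        inc  word [zero_count]
not_zero:
        call normal_work
        ret

; Rearranged: the common case falls straight through, and the rarely
; taken JZ jumps to the out-of-the-way code at the end.
better:
        test ax, ax
        jz   is_zero
do_work:
        call normal_work
        ret
is_zero:
        inc  word [zero_count]
        jmp  do_work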

In the pipeline

Eventually, the interests of speed caused a fundamental shift in the way processors were made. This was the birth of the pipeline, which opened a whole new world of possibilities, but also brought new optimization problems. The prefetch queue described above was one of the first visible effects of pipelining, but not the last.

The idea of a pipeline is that the processor’s main purpose, executing code, is broken into smaller tasks, each given over to a dedicated circuit. These can then work on their own, like stations on an assembly line. The instruction fetcher gets the next instruction, passes it on to the decoder, and so on. A well-oiled machine, in theory. In practice, it’s hard to get all the parts to work as one, and sometimes the pipeline would be stalled, waiting on one part to finish its work.

The beauty of the pipeline is that the stages can work independently of one another. Once an instruction has been fetched, the fetcher isn’t needed for it anymore, so it can do something else. Specifically, it can fetch the next instruction. If the timing works out, it can fill up the prefetch queue and keep it topped off whenever it has free time.

Fortune-telling

But branches are the wrenches in the works. Since they break the linear flow of instructions, they force the pipeline to stall. This is where the processor designers had to get smart. They had to find a way of predicting the future, and thus branch prediction was popularized.

When it works, branch prediction can completely negate the cost of a conditional jump. (Of course, when it fails, it stalls the whole thing, but that’s no worse than not predicting at all.) From an assembly language point of view, it means that we could mostly ditch the clever tricks like loop unrolling and condition negation. They would still have their uses, but they wouldn’t need to be quite so common. That’s a good thing, because the extra code size brought by loop unrolling affected another part of these newfangled processors: the cache.

Cache really came about as another way to make memory access faster. The method was a little roundabout, but it worked, and cache has stuck with us through today. It’s only getting bigger, too; most of the physical space on today’s processors is, in fact, cache memory. Many chips actually have more memory on the die than the 4 MB my first PC had in total.

The trick to cache comes from looking at how code accesses memory. As it turns out, there’s a pattern, and it’s called the principle of locality. Put simply, reading one memory location is a pretty good indicator that you’re going to be reading the ones right next to it. If we could just load all of those at once, then we’d save a ton of time. So that’s what they did. Instead of loading memory a byte or a word at a time, they started pulling in 16 or more bytes at once, a whole cache line. And it was faster, but only while you stayed inside the regions of memory already loaded into the cache.
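
To make the idea concrete, here’s a sketch (16-bit register names for consistency with the rest of the post; the buffer label and the 64-byte line size are assumptions). Both loops perform 256 one-byte reads, but the first walks memory in order, so most reads hit a line the cache already holds, while the second skips a whole line between reads and misses nearly every time:

        mov  si, buffer
        mov  cx, 256
walk:
        add  al, [si]        ; sequential: the neighbors came along for free
        inc  si
        dec  cx
        jnz  walk

        mov  si, buffer
        mov  cx, 256
skip:
        add  al, [si]        ; strided: every read lands on a fresh line
        add  si, 64
        dec  cx
        jnz  skip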

Soon, cache was neither fast enough nor big enough, and designers had to find ways to fix both problems at once. And that’s where we are today. Modern x86 chips have three levels of cache. The first, L1, is the smallest, but also the fastest. L2 cache is a bit slower, but there’s more of it. And L3 is the slowest (though still faster than RAM), but big enough to hold the entirety of, say, Windows 95.

The present state of affairs

So now the optimization strategy once again focuses on space. Speed is mostly a non-factor, as desktop x86 processors can execute most of their instructions in a single clock cycle, branch prediction saves us from the cost of jumps, and huge amounts of cache mean fewer of the horrifically slow accesses to main memory. But cache is limited, especially the ultra-fast L1. Every instruction counts, and we want them all to be as close together as possible. (Data works the same way, but we’ll ignore it for now.) Unrolling loops, for example, is now a waste of valuable cache.

A few other optimizations have been lost along the way, made obsolete by the march of progress. One of my favorite examples is that of clearing a register. It’s a common need in assembly language, and the favored way of doing it in the early x86 days was the XOR instruction, e.g., XOR AX, AX. Exclusive-OR, when given the same value twice, always produces 0, and this method was quicker (and shorter) than loading an immediate 0 value (MOV AX, 0).
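
Side by side, the two look like this (byte counts per the original 8086 encodings):

        xor  ax, ax          ; 2 bytes (31 C0), no immediate to fetch
        mov  ax, 0           ; 3 bytes (B8 00 00), and the zero itself has to
                             ; come in through the prefetch queue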

The self-XOR trick was undone by a combination of factors. The first was register renaming, which essentially gave the processor a bunch of “virtual” registers that it could use as internal scratch space, moving data to and from the “real” ones as needed. The second was out-of-order execution, and that takes a little more explaining.

If you’ve ever looked at the optimized assembly output of your favorite high-level compiler (and you should, at least once), then you might have noticed that things aren’t always where you put them. Every language allows some form of rearranging, as long as the compiler can prove that it won’t affect the outcome of the program. For example, take a look at this C snippet:

void f(int);   /* some function defined elsewhere */
int a, d;
int b = 4;
int c = 5;
for (a = 0; a < b; a++) {
    f(a);
}
d = b * c;   /* independent of anything the loop changes */

The final statement, assigning the value 20 to d, can be moved before the loop, since there’s no way that loop can change the value of b or c; the only thing the loop changes, a, has nothing to do with any other variable. (We’re assuming more code than this; otherwise the compiler would fold b, c, and d away entirely and just use the constant 20 wherever d shows up.)
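
In rough terms, the optimized output might look something like the sketch below, in the 16-bit style of the rest of this post. (The register choices and the convention of passing f’s argument in AX are inventions of mine, not any particular compiler’s output.) Note that the multiplication has been folded and hoisted above the loop:

        mov  word [d], 20    ; d = b * c, computed at compile time
        xor  si, si          ; a = 0
for_body:
        mov  ax, si
        call f               ; f(a)
        inc  si
        cmp  si, 4           ; a < b
        jl   for_body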

A processor can do this on the assembly level, too. But it has the same restriction: it can only rearrange instructions if there are no dependencies between them. And that’s where our little XOR broke. Because it uses the same register as both source and destination, it creates a choke point: the XOR can’t be moved ahead of whatever last wrote to AX, and the next few instructions that read AX all have to wait in line behind it. (I don’t know for sure, but I’ve heard that modern processors have a special case just for self-XOR, so it lives again.)
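
A tiny example of the chain, on a hypothetical out-of-order core that doesn’t have the special case (16-bit register names to match the rest of the post):

        mul  bx              ; slow, and it leaves its result in AX (and DX)
        xor  ax, ax          ; reads AX as a source, so it must wait for the MUL
        add  ax, cx          ; depends on the XOR, so the stall ripples forward

With MOV AX, 0 in the middle instead, nothing reads the old value of AX, so register renaming can hand the ADD a fresh zero without waiting for the multiply to finish.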

Ending the beginning

This has been a bit of a rambling post, I’ll admit. But the key point is this: optimization is a moving target. What worked long ago might not work today. (There’s a reason Gentoo is so fast, and it’s because of this very thing.) As processors advance, assembly programmers need new tricks. And we’re not the only ones. Compilers produce virtually all the assembly that runs these days, but somebody has to teach them the new optimization techniques, too.
