asm – Prose Poetry Code

Assembly: the old set

The x86 architecture, at the assembly level, is quite a bit more complex than our old friend, the 6502. Part of that comes from the fact that it does so much more: it has more registers, a bigger address space, and just more functionality altogether. So it has to be more complex. Maybe not as much as it is, but then it also had to maintain some backward compatibility with Intel’s older processors, and compatibility always complicates matters.

But we won’t judge it in this post. Instead, we’ll look at it as impartially as possible. If you’ve read the earlier posts in this series on assembly language, you know most of the conventions, so I won’t repeat them here. Like with the 6502,, I’ll look at the different instructions in groups based on their function, and I’m ignoring a lot of those that don’t have much use (like the BCD arithmetic group) or are specifically made for protected mode. What you get is the “core” set at the 286 level.

The x86 instruction set

Even stripped down to this essence, we’re looking at around 100 instructions, about double the 6502’s set. But most of these have obvious meanings, so I won’t have to dwell on them. Specifics will mostly come when we need them.

Also, since I’m assuming you’re using NASM (and an older version, at that), the instruction format in the examples I use will also be for that assembler. That means a couple of things:

The destination always comes first. So, to move the contents of the DX register to AX, you say mov ax, dx.
Square brackets indicate indirection. Thus, mov ax, bx moves the contents of BX into AX, while mov ax, [bx] moves the value in the memory location pointed to by BX.
NASM requires size suffixes on certain instructions. These are mostly the “string” instructions, such as MOVS, which you’d have to write as MOVSB or MOVSW, depending on the width of the data.

Flags

The x86, like most processors, comes with a number of flags that indicate internal state. And, as with the 6502, you can use these to control the flow of your own programs. Those that concern us the most are the carry, zero, sign, overflow, direction, and interrupt flags. The first three should be pretty obvious, even if you didn’t read the first parts of the series. The interrupt flag is likewise mostly self-explanatory. “Direction” is used for string instructions, which we’ll see later. And the overflow flag indicates that the last operation caused signed overflow based on two’s-complement arithmetic, as in this example:

overflow:
    mov al, 127
    add al, 2
; overflow flag is now set because 127 + 2 = 129,
; which overflows a signed byte (129 ~~ -127)
    add al, 2
; now overflow is clear, because -127 + 2 = -125

The carry, direction, and interrupt flags can be directly altered through code. The CLC, CLD, and CLI instructions clear them, while STC, STD, and STI set them. CMC complements the carry flag, flipping it to the opposite value. You can also use PUSHF to put whole register onto the stack, or POPF to load the flags from there; these instructions weren’t on the original 8086, however.

MOV and addressing

The MOV instruction is the workhorse of x86. It covers loads, stores, and copying between registers, and later extensions have made it Turing-complete in its own right. But in its original form, it wasn’t quite that bad. Plus, it allows all the different addressing modes, so it’s a good illustration of them.

The function of MOV is simple: copy the data in the source to the destination. Despite being short for “move”, it doesn’t do anything to the source data. The source, as for most x86 instructions can be a register, memory location, or an “immediate” value, and the destination can be memory or a register. The only general rule is that you can’t go directly from memory to memory in the same instruction. (There are, of course, exceptions.)

Moving registers (mov dx, ax) and immediate values (mov ah, 04ch) is easy enough, and it needs no further explanation. For memory, things get hairier. You’ve got a few options:

Direct address: a 16-bit value (or a label, in assembly code) indicating a memory location, such as mov ax, [1234] or mov dh, [data].
Register indirect: three registers, BX, SI, and DI, can be used as pointers within a segment: mov al, [bx] loads AL with the byte at location DS:BX.
Indexed: the same registers above, now with BP included, but with a displacement value added: mov al, [bx+4]. (BP is relative to the stack segment, though.)
Base indexed: either BX or BP plus either SI or DI, with an optional displacement: mov [bx+si+2], dx. (Again, BP uses the stack segment, all others the data segment.)

So MOV can do all of that, and that’s before it got expanded with 32-bit mode. Whew. If you don’t like clobbering the old value at the destination, you can use XCHG instead; it works the same way. (Interestingly, the x86 do-nothing instruction NOP is encoded as xchg ax, ax, which really does do nothing.)

Arithmetic and logic

After all the moving around, computing on values is the next most important task. We’ve got most of the usual suspects here: addition (ADD or the add-with-carry ADC); subtraction (SUB or SBB); logical AND, OR, NOT, and XOR (those are their mnemonics, too). There’s also a two’s-complement negation (NEG) and simple increment/decrement (INC, DEC) operations. These all do about what you’d expect, and they follow the same addressing rules as MOV above.

We can shift and rotate bits, as well. For shifting, SHL goes to the left, while SHR or SAR moves to the right; the difference is that SHR always shifts 0 into the most significant bit, while SAR repeats the bit that was already there. (Shifting left, as you probably know, is a quick and dirty way of multiplying by 2.)

Rotating moves the bits that get shifted “off the edge” back around to the other side of the byte or word, but it can optionally use the carry flag as an “extra” bit, so we have four possible permutations: ROL, ROR, RCL, RCR. The “rotate with carry” instructions effectively place the carry flag to the left of the most significant bit. Note that both shifting and rotating can take an immediate value for the number of times to shift, or they can use the value in CL.

A couple of instructions perform sign-extension. CBW takes the top bit in AL and duplicates it throughout AH. CWD works the same way, cloning the high bit of AX into every bit of DX. These are mainly used for signed arithmetic, and the registers they work on are fixed.

Unlike the 6502, the x86 has built-in instructions for multiplication and division. Unlike modern systems, the 16-bit versions are a bit limited. DIV divides either AX by a byte or DX:AX by a word. This implied register (or pair) also holds the result: quotient in AL or AX, remainder in AH or DX. MUL goes the other way, multiplying AL by a byte or AX by a word, and storing the result in AX or DX:AX. Those are more than a little restrictive, and they’re unsigned by design, so we also have IMUL and IDIV. These are for signed integers, and they let you use an immediate value instead: imul ax, -42.

Two other useful instructions can go here. CMP subtracts its source value from its destination and sets the flags accordingly, but throws away the result. TEST is similar, logical-ANDing its operands together for flag-changing purposes only. Both of these are mainly used for conditional flow control, as we’ll see below.

Flow control

We can move data around, we can operate on it, but we also need to be able to change the execution of a program based on the results of those operations. As you’ll recall, the 6502 did this with branching instructions. The x86 uses the same mechanism, but it calls them jumps instead. They come in two forms: conditional and unconditional. The unconditional one, JMP, simply causes the processor to pick up and move to a new location, and it can be anywhere in memory. Conditional jumps are only taken if certain conditions are met, and they take the form Jcc, where cc is a condition code. Those are:

C and NC, for “carry” and “no carry”, depending on the carry flag’s state.
Z and NZ, “zero” and “not zero”, based on the zero flag.
O and NO, for “overflow” and “no overflow”; as above, but for the overflow flag.
S and NS, “sign” and “no sign”, based on the sign flag; “sign” implies “negative”.
B and NB, “below” and “not below”, synonyms for C and NC.
A and NA, “above” and “not above”; “above” means neither the carry nor zero flag is set.
AE, BE, NAE, NBE; the same as the last two pairs, but add “or equal”.
L and NL, “less than” and “not less than”; “less than” requires either the sign or overflow flag set, but not both.
LE and NLE, “or equal” versions of the above.
G, GE, NG, NGE, “greater than”, etc., for the opposites of the previous four.
CXZ and NCXZ, “if CX is/is not zero”, usually used for loops.

These can be confusing, so here are a couple of examples:

mov ax, [value1]
mov dx, [value2]

a_loop:
add ax, dx

; jump if ax > 127,
; otherwise try again
jo end
jmp a_loop

end:
; do something else

mov ax, [bp+4]
cmp ax, 0
; if ax == 0...
jz iszero

; else if ax > 0...
jg ispos

; else if ax < 0...
jl isneg

; or if something went wrong
jmp error

CALL calls a subroutine, pushing a return address onto the stack beforehand. RET is the counterpart for returning from one. INT and IRET work the same way, but for interrupts rather than subroutines; INT doesn’t take an address, but an interrupt number, as we have seen.

A special LOOP instruction allows you to easily create, well, loops. It uses CX as an implicit counter, stopping when it reaches zero. You might use it like this:

; clear the screen
mov ax, 0b800h
mov es, ax
xor di, di  ; quicker clear to 0
mov cx, 80 * 25
mov dl, 20h ; ASCII code for space

nextchar:
mov [es:di],dl
add di,2    ; skip video attribute
loop nextchar

Two variant instructions LOOPZ and LOOPNZ, require that the zero flag be set or cleared, respectively, or they end the loop prematurely.

The stack

All x86 programs have use of a stack, and it’s not limited to 256 bytes like its 6502 cousin. Accessing the stack can’t be done directly in 16-bit land, as there’s no way to address relative to SP, but copying its value into BP and accessing from that is common. But even better are PUSH and POP, which take care of everything for you. They can be used on any register—except that you can’t pop into CS—and even memory; PUSH can also put an immediate value on the top of the stack, though not on the original 8086.

The stack grows “downward”. When a value is pushed onto it, that value is moved into the position pointed to by SP, and SP is decremented by 2. Popping does the opposite. Effectively, the instructions work like this:

do_push:
    mov [sp], value
    sub sp, 2

do_pop:
    mov value, [sp]
    add sp, 2

PUSHA and POPA are shortcuts for pushing all of the main 8 registers, helpful when you need to save state before starting a big subroutine.

Strings

The x86 can’t really work on strings, but it can work with arrays of bytes or 16-bit words using simple instructions. This is done through five instructions that operate on either bytes or words; NASM requires a suffixed “B” or “W” for these, but I’ll refer to them with a general “x”.

In all these cases, the “source” address is, by default, DS:SI, and the destination is ES:DI. Also, because these instructions were meant to be done in blocks, they can take a special REP prefix. This works a bit like LOOP, in that it stops when CX reaches 0. (REPE and REPNE are also available, and they work like LOOPZ and LOOPNZ.) After the instruction performs its operation, it increments both SI and DI by 1 for the byte version, 2 for the word version. This is where the direction flag comes into play, however: if it’s set, these instructions instead subtract 1 or 2 from those registers, effectively performing the operation in reverse.

LODSx and STOSx load and store, respectively. LODSx puts the value at [DS:SI] into AL or AX, while STOSx moves AL or AX into memory at [ES:DI]. Either way, they then change the index register (SI or DI) as described above. REP doesn’t really make sense with these, but they can work in hand-rolled loops.

MOVSx is a little more general, and it’s one of the few memory-to-memory operations available on the early x86. It copies a byte or word at [DS:SI] to [ES:DI], then changes both index registers based on the data width (1 for byte, 2 for word) and the direction flag (up for cleared, down for set). It’s all but made for block copying, as in this example:

; assumes SI and DI point to appropriate memory areas
; and CX holds a count of bytes to move
memcpy:
rep movsb
ret

CMPSx compares bytes or words, setting the flags accordingly. It could be used to implement a string comparison function like so:

; assumes SI and DI point where they should,
; and CX contains the max number of characters to test
; returns a value in AL:
; -1 if the "source" (1st) string is less,
; +1 if it's greater,
; 0 if they're equal
strncmp:
xor al, al
repe cmpsb
jg greater
dec al  ; sets to FFh, or -1
jmp exit

greater:
inc al  ; sets to 01h

ret

Finally, SCASx sets the flags based on a comparison between AL (for bytes) or AX (for words) and the value at [ES:DI]. The mnemonic stands for “scan string”, and that’s what it can do:

; assumes DI points to a string,
; CX holds the length of the string,
; and AL holds the character to search for
; returns in AX:
; position of found character, or -1 if not found

contains:
mov dx, cx
repne scasb
jncxz found

; character not found, since we ran out of string
mov ax, 0ffffh
jmp end

found:
; CX now holds the number of characters from string end,
; but we saved the original length in DX
; thus, the position is DX - CX - 1
inc cx
sub dx, cx
mov ax, dx

end:
ret

Input and output

Input and output send bytes or words between registers and the I/O space. This is a special set of 64K (65,536) memory locations, though only the first 1,024 were used on early PCs. Using them involves the IN and OUT instructions. These are fairly restrictive, in that they imply the AL or AX register for the data and DX for the I/O port: in ax, dx or out dx, al. However, for the “low” ports, those with addresses up to 0xff, you can instead use an immediate version: in al, 40h.

The 286 added in string I/O with the INSx and OUTSx instructions. These work similarly to LODSx and STOSx above, but the data is either coming from or going to an I/O port instead of main memory. (This was a bit faster than doing a manual loop, and some early peripherals actually couldn’t handle that!) The port number is always in DX, while [DX:SI or [ES:DI] is the data pointer, as above.

Enough for now

And we’re finally done. Next time, we can start programming this thing, but this post is already way too long, so I’ll see you later.

Assembly: the standard machine

This crazy thing we call the PC has been around for over 30 years now, and it’s quite far from its original roots as IBM’s business-oriented machine. The numbers involved are mind-boggling in their differences. The first PC ran at 4.77 MHz, while today’s can top 4 GHz. (And, thanks to parallelization, cache, and all those other advances we’ve seen, that adds up to far more than a 1000x increase in computing power.) Intel’s 8086 processor could address 1 MB of memory; modern computers have multiple megabytes inside them. Back then, the primary medium for storage was the floppy disk, and the PC’s disks weighed in at a hefty 360 KB; nowadays, you can find USB flash drives capable of holding more than a million times that.

And people still managed to get things done with machines that are easily outdone by our watches, our refrigerators, our thermostats. A lot of people today can’t even comprehend that, including many—I know quite a few—who used to get by without computers at all! But histories of the PC are everywhere, and we have a specific reason for our trek down memory lane. So let’s take a look at these ancient contraptions from the assembly programmer’s point of view.

The ideal PC

Since we’re using an emulator for our programming, we don’t need to go into a lot of hardware specifics. It doesn’t particularly matter for the purposes of this series that the 8088 had only an 8-bit bus, or that the 486SX lacked a built-in floating-point unit. (I had one of those on my first PC. It sucked.) From our level, those details are mostly irrelevant. What does matter is, for lack of a better term, the hardware interface.

For this series, I’m assuming an “ideal” PC loosely based on real ones. It’s not an original PC, nor is it an XT, AT, or something later. It’s really PC compatible, but nothing more. It roughly conforms to the specs, though some things might be different. That’s intentional, and it’s for ease of explanation. (It also means I can reuse knowledge from all those old DOS tutorials.)

Our idealized machine will use a 286-level processor running in real mode. That lets us focus on a simplified architecture (no protected mode, no 32-bit stuff) that isn’t too far off from what people were using in the late 80s. It does mean limitations, but we’ll work around them.

We’ll also assume that we have access to VGA graphics. That was not the case for the early 286-based PCs; they had to make do with EGA and its 16 colors, at best. Again, this makes things easier, and you’re free to look up the “real thing” if you wish.

Finally, we’re using FreeDOS in the v86 emulator until further notice. It’s not exactly Microsoft’s old DOS from the days of yore, but it’s close enough for this series. Basically, assume it has the features a DOS is supposed to have.

Now that that’s out of the way, let’s see what we have to work with.

Memory

We effectively have a 286 processor, which we looked at last week. Since we’re ignoring protected mode entirely, that means we have access to a total of 1 MB of memory. (Note that this is a real megabyte, not the million bytes that hard drive manufacturers would have you believe.) That memory, however, is divided up into various portions.

The first 1 KB is reserved by the CPU as the interrupt vector table. The x86 allows 256 different interrupts, whether caused by hardware or software. This table, then, holds 256 4-byte addresses, each one a pointer to an interrupt handler. For example, the core DOS interrupt, 21h, causes execution to jump to the interrupt routine whose address is stored at 0x21 * 4 = 0x84, or 0000:0084.
The next 256 bytes, starting at 0040:0000, are the BIOS data area. (I’ll explain the BIOS in a bit.) Much of this space has special meaning for the BIOS, so it’s a bit like the zero page on most 6502 computers.
DOS, the BIOS, and the PC all use bits of the next 256 bytes, starting at 0050:0000.
DOS effectively begins loading at 0060:0000, though how much memory it uses from here depends on a wide variety of factors.

Whatever is left after this is available for program use, up to the PC’s RAM limit. Everything else in the address space, starting at A000:0000, is given to the various types of adapters supported by the architecture. Even on later systems with expanded memory capabilities, this conventional memory limit of 640 KB remained. Yes, the original PC was limited to 640K of RAM—leading to the famous quote Bill Gates may or may not have uttered—but this wasn’t as stringent as it seems; the first IBM models only had 256K, and that was still four times what the 6502 could understand.

The remaining 384 KB of the old PC’s range was for the display, the BIOS, and any expansions that may have been around:

A 64K block starting at a000:0000 is the video buffer for color graphics modes, including the VGA-alike our idealized PC uses. (Modern graphics cards use high-speed direct memory accesses, but this area is still there for “fallback” mode.)
The next 32K, from b000:0000, is used by monochrome video adapters for text. Some systems that neither had nor supported monochrome used this space as extra memory.
Another 32K block, starting with b800:0000 or b000:8000 (they’re the same thing), is a very big deal. It’s where the color text-mode memory lies, so manipulating this area will directly affect the contents of the screen. It’s a linear array of words: the low byte holds the character, the high byte its foreground and background colors. We haven’t seen the last of this space.
Starting at c000:0000, things get murkier. Usually, the 32K starting here is where the video BIOS lives. After that is the hard disk BIOS, if there is any. Everything from d000:0000 to f000:0000 was intended for expansion; add-ons were expected to set themselves up at a particular place and define how much memory they were going to use. Of course, conflicts were all too easy.
The last big block of memory begins at f000:0000. It’s meant for the BIOS, though the original one only used the last 8K (f000:e000 and up). By the way, the processor starts at f000:fff0, which is why the first BIOS went at the end of the block.
Finally, the last segment, ffff:xxxx, wraps around to zero on an 8086, but it can access a little bit of higher memory on the 286 and later. That’s the high memory area that DOS users learned so much about back in the day. It won’t really concern us much, but it’s good to know it exists, especially if you’re reading older literature that refers to it.

The BIOS

BIOS (Basic Input/Output System) is the name given to the PC’s internal code, as well as the programmer-accessible interfaces that code provides. The BIOS is mainly responsible for taking a bare x86 processor and turning it into a usable system; in that, it’s like the system ROMs of older computers. Different BIOSes over the years have offered numerous features—some of them can now run entire operating systems, while Coreboot is an OS—but they all also contain the same basic functions that you had 30 years ago.

Rather than using a vector table, like that of a Commodore 64, the PC BIOS provides access to its API through interrupts. By placing certain values in registers before invoking the interrupt, you can select different functions and pass arguments to them. Return values would likewise be placed in registers, sometimes the same ones. As an example, BIOS interrupt 10h controls the video system, and one of its available functions is setting the cursor position. To do that, we need to load the number 2 into AH, among other things, as in this snippet:

curs_set:
    mov ah, 2   ; function number
    mov bh, 0   ; "page number": 0 is the "main" page

    ; these would probably be previously defined
    mov dh, row
    mov dl, column

    int 10h

We’ll be seeing the BIOS later on, but we definitely won’t meet more than a fraction of its interface.

Peripherals and I/O

The PC was designed with peripherals in mind. Floppy disks, hard disks, printers, and many other things came to be attached to these computers. Usually, they came with drivers, and the BIOS could talk to some of them, but programmers occasionally needed to access them directly. Nor were these methods mutually exclusive. After all, it was entirely common to bypass the BIOS and write directly to video memory, because that was far faster.

Under our assumptions, we’ve got a VGA graphics adapter. We’ve got our keyboard and mouse, which the emulator…well, emulates. The FreeDOS image we’ll be using includes one floppy disk, but that’s it. Other than that, we have the PC speaker and a few other odds and ends. Not much, but plenty to explore. But that’s for later.

The operating system

Today, we use Windows or Linux or OSX or whatever, and we only really think about the operating system when it won’t work. But almost every home PC of the late 80s ran DOS first, and that’s what people learned. FreeDOS is like an extension of that, an upgraded version that can run on modern machinery. It uses the same interfaces for “legacy” code, however, and that’s what we want. And v86 includes a preset FreeDOS disk image for us, so that’s a bonus.

DOS, as derided as it is today, actually packed a lot of functionality into its tiny innards. Sure, it didn’t have true multitasking, and you needed hacks to use more than 1 MB of memory. Some of its internal functions were slow enough that it was better to write to the hardware directly. And it even left some things up to the BIOS. But it was far better than nothing.

I could make this a “write a 16-bit OS” series, but I won’t. Because I don’t have to. That’s one of the few good things about DOS: it’s so simple that you can almost ignore it, but you can still use it when it’s what you need. So that’s what I’ll be doing. I mean, it worked for fifteen years, right?

And so on

There’s lots more I could say about the PC platform, whether the real thing or our nostalgic remembrance. But this post is already getting too long. Everything else can come up when it comes up, because now we have to move on to the whole reason for this series: the x86 assembly language. Next time, you’ll see what it looked like (not what it looks like now, though, because it’s totally different). Maybe we can even figure out why so many people hated it with such passion. Oh, and we’ll have some actual code, too.

Assembly: the architecture

The x86 gets a lot of hate for its architecture. In modern days, it’s really more out of habit than any technical reason, but the 16-bit variant really deserved some of its bad reputation. Other parts, in my opinion, were overstated. Hyperbole then became accepted wisdom, and positive reinforcement only made it grow.

But I’m not here to apologize for the x86. No, this post has a different purpose. It’s an overview of the x86 architecture from the point of view of an assembly programmer circa 1991. Although the 386 had been out for a few years, older processors were still around, and not too much was making full use of the 32-bit extensions that the 386 brought. DOS, particularly, remained 16-bit, though there were extensions to address more memory. For the most part, however, we’ll stick to “real mode”, like it was in the olden days.

The CPU

An x86 processor of this vintage was a lot less complex than today’s Intel i7 or AMD A10, and not just because it didn’t include integrated graphics, megabytes of on-die cache, power management functions, and so on. Later lines have also added lots of assembly-level goodness, like MMX and SSE. They’re 64-bit now, and that required the addition of “long mode”.

But let’s ignore all that and look at the “core” of the x86. There’s not that much to it, really. In the original conception, you have about a dozen programmer-accessible registers, all of which started out 16 bits wide, but exactly one of these is truly general-purpose. The registers can be divided into a few categories, and we’ll take each of them in turn.

General registers

These are the “big four” of the x86: AX, BX, CX, and DX. As I said last time, all four can also be accessed as a pair of byte-sized registers. The high bytes are identified by the first letter of the register name followed by H, while the low bytes use L. So we have AL or BH or whatever. It doesn’t actually increase the number of registers we have, but sometimes the savings from loading only a single byte can add up. Remember, older computers had less memory, so they had to use it more wisely.

Each of the four 16-bit general registers is used in its own special way. AX is the accumulator (like the A register from the 6502), and it’s usually the best for arithmetic; some instructions, like the multiplication instruction MUL, require it. BX is used as a “base” for a few addressing-type instructions. CX is semi-reserved as a loop counter. And DX is sometimes taken as an “extension” of AX, creating a kind of 32-bit register referred to as DX:AX.

Of course, if you’re not using the instructions that work on specific registers, you can do what you like with these. Unlike the 6502, where almost everything involved a memory access, x86 does let you work register-to-register. (On the other hand, it doesn’t have cheap access to the zero page, so there.) You can add AX to BX, for instance, and no one will care.

Pointer registers

The other four main registers all have something to do with pointers and addressing. You can use them as scratch space for arithmetic, but a lot of instructions assume they hold addresses. Unlike the general registers, all these are only 16-bit. (Modern systems do give you special access to the low byte of them, however.)

SP is the big one out of this group: the stack pointer. Stacks are a lot more important on the x86 than the 6502, mainly because that’s where you put your “extra” data that won’t fit in registers. But programmers usually don’t manipulate SP directly. They instead pop and push (note the terminology change from 6502), and those instructions change SP as needed. BP is an extra pointer register mostly used by languages like C to access stack “frames”, but assembly programmers can turn it into a general pointer.

The other two come in a pair: SI and DI. These stand for “source index” and “destination index”, respectively, and the processor uses them for certain load and store operations. Quite a few of the DOS and BIOS APIs expect them to hold pointers to input and output parameters. And on an early x86, they were the best option for indirect addressing, a bit like the 6502’s X and Y registers.

System registers

The instruction pointer, IP, controls the execution of code. It’s not directly accessible by programmers; instead, you change it through branching (jumping, in x86 parlance) and subroutine calls. In other words, you can mostly act like it’s not there.

The register that holds the flags, usually called FLAGS when it needs a name, also can’t directly be read from or written into. You can push it to the stack, however, then manipulate it from there, but the main three flags (carry, direction, and interrupt) have special instructions to set and clear them, similar to the 6502.

While the x86 has quite a few more flags than the 6502, most of them aren’t too important unless you’re delving deep into an operating system’s internals. The main ones to know about are the carry, zero, sign, direction, overflow, and interrupt flags. Most of them should be self-explanatory, while “overflow” works in a similar fashion to its 6502 counterpart. The direction flag is mostly used for string-handling instructions, which we’ll see in a later post.

One more register deserves a brief mention here. On the 286, it’s called the MSW, or “machine status word”. After that, it gets the official designation CR0. It’s used to control internal aspects of the processor, such as switching between real and protected modes or emulating a floating-point unit. I can’t think of a case where this series would use it, but now you know it’s there.

Segment registers

And then we come to the bane of many an assembly programmer, at least those of a generation or two ago: the segment registers. We’ll look at the x86 memory model in a moment; for now, just think of segments as something like overlapping banks of memory.

We’ve got four segment registers, all 16 bits wide even in 32-bit mode, for the code, data, stack, and extra segments. Their mnemonic names, conveniently enough, are initialisms: CS, DS, SS, and ES. CS points to the segment where execution is occurring; you can’t change it except with a “far” call, but you can read from it. SS holds the segment address of the stack, but you probably figured that one out already. DS is the default for reading and writing memory, while ES, as its name suggests, is for whatever you like.

Segment registers are weird. You can move values to and from them (except into CS, as I said), but you can’t operate on them. What you can do, however, is use them to “override” an address. For example, loading a value from memory uses DS as its base, but you can make it use ES instead: mov ax, [es:di] loads the value pointed to by DI, but in the ES segment.

Memory model

And that leads us to the x86 memory model. It’s a bit convoluted, since the original 8086 was designed as a 16-bit system that could address 1 MB of memory. Something had to give, but Intel took a…nonstandard approach.

Every address on the x86 has two parts: segment and offset. (This is true even on today’s processors, but 64-bit mode is hardwired to treat all segments as starting at address 0.) In real mode, as with an older x86 running DOS, an actual memory address can be obtained by shifting the segment 4 bits to the left and adding the offset. Or, to put it in code: address = (segment << 4) + offset. Each segment, then, can address a 64K block of memory in a fashion much like banking in the 8-bit world.

The difference between one segment and the next is only 16 bytes, thanks to the 4-bit shift. That means that segments will overlap. The addresses b000:8123 and b800:0173, for example, refer to the same memory location: 0xb8123. In practice, this doesn’t matter too much; the segment portion is mostly used as a base address, while the offset is, well, an offset.

In protected mode, throw those last paragraphs out. Segments instead are indexes into a lookup table, creating a virtual memory system that essentially went unused until its mercy-killing by AMD when they brought out the x86-64. (The segment-based virtual memory scheme of protected mode, interesting though it seemed, was basically an exercise in the knapsack problem.) We won’t be worrying about protected mode much, though, so let’s move on.

Back to real mode, the world of DOS and early x86s. A little bit of arithmetic shows that the segment:offset addressing method allows access to 1 MB of memory, more or less. 0000:0000 is, of course, address 0, and it’s the lowest possible value. The highest possible is ffff:ffff, and that presents a problem. Add it up, and you get 0x100fef. On old systems, this simply wrapped around the 1 MB “barrier”, to 0x00fef. (Once memory sizes expanded to multiple megabytes, it no longer overflowed, but some programs relied on that behavior, so a hardware hack was put into place. It’s called the A20 gate, and it was originally put in the keyboard controller, of all places. But I digress.)

Input/output

Also available to the x86 are the I/O ports. These are accessed using the IN and OUT instructions, a byte, word, or (in 32-bit mode) double-word at a time. They function like their own little address space, separate from main memory. The x86 architecture itself doesn’t really define which ports do what. That’s left to the PC platform—which will be the subject of the next post.

Modern operating systems also allow memory-mapped I/O access, but we’ll leave that alone for the time being. It’s far more useful when you go beyond the bounds of the 16-bit world.

Interrupts

Like the 6502, the x86 has support for interrupting the processor’s normal flow, but it goes about it in a different way. An interrupt can be caused by hardware; the old term IRQ, “interrupt request”, referred to this. But software can also directly invoke interrupts. As we saw last week, that’s how DOS implemented its API.

In real mode, the result is the same either way. The processor jumps to a specific memory location and executes a bit of code, then returns to what it was doing like nothing happened. We’ll see the details later on, but that’s the gist of it.

To be continued

So I’ve rambled on long enough for this post. Next up will be a look at the PC platform itself, at least as it stood a quarter century ago. That’ll be a look into deep history for many, but the choices made then still affect us today. Following that will be the dive into the deep end of “old-school” x86 assembly.

Assembly: the precursor

Before Windows was a thing, there was DOS. Before today’s 64-bit processors, or even their 32-bit predecessors, there was the 16-bit world. In a way, 16-bit DOS is a bridge between the olden days of 8-bit microprocessors like the 6502 and the modern behemoths of computing, like the A10 in the PC I’m writing this on, or the Core i5 in the next room.

A while back, I started writing a short series of posts about assembly language. I chose the 6502, and one of the reasons why was the availability of an online assembler and emulator. Very recently, a new challenger has come into this field, Virtual x86. Even better, it comes with a ready-to-go FreeDOS image with the NASM assembler pre-installed. Perfect.

So, I’m thinking about reviving the assembly language series, moving to the (slightly) more modern architecture of 16-bit x86 and DOS. It’s a little more relevant than the 6502, as well as much more forgiving, but the fundamentals of assembly are still there: speed, size, power. And, since 16-bit code doesn’t run at all on 64-bit x86 CPUs, we don’t have to worry as much about bad habits carrying over. The use of DOS (FreeDOS, specifically) helps, too, since essentially nothing uses it these days. Thus, we can focus on the code in the abstract, rather than getting bogged down in platform-specific details, as we’d have to do if we looked at “modern” x86.

A free sample

Assembly in this old-school fashion is fairly simple, though not quite as easy as the venerable 6502. Later on, we can delve into the intricacies of addressing modes and segment overrides and whatnot. For now, we’ll look at the 16-bit, real-mode, DOS version of everyone’s first program.

org 100h

    mov dx, data
    mov ah, 09h
    int 21h
    mov ah, 04ch
    int 21h

data:
    db 'Hello, World!$'

All this little snippet does is print the classic string to the screen and then exit, but it gives you a good idea of the structure of x86 assembly using what passes for the DOS API. (Oh, by the way, I copied the code from Virtual x86’s FreeDOS image, but I don’t see any other way you could write it.)

Here are the highlights:

org 100h defines the program origin. For the simplest DOS programs (COM files), this is always 100h, or 0x100. COM files are limited to a single 64K segment, and the first 256 bytes are reserved for the operating system, to hold command-line arguments and things like that.
The 16-bit variety of x86 has an even dozen programmer-accessible registers, but only four of these are anywhere close to general-purpose. These are AX, BX, CX, and DX, and they’re all 16 bits wide. However, you can also use them as byte-sized registers. AH is the high byte of AX, AL the low byte, and so on with BH, BL, CH, CL, DH, and DL. Sometimes, that’s easier than dealing with a full word at a time.
mov is the general load/store instruction for x86. It’s very versatile; in fact, it’s Turing-complete. Oh, and some would say it’s backwards: the first argument is the destination, the second the source. That’s just the way x86 does things (unless you’re one of those weirdos using the GNU assembler). You get used to it.
int, short for “interrupt”, is a programmatic way of invoking processor interrupts. The x86 architecture allows up to 256 of these, though the first 16 are for the CPU itself, and the next are taken up by the BIOS. DOS uses a few of its own for its API. Interrupt 0x21 (21h) is the main one.
Since there are only 256 possible interrupts and far more useful operations, the OS needs some way of subdividing them. For DOS, that’s what AH is for. A “function code” is stored in that register to specify which API function you’re calling. The other registers hold arguments or pointers to them.
Function 0x09 (09h) of interrupt 0x21 writes a string to the console. The string’s address is stored in DX (with some segment trickery we’ll discuss in a later post), and it must end with a dollar sign ($). Why? I don’t know.
Function 0x4c (04ch) exits the program. AL can hold a return code. Like on modern operating systems, 0 is “success”, while anything else indicates failure.
db isn’t an actual assembly instruction. Much like in 6502-land, it defines a sequence of literal bytes. In this case, that’s a string; the assembler knows how to convert this to an ASCII sequence. (“Where’s Unicode?” you might be wondering. Remember that DOS is halfway to retirement age. Unicode wasn’t even invented before DOS was obsolete.)

Going from here

If you like, you can run the FreeDOS image on Virtual x86. It comes with both source and executable for the above snippet, and the original source file even includes compilation directions. And, of course, you can play around with everything else the site offers. Meanwhile, I’ll be working on the next step in this journey.

Assembly: optimization in the past and present

In this post, I won’t be discussing assembly language in any depth. Rather, I want to focus on one of the main reasons to use assembly: optimization. Actually, it might be the main reason today, because there’s not much need for assembly coding these days; it’s only when we want something to be as fast as possible that it comes into play.

Also, I’m moving away from the 6502 for this one, instead using the x86 architecture for my examples. Why? Because x86 is still the leading processor family for desktops, and so much has been written about it over the decades. There’s a lot of optimization info out there, far more than for just about any other architecture. Yes, ARM is growing, especially in the lower-end part of the market where assembly can still be very useful, but ARM—due to its very nature—is so fragmented that it’s hard to give concrete examples. Also, because x86 is so long-lived, we can trace the development of various processor features through its evolution. For that, though, we’ll need a bit of a history lesson.

Starting gates

The first microprocessors, from a bird’s-eye view, weren’t exactly complicated in function. They took an instruction from memory, decoded it, executed it, then moved on, sometimes jumping around the executable code space. Decoding each instruction and performing it were the only real hard parts. That’s one reason why RISC was so highly touted, as the smaller, more fundamental instruction set required less chip space for decoding and execution. (Look up something like VAX assembly for the opposite—CISC—end of the spectrum.)

Fetching the instruction was a simple load from memory, something every processor does as a matter of course. Decoding required a major portion of the circuit (the 6502 used a programmable array a bit like a modern FPGA, except that its function was fixed in the factory) but a comparatively small portion of processor time. Executing could require more memory accesses for the instruction’s operands, and it could take a lot of time, especially for something complex like multiplication—an ability the 6502, among others, lacks.

The birth of parallelism

But memory has always been slower than the processor itself. On all but the most complicated instructions, memory access takes the most time of any phase of execution. Thus, the prefetch queue was born. In a sense, this was the forerunner of today’s cache. Basically, it tried to predict the future by fetching the next few bytes from memory. That way, the large time constants required for RAM access could be amortized.

The problem with the prefetch queue, as with all cache, comes with branching. Branches, as we saw in earlier posts, are the key to making decisions in assembly language. But they force the processor to jump around, instead of following a linear path through memory. A branch, then, negates the advantage of the prefetch queue.

Processor designers (and assembly programmers) tried a few methods of working around the cost of branching. That’s why, at a certain time long ago, loop unrolling was considered a very important optimization technique. If you need to run a particular group of instructions, say, ten times, then it was a bit faster to “copy and paste” the assembly instructions than it was to set up a branching loop. It used more space, but the speed gains made up for that.

Another optimization trick was rearranging the branch instructions so that they would fail more often than not. For example, the x86 has a pair of instructions, JZ and JNZ, that branch if the zero flag is set or clear, respectively. (This is equivalent to the 6502’s BEQ and BNE, except that the x86 has more registers and instructions that can change it.) If you have a section of code that is run only when an argument is 0, and 0 rarely shows up, the naive way of writing it would be to skip over that section with a JNZ. But it might be faster (on these earlier processors, at least) to put the “only if 0” code at the end of the subroutine (or some other place that’s out of the way) and use JZ to branch to it when you need it.

In the pipeline

Eventually, the interests of speed caused a fundamental shift in the way processors were made. This was the birth of the pipeline, which opened a whole new world of possibilities, but also brought new optimization problems. The prefetch queue described above was one of the first visible effects of pipelining, but not the last.

The idea of a pipeline is that the processor’s main purpose, executing code, is broken into smaller tasks, each given over to a dedicated circuit. These can then work on their own, like stations on an assembly line. The instruction fetcher gets the next instruction, passes it on to the decoder, and so on. A well-oiled machine, in theory. In practice, it’s hard to get all the parts to work as one, and sometimes the pipeline would be stalled, waiting on one part to finish its work.

The beauty of the pipeline is that each stage is distinctly ordered. Once an instruction has been retrieved, the fetcher isn’t needed, so it can do something else. Specifically, it can fetch the next instruction. If the timing works out, it can fill up the prefetch queue and keep it topped off when it has the free time.

Fortune-telling

But branches are the wrenches in the works. Since they break the linear flow of instructions, they force the pipeline to stall. This is where the processor designers had to get smart. They had to find a way of predicting the future, and thus branch prediction was popularized.

When it works, branch prediction can completely negate the cost of a conditional jump. (Of course, when it fails, it stalls the whole thing, but that’s no worse than not predicting at all.) From an assembly language point of view, it means that we could mostly ditch the clever tricks like loop unrolling and condition negation. They would still have their uses, but they wouldn’t need to be quite so common. That’s a good thing, because the extra code size brought by loop unrolling affected another part of these newfangled processors: the cache.

Cache really came about as another way to make memory access faster. The method was a little roundabout, but it worked, and cache has stuck with us through today. It’s only getting bigger, too; most of the physical space on today’s processors is, in fact, cache memory. Many chips actually have more memory on the die than the 4 MB my first PC had in total.

The trick to cache comes from looking at how code accesses memory. As it turns out, there’s a pattern, and it’s called the principle of locality. Put simply, reading one memory location is a pretty good indicator that you’re going to be reading the ones right next to it. If we could just load all of those at once, then we’d save a ton of time. So that’s what they did. Instead of loading memory a byte or a word at a time, they started pulling them in 16 or more at once. And it was faster, but only while you stayed in the region of memory loaded into the cache.

Soon, cache became not fast enough, not big enough, and they had to find ways to fix both of these problems. And that’s where we are today. Modern x86 chips have three levels of cache. The first, L1, is the smallest, but also the fastest. L2 cache is a bit slower, but there’s more of it. And L3 is the slowest (though still faster than RAM), but big enough to hold the entirety of, say, Windows 95.

The present state of affairs

So now the optimization strategy once again focuses on space. Speed is mostly a non-factor, as the desktop x86 processors can execute most of their instructions in a single clock cycle, branch prediction saves us from the cost of jumps, and huge amounts of cache mean fewer of the horrifically slow memory accesses. But cache is limited, especially the ultra-fast L1. Every instruction counts, and we want them to all be as close together as possible. (Data is the same way, but we’ll ignore it for now.) Unrolling loops, for example, is a waste of valuable cache.

A few other optimizations have been lost along the way, made obsolete by the march of progress. One of my favorite examples is that of clearing a register. It’s a common need in assembly language, and the favored method of doing it early in the x86 days was by using the XOR instruction, e.g., XOR AX, AX. Exclusive-OR, when given the same value twice, always produces 0, and this method was quicker (and shorter) than loading an immediate 0 value (MOV AX, 0).

The self-XOR trick was undone by a combination of factors. The first was register renaming, which essentially gave the processor a bunch of “virtual” registers that it could use as internal scratch space, moving data to and from the “real” ones as needed. The second was out-of-order execution, and that takes a little more explaining.

If you’ve ever looked at the optimized assembly output of your favorite high-level compiler (and you should, at least once), then you might have noticed that things aren’t always where you put them. Every language allows some form of rearranging, as long as the compiler can prove that it won’t affect the outcome of the program. For example, take a look at this C snippet:

int a, d;
int b = 4;
int c = 5;
for (a = 0; a < b; a++) {
    f(a);
}
d = b * c;

The final statement, assigning the value 20 to d, can be moved before the loop, since there’s no way that loop can change the value of b or c; the only thing the loop changes, a, has nothing to do with any other variable. (We’re assuming more code than this, otherwise the compiler would replace b, c, and d all with the simple constant 20.)

A processor can do this on the assembly level, too. But it has the same restriction: it can only rearrange instructions if there are no dependencies between them. And that’s where our little XOR broke. Because it uses the same register for source and destination, it created a choke point. If the next few instructions read from the AX register, they had to wait. (I don’t know for sure, but I’ve heard that modern processors have a special case just for self-XOR, so it lives again.)

Ending the beginning

This has been a bit of a rambling post, I’ll admit. But the key point is this: optimization is a moving target. What might have worked long ago might not today. (There’s a reason Gentoo is so fast, and it’s because of this very thing.) As processors advance, assembly programmers need new tricks. And we’re not the only ones. Compilers produce virtually all assembly that runs these days, but somebody has to give them the knowledge of new optimization techniques.

Assembly: a little bit more

Well, I’m back. Instead of giving you more apologies for missing a couple of weeks of this exciting series (sarcasm alert!), let’s jump right back in and look at some more old-school assembly language. This week, we’ll get to know homebrewed 6502 versions of a couple of C standard library staples, and we can start talking about how you use data structures in assembly.

Memory and data

The simplest, dumbest (for the computer, not the programmer) way to treat data is as raw memory. The problem is, there’s not much you can do with it. You can initialize it, copy it around, and that’s about it. Everything else needs some structure. Copying in assembly language is pretty easy, though, even in 6502-land:

; Arguments:
; $F0-$F1: Source address
; $F2-$F3: Destination address
; Y register: Byte count
memcpy:
    lda ($F0), Y    ; 2 bytes, 5 cycles
    sta ($F2), Y    ; 2 bytes, 6 cycles
    dey             ; 1 byte, 2 cycles
    bne memcpy      ; 2 bytes, 2-3 cycles
    rts             ; 1 byte, 6 cycles

Yep, this is a stripped-down version of memcpy. It has its limitations—it can only copy a page of memory at a time, and it has no error checking—but it’s short and to the point. Note that, instead of a prose description of the subroutine’s arguments and return values and whatnot, I’m just putting that in the comments before the code. I trust that you can understand how to work with that.

Since the code is pretty self-explanatory, the comments for each line show the size and time taken by each instruction. A little bit of addition should show you that the whole subroutine is only 8 bytes; even on modern processors, the core of memcpy isn’t exactly huge.

The timing calculation is a little more complex, but it’s no less important on a slow, underpowered CPU like the 6502. In the case of our subroutine, it depends on how many bytes we’re copying. The core of the loop will take 13 cycles for each iteration. The branch instruction is 3 cycles when the branch is taken, 2 cycles when it’s missed. Altogether, copying n bytes takes 16n+5 cycles, a range of 21 to 4101. (A zero byte count is treated as 256.) In a modern computer, four thousand cycles would be a few microseconds at most. For the 6502, however, it’s more like a few milliseconds, but it’s hard to get faster than what we’ve got here.

Strings

The first way we can give structure to our data is with strings. Particularly, we’ll look at C-style strings, series of bytes terminated by a null value, hex $00. One of the first interesting operations is taking the string’s length—the C standard library’s strlen—and this is one implementation of it in 6502 assembly:

; Arguments:
; $F0-$F1: String address
; Returns:
; A: Length of null-terminated string
strlen:
    ldy #$00        ; 2 bytes, 2 cycles
    clv             ; 1 byte, 2 cycles
  slloop:
    lda ($F0), Y    ; 2 bytes, 5 cycles
    beq slend       ; 2 bytes, 2-3 cycles
    iny             ; 1 byte, 2 cycles
    bvc slloop      ; 2 bytes, 3 cycles
  slend:
    tya             ; 1 byte, 2 cycles
    rts             ; 1 byte, 6 cycles

All it does is count up through memory, starting at the pointed-to address, until it reaches a zero byte. When it’s done, it gives back the result in the accumulator. Now, this comes with an obvious restriction: our strings can’t be more than 255 bytes, or we get wraparound. For this example, that’s fine, but you need to watch out in real code. Of course, in modern processors, you’ll usually have at least a 32-bit register to work with, and there aren’t too many uses for a single string of a few billion bytes.

Our assembly version of strlen weighs in at 12 bytes. Timing-wise, it’s 12n+20 cycles for a string of length n, which isn’t too bad. The only real trickery is abusing the overflow flag to allow us an unconditional branch, since none of the instructions this subroutine uses will affect it. Using a simple JMP instruction is equivalent in both time and space, but it means we can’t relocate the code once it has been assembled.

Another common operation is comparing strings, so here’s our version of C’s strcmp:

; Arguments:
; $F0-$F1: First string
; $F2-$F3: Second string
; Returns comparison result in A:
; -1: First string is less than second
; 0: Strings are equal
; 1; First string is greater than second
strcmp:
    ldy #$00        ; 2 bytes, 2 cycles
  scload:
    lda ($F0), Y    ; 2 bytes, 5 cycles
    cmp ($F2), Y    ; 2 bytes, 5 cycles
    bne scdone      ; 2 bytes, 2-3 cycles
    iny             ; 1 byte, 2 cycles
    cmp #$00        ; 2 bytes, 2 cycles
    bne scload      ; 2 bytes, 2-3 cycles
    lda #$00        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles
  scdone:
    bcs scgrtr      ; 2 bytes, 2-3 cycles
    lda #$FF        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles
  scgrtr:
    lda #$01        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles

Like the its C namesake, our strcmp doesn’t care about alphabetical order, only the values of the bytes themselves. The subroutine uses just 24 bytes, though, so you can’t ask for too much. (Timing for this one is…nontrivial, so I’ll leave it to the more interested reader.)

Other structures

Arrays, in theory, would work almost like strings. Instead of looking for null bytes, you’d have an explicit count, more like newer C functions such as strncmp. On the 6502, the indirect indexed addressing mode (e.g., LDA ($F0), Y) we’ve used in every example so far is your main tool for this. Other architectures have their own variations, like the x86’s displacement mode.

More complex structures (like C structs or C++ classes), are tougher. This is where the assembly programmer needs a good understanding of how high-level compilers implement such things. Issues like layout, padding, and alignment come into play on modern computers, while the 6502 suffers from the slower speed of indirection.

Self-contained structures (those that won’t be interfacing with higher-level components) are really up to you. The most common layout is linear, with each member of the structure placed consecutively in memory. This way, you’re only working with basic offsets.

But there’s a problem with that, as newer systems don’t really like to access any old byte. Rather, they’ll pull in some number of bytes (always a power of 2: 2, 4, 8, 16, etc.) all at once. Unaligned memory accesses, such as loading a 32-bit value stored at memory location 0x01230001 (using x86-style hex notation) will be slower, because the processor will want to load two 32-bit values—0x01230000 and 0x01230004—and then it has to do a little bit of internal shuffling. Some processors won’t even go that far; they’ll give an error at the first sign of an unaligned access.

For both of these reasons, modern languages generate structures with padding. A C struct containing a byte and a 32-bit word (in that order), won’t take up the 5 bytes you’d expect. No, it’ll be at least 8, and a 64-bit system might even make it 16 bytes. It’s a conscious trade-off of size for speed, and it’s a fair trade in these present days of multi-gigabyte memory. It’s not even that bad on embedded systems, as they grow into the space occupied by PCs a generation ago.

Coming up

For now, I think I’m going to put this series on hold, as I’m not sure where I want it to go. I might move on to a bigger architecture, something like the x86 in its 16-bit days. Until then, back to your regularly scheduled programming posts.

Assembly: the first steps

(Editor’s note: I pretty much gave up on the formatting for this one. Short of changing to a new syntax highlighter, there’s not an awful lot I can do for it, so I just left it as is. I did add an extra apostrophe on one line to stop the thing from thinking it was reading an unclosed string. Sorry for any trouble you might have when reading.)

As I’ve been saying throughout this little series, assembly is the closest we programmers can get to bare metal. On older systems, it was all but necessary to forgo the benefits of a higher-level language, because the speed gains from using assembly outweighed the extra developer time needed to write it. Nowadays, of course, the pendulum has swung quite far in the opposite direction, and assembly is usually only used in those few places where it can produce massive speedups.

But we’re looking at the 6502, a processor that is ancient compared to those of today. And it didn’t have the luxury of high-level languages, except for BASIC, which wasn’t much better than a prettified assembly language. The 6502, before you add in the code stored in a particular system’s ROM, couldn’t even multiply two numbers, much less perform complex string manipulation or operate on data structures.

This post has two code samples, written by myself, that demonstrate two things. First, they show you what assembly looks like, in something more than the little excerpts from last time. Second, they illustrate just how far we’ve come. These aren’t all that great, I’ll admit, and they’re probably not the fastest or smallest subroutines around. But they work for our purposes.

A debug printer

Debugging is a very important part of coding, as any programmer can (or should) agree. Assembly doesn’t give us too much in the way of debugging tools, however. Some assemblers do, and you might get something on your particular machine, but the lowest level doesn’t even have that. So this first snippet prints a byte to the screen in text form.

; Prints the byte in A to the address ($10),Y
; as 2 characters, then a space
printb:
    tax          ; save for later
    ; Some assemblers prefer these as "lsr a" instead
    lsr          ; shift A right 4 bits
    lsr          ; this moves the high bits to the bottom
    lsr
    lsr
    jsr outb     ; we use a subroutine for each character
    txa          ; reload A
    and #$0F     ; mask out the top 4 bits
    jsr outb     ; now print the bottom 4 bits
    lda #$20     ; $20 = ASCII space
    sta ($10),Y
    iny
    rts
outb:
    clc
    adc #$30     ; ASCII codes for digits are $30-$39
    cmp #$39     ; if A > 9, we print a letter, not a digit
    bmi digit
    clc
; Comment out this next line if you're using 6502asm.com '
    adc #$07     ; ASCII codes for A-F are $41-$46
digit:           ; either way, we end up here
    sta ($10),Y
    iny          ; move the "cursor" forward
    rts

You can call this with JSR printb, and it will do just what the comments say: print the byte in the accumulator. You’d probably want to set $10 and $11 to point to video memory. (On many 6502-based systems, that starts at $0400.)

Now, how does it work? The comments should help you—assembly programming requires good commenting—but here’s the gist. Hexadecimal is the preferred way of writing numbers when using assembly, and each hex digit corresponds to four bits. Thus, our subroutine takes the higher four bits (sometimes called a nibble, and occasionally spelled as nybble) and converts them to their ASCII text representation. Then it does the same thing with the lower four bits.

How does it do that part, though? Well, that’s the mini-subroutine at the end, starting at the label outb. I use the fact that ASCII represents the digits 0-9 as hexadecimal $30-$39. In other words, all you have to do is add $30. For hex A-F, this doesn’t work, because the next ASCII characters are punctuation. That’s what the CMP #$39...BMI digit check is for. The code checks to see if it should print a letter; if so, then it adds a further correction factor to get the right ASCII characters. (Since the online assembler doesn’t support true text output, we should comment out this adjustment; we’re only printing pixels, and these don’t need to be changed.)

This isn’t much, granted. It’s certainly not going to replace printf anytime soon. Then again, printf takes a lot more than 34 bytes. Yes, that’s all the space this whole subroutine needs, although it’s still about 1/2000 of the total memory of a 6502-based computer.

If you’re using the online assembler, you’ll probably want to hold on to this subroutine. Coders using a real machine (or emulation thereof) can use the available ROM routines. On a Commodore 64, for example, you might be able to use JSR $FFD2 instead.

Filling a gap

As I stated above, the 6502 processor can’t multiply. All it can do, as far as arithmetic is concerned, is add and subtract. Let’s fix that.

; Multiplies two 8-bit numbers at $20 and $21
; Result is a 16-bit number stored at $22-$23
; Uses $F0-$F2 as scratch memory
multi:
    ldx #$08    ; X holds our counter
    lda #$00    ; clear our result and scratch memory
    sta $22     ; these start at 0
    sta $23
    sta $F1

    lda $20     ; these can be copied
    sta $F0
    lda $21
    sta $F2

nxbit:
    lsr $F2
    bcc next    ; if no carry, skip the addition
    clc
    lda $22     ; 16-bit addition
    adc $F0
    sta $22
    lda $23
    adc $F1
    sta $23

next:
    asl $F0     ; 2-byte shift
    rol $F1
    dex         ; if our counter is > 0, repeat
    bne nxbit
    rts

This one will be harder to adapt to a true machine, since we use a few bytes of the zero page for “scratch” space. When you only have a single arithmetic register, sacrifices have to be made. On newer or more modern machines, we’d be able to use extra registers to hold our temporary results. (We’d also be more likely to have a built-in multiply instruction, but that’s beside the point.)

The subroutine uses a well-known algorithm, sometimes called peasant multiplication, that actually dates back thousands of years. I’ll let Wikipedia explain the details of the method itself, while I focus on the assembly-specific bits.

Basically, our routine is only useful for multiplying a byte by another byte. The result of this is a 16-bit number, which shouldn’t be too surprising. Of course, we only have an 8-bit register to use, so we need to do some contortions to get things to work, one of the problems of using the 6502. (This is almost like a manual version of what compilers call register spilling.)

What’s most important for illustrative purposes isn’t the algorithm itself, though, but the way we call it. We have to set things up in just the right way, with our values at the precise memory locations; we must adhere to a calling convention. When you use a higher-level language, the compiler takes care of this for you. And when you use assembly to interface with higher-level code (the most common use for it today), it’s something you need to watch.

As an example, take a modern x86 system using the GCC compiler. When you call a C function, the compiler emits a series of instructions to account for the function’s arguments and return value. Arguments are pushed to the stack in a call frame, then the function is called. It accesses those arguments by something like the 6502’s indexed addressing mode, then it does whatever it’s supposed to do, and returns a result either in a register (or two) or at a caller-specified memory location. Then, the caller manipulates the stack pointer—much faster than repeatedly popping from the stack—to remove the call frame, and continues execution.

No matter how it’s done, assembly code that’s intended to connect to higher-level libraries—whether in C or some other language—have to respect that language’s calling conventions. Other languages do, too. That’s what extern "C" is for in C++, and it’s also why many other languages have a foreign function interface, or FFI. In our case, however, we’re writing those libraries, and the 6502 is such a small and simple system, so we can make our own calling conventions. And that’s another reason we need good documentation when coding assembly.

Coming up

We’ll keep going through this wonderful, primitive world a while longer. I’ll touch on data structures, because they have a few interesting implications when working at this low level, but we won’t spend too much time on them. After that, who knows?

Assembly: the building blocks

(Editor’s note: Sorry about the formatting for this one. The syntax highlighter I use on here had absolutely no idea what to do with assembly—I think it thought it was Perl—and it screwed everything up. To turn it off, I had to wrap each code sample in HTML code blocks, which killed the indentation.)

So here we are. One of the reasons I chose the 6502 for this little series is because it has such a simple assembly language. It could fit in one post, even covering the intricacies of addressing, the stack, and other bits we’ll get into. Compare that to, say, 16-bit x86, with its fairly complex addressing modes, its segmented memory model, and a completely different I/O system. Add to that the requirement to have an OS, even one such as FreeDOS, and you have quite the task just getting started. The 6502, by contrast, is easy, at least as far as any assembly language can be called easy.

The idea of assembly

In most modern programming languages, things are a bit abstract. You have functions, flow control statements (if, while, and for in C, for example), variables, maybe even objects and exceptions and other neat stuff like that. You’ve got a big standard library full of pre-built routines so you don’t have to reinvent the wheel. In some styles of programming, you aren’t even supposed to care about the low-level details of just how your code runs.

With assembly, all that is gone. It’s just you and the machine. You don’t write assembly code on the level of functions or even statements. You write instructions. You’re telling the computer exactly what to do at each step, and you have to tell it in its own language.

That leads us to a couple of basic concepts regarding assembly. First, each processor has an instruction set. This, obviously, is the set of instructions that it understands. Typically, these all have a prose description like “load accumulator” or “logical shift right”. This notation is a convenience for those studying the instruction set, not for those actually using it. The processor itself doesn’t understand them; it works in numbers like (hexadecimal) $A9 and $4A, what are often called opcodes (a shortened version of “operation codes”). Assembly programmers get a compromise between these extremes: a set of mnemonics, one for each kind of instruction that the processor understands. These are abbreviations, usually only a few letters—the 6502, for example, always uses 3-letter mnemonics. In this form, the two instructions above would be written as LDA and LSR. (Most assemblers nowadays are case-insensitive, so you can write lda and lsr if you want, and I will in assembly language program listings. For describing the instructions themselves, however, I’ll stick to all-caps.)

The second thing to know about assembly also regards its lack of abstractions, but concerning the computer’s memory. Especially on early microprocessors like the 6502, the assembly programmer needs intimate knowledge of the memory layout and how the CPU can access it. Remember, we can’t call a function like in C (f(x,y)). We have to convert even that to a language that the computer understands. How we do that depends very much on the specific system we’re using, so now it’s time to look at the 6502 in particular.

Addressing the 6502

Before we get to the meat of 6502 assembly, we need to look at how a programmer can communicate with the processor. Obviously, most of the work will be done by the registers we saw last time, namely A, X, and Y. Of course, three bytes of usable data isn’t that much, so we’ll be accessing memory almost constantly. And the 6502 offers a few ways to do that—called addressing modes—although some only work with certain instructions.

The first way we can access data not in a register is by putting it in the instruction itself, as an immediate value. On 6502 assemblers, this is usually indicated by a #. For example, LDA #$10 places the value 16 (or $10 in hexadecimal) into the accumulator.

If we want to work with a known location of memory, we might be able to give that location to the instruction using absolute addressing. For example, the Commodore 64’s screen memory is set up so that the upper-left character on the screen is at address $0400. To store the value in the A register there, we could use STA $0400. When using zero-page addresses ($00xx), we can omit the top byte: LDA $FE. This actually saves a byte of memory, which is a lot more important on a system with only 64K than on todays multi-gig computers.

Most of the other addressing modes of the 6502 are in some way indirect, using a value in memory or a register like a pointer (for those of you that know a language like C). These include:

Absolute indirect. Only one instruction actually uses this one. JMP ($FFFE) jumps to the address stored at memory location $FFFE. Since the 6502 has a 16-bit address space, this actually uses the address you give and the one right after it—in this case, $FFFE and $FFFF. (The 6502, like the x86, is little-endian, meaning that the first byte is the low one.)
Relative. This mode is only used by the branching instructions, and it indicates a relative “displacement”. BEQ $08, for example, would jump forward 8 bytes “if equal”. Negative values are allowed, but they’re encoded as two’s-complement numbers (basically, think of N bytes back as $100 - N ahead): BNE $FE jumps back 2 bytes, which makes an awfully effective infinite loop.
Indexed. This is where the X and Y registers start coming into their own. With indexed addressing, one of these is added to the address you give (either 2-byte absolute or 1-byte zero-page). An example would be STA $0400,X, which stores the accumulator value in the address $0400 + X. So, if the X register contains $10, this writes to $0410. Note that some instructions can only use X, some can only use Y, and a few are limited to zero-page addresses.
Indexed indirect and indirect indexed. Don’t worry about mixing the names up on these; they don’t matter. What does matter is how they work. These both use a memory location as a pointer and a register value as an index. The difference is where they add the index in. Indexed indirect adds the X register to the address you give, and creates a pointer from that. Indirect indexed, on the other hand, adds the Y register to the stored value, then uses that as the pointer.

As an example, let’s say that memory locations $30 and $31 each contain the value $80, while $40 and $41 both have $20. Also, both X and Y are set to $10. In this setup, indexed indirect (LDA ($30,X)) takes the memory location $30 + X (i.e., $40) and loads whatever is at the address stored there, essentially as if you’d written LDA $2020. Indirect indexed (LDA ($30),Y) instead takes what is stored at the location you give ($8080, in our example), then adds Y ($10) to that to get the final pointer: $8080 + $10 = $8090. In this case, the effect is the same as LDA $8090.

Finally, assemblers allow the use of labels, so you don’t have to worry about exact addresses. These are the closest you’re going to get to something like named functions. In assembly source code, they’re defined like this: label:. Later on, you can refer to them like you would any memory address, e.g., LDA label or BEQ label. One of the assembler’s jobs is to replace the labels with the”real” addresses, and it’s pretty good at that.

The instructions

After all that, the actual instruction set of the 6502 is refreshingly uncomplicated. All told, there are only a few dozen possible instructions, all of them performing only the most basic of actions. Yet this small arsenal was enough for a generation of 8-bit computers.

Many assembly language references put the instructions in alphabetical order by mnemonics. But the 6502’s set is so small that we can get away with ordering them by what they do. As it turns out, there aren’t too many categories, only about a dozen or so. Also, I’ll have examples for some of the categories, but not all. In the code samples, a ; marks a comment; the rest of the line is ignored, just like // or #, depending on your favorite language.

Load and store

Especially on older, less capable systems like the 6502, moving data around is one of the most important tasks. And there are two ways that data can go: from memory into a register or the other way around. For our CPU, moving a byte from memory to a register is a load, while sending it from a register to a memory location is a store. (The x86, to illustrate a different way of doing things, uses a single instruction, MOV, to do both of these.)

There are three “user” registers on the 6502, and each one has a load and a store instruction. To load a value into a register, you use LDA, LDX, or LDY. To store a value from one of them into memory, it’s STA, STX, and STY. (I think you can figure out which one uses which register.)

In terms of addressing, these instructions are the most well-rounded. The accumulator gives you the most options here, offering immediate, absolute, and all the indirect options. With X and Y, you can’t use indirect addressing, and you can only use the other of the two registers as an index. So you can write LDX $30,Y, but not LDX $30,X.

This code example doesn’t do too much. It sets up the first two memory locations as a pointer to $0400, then writes the byte $36 to that location. For the online assembler I’m using, that makes a blue dot on the left side of the screen, in the middle. On a real C64 or Apple II, that address is the top-left corner of the screen, so it will display whatever the system thinks $36 should be, probably the number 6.



start:

    lda #$00    ; we need 2 loads & 2 stores

    sta $00     ; to set up a 16-bit address

    lda #$04

    sta $01

lda #$36 ldy #$00 ; clear Y to use as an index sta ($00),Y ; stores our byte at $0400

Arithmetic

Besides shuffling data around, computers mainly do math. It’s what they’re best at. As an older microprocessor, the 6502 had to cut corners; by itself, it can only add and subtract, and then only using the accumulator. These two instructions, ADC and SBC, are a little finicky, and they’re our first introduction to the processor status or “flags” register, P. So we’ll take a quick diversion to look at it.

The P register on the 6502, like all its other registers, is a single byte. But we usually don’t care about its byte value as a whole. Instead, we want to look at the individual bits. Since there are eight bits in a byte, there are eight possible flags. The 6502 uses seven of these, although the online assembler doesn’t support two of those, and a third was rarely used even back in the day. So that leaves four that are important enough to mention here:

Bit 7, the Negative (N) flag, is changed after most instructions that affect the A register. It’ll be equal to the high bit of the accumulator, which will always indicate a negative number.
Bit 6, Overflow (V), is set whenever the “sign” of the accumulator changes from arithmetic.
Bit 1 is the Zero (Z) flag, which is only set if the last load, store, or arithmetic instruction ended in a 0.
Bit 0, the Carry (C) flag, is the important one. It’s set when an addition or subtraction causes a result that can’t fit into a byte, as well as when we use some bitwise instructions.

Now, the two arithmetic instructions are ADC and SBC, which stand for “add with carry” and “subtract with carry”. The 6502 doesn’t have a way to add or subtract without involving the carry flag! So, if we don’t want it messing with us, we need to clear it (CLC, which we’ll see again below) before we start doing our addition. Conversely, before subtracting, we must set it with the SEC instruction. (The reason for this is due to the way the processor was designed.)

Also, these instructions only work with the accumulator and a memory address or immediate value. You can’t directly add to X or Y with them, but that’s okay. In the next section, we’ll see instructions that can help us.

The code example here builds on the last one. In the online assembler, it displays a brown pixel next to the blue one. On real computers, it should put a 9 to the right of the 6, because 8-bit coders have dirty minds.



    start:

        lda #$00    ; we need 2 loads & 2 stores

        sta $00     ; to set up a 16-bit address

        lda #$04

        sta $01
        lda #$36

        ldy #$00    ; clear Y to use as an index

        sta ($00),Y ; stores our byte at $0400

clc ; always clear carry first adc #$03 ; A += 3 iny ; move the position right 1 sta ($00),Y ; store the new value

Increment and decrement

The INY (“increment Y register”) instruction I just used is one of a group of six: INC, DEC, INX, DEX, INY, DEY.

All these do instructions do is add or subtract 1, an operation so common that just about every processor in existence has dedicated instructions for it, which is also why C has the ++ and -- operators. For the 6502, these can work on either of our index registers or a memory location. (If you’re lucky enough to have a later model, you also have INA and DEA, which work on the accumulator.)

Our code example this time is an altered version of the last one. This time, instead of incrementing the Y register, we increment the memory location $00 directly. The effect, though, is the same.



    start:

        lda #$00    ; we need 2 loads & 2 stores

        sta $00     ; to set up a 16-bit address

        lda #$04

        sta $01
        lda #$36

        ldy #$00    ; clear Y to use as an index

        sta ($00),Y ; stores our byte at $0400

clc ; always clear carry first adc #$03 ; A += 3 inc $00 ; move the position right 1 sta ($00),Y ; store the new value

Flags

We’ve already seen CLC and SEC. Those are part of a group of instructions that manipulate the flags register. Since we don’t care about all the flags, there’s only one more of these that is important: CLV. All it does is clear the overflow flag, which can come in handy sometimes.

By the way, the other four are two pairs. CLI and SEI work on the interrupt flag, which the online assembler doesn’t support. CLD and SED manipulate the decimal flag, which doesn’t seem to get much use.

There’s no real code example this time, since we’ve already used CLC. SEC works the same way, and I can’t think of a quick use of the overflow flag.

Comparison

Sometimes, it’s useful to just compare numbers, without adding or subtracting. For this, the 6502 offers a trio of arithmetic comparison instructions and one bitwise one.

CMP, CPX, CPY each compare a value in memory to the one in a register (CMP uses A, the others are obvious). If the register value is less than the memory one, the N flag is set. Otherwise, the C flag gets set. If they’re equal, it also sets the Z flag.

BIT works a little differently. It sets the N and V flags to the top two bits of the memory location (no indirection or indexing allowed). Then, it sets the Z flag if the bitwise-AND of the memory byte and the accumulator is zero, i.e., if they have no 1 bits in common.

Comparison instructions are most useful in branching, so I’ll hold off on the example until then.

Branching

Branching is how we simulate the higher-level control structures like conditionals and loops. In the 6502, we have the option of conditionally hopping around our code by using any of nine different instructions. Eight of these come in pairs, each pair based on one of the four main flags.

BCC and BCS branch if the C flag is clear (0) or set (1), respectively.
BNE (“branch not equal”) and BEQ (“branch equal”) do the same for the Z flag.
BVC and BVS branch based on the state of the V flag.
BPL (“branch plus”) and BMI (“branch minus”) work on the N flag.

All of these use the “relative” addressing mode, limiting them to short jumps.

The ninth instruction is JMP, and it can go anywhere. You can use it with a direct address (JMP $FFFE) or an indirect one (JMP ($0055)), and it always jumps. Put simply, it’s GOTO. But that’s not as bad as it sounds. Remember, we don’t have the luxury of while or for. JMP is how we make those.

This code sample, still building on our earlier attempts, draws nine dots (or the digits 0-9) on the screen.



    start:

        lda #$00

        sta $00

        lda #$04

        sta $01
        lda #$30

        ldy #$00

loop: sta ($00),Y ; write the byte to the screen clc adc #$01 ; add 1 to A for next character iny ; move 1 character to the right cpy #$0a ; have we reached 10 yet? bne loop ; if not, go again

For comparison, a pseudo-C version of the same thing:

char* screen = 0x0400;
char value = 0x30;
for (int i = 0; i < 10; i++) {
    screen[i] = value;
    value++;
}

The stack

The stack, on the 6502 processor, is the second page of memory, starting at address $0100. It can be used to store temporary values, addresses, and other data, but it’s all accessed through the stack pointer (SP). You push a value onto the stack, then pop (or pull, to use 6502 terminology) it back off when you need it back.

We’ve got an even half dozen instructions to control the stack. We can push the accumulator value onto it with PHA, and we can do the same with the flags by using PHP. (Not the programming language with that name, thankfully.) Popping—or pulling, if you prefer the archaic term—the value pointed to by the SP uses PLA and PLP. The other two instructions, TSX and TXS let us copy the stack pointer to the X register, or vice versa.

Subroutines

Branches give us flow control, an important part of any high-level programming. For functions, the assembly programmer uses subroutines, and the 6502 has a pair of instructions that help us implement them. JSR (“jump to subroutine”) is an unconditional jump like JMP, except that it pushes the address of the next instruction to the stack before jumping. (Since we only have a page of stack space, this limits how “deep” you can go.) When the subroutine is done, the RTS instruction sends you back to where you started, just after the JSR.

The code sample here shows a little subroutine. See if you can figure out what it does.



    start:

        lda #$00

        sta $00

        lda #$04

        sta $01
        lda #$31

        ldy #$09

        jsr show    ; call our subroutine

        jmp end     ; jump past when we're done
    show:
        sta ($00),Y ; write the byte to screen mem

        clc

        adc #$01    ; add 1 to accumulator

        dey

        bne show    ; loop until Y = 0

        rts         ; return when we're done

end: ; label so we can skip the subroutine

Bitwise

We’ve got a total of seven bitwise instructions (not counting BIT, which is different). Three of these correspond to the usual AND, OR, and XOR operations, and they work on a memory location and the accumulator. AND has an obvious name, ORA stands for “OR with accumulator”, and EOR is “exclusive OR”. (Why they used “EOR” instead of “XOR”, I don’t know.) If you’ve ever used the bit-twiddling parts of C or just about any other language, you know how these work. These three instructions also change the Z and N flags: Z if the result is 0, N if the highest bit of the result is set.

The other four manipulate the bits of memory or the accumulator themselves. ASL is “arithmetic shift left”, identical to the C << operator, except that it only works one bit at a time. The high bit is shifted into the C flag, while Z and N are altered like you’d expect. LSR (“logical shift right”) works mostly in reverse: every bit is shifted down, a 0 is moved into the high bit, and the low bit goes into C.

ROL and ROR (“rotate left/right”) are the oddballs, as few higher-level languages have a counterpart to them. Really, though, they’re simple. ROL works just like ASL, except that it shifts whatever was in the C flag into the low bit instead of always a 0. ROR is the same, but the other way around, putting the C flag’s value into the high bit.

Transfer

We could move bytes between the A, X, and Y registers by copying them to memory or using the stack instructions. That’s time-consuming, though. Instead, we’ve got the TAX, TAY, TXA, and TYA instructions. These transfer a value from one register to another, with the second letter of the mnemonic as the source and the third as the destination. (TAX copies A to X, etc.) The flags are set how you’d expect.

The other guys

There are two other 6502 assembly instructions that don’t do too much. BRK causes an interrupt, which the online assembler can’t handle and isn’t that important for user-level coding. NOP does nothing at all. It’s used to fill space, basically.

Next time

Whew. Maybe I was wrong about fitting all this in one post. Still, that’s essentially everything you need to know about the 6502 instruction set. The web has a ton of tutorials, all of them better than mine. But this is the beginning. In the next part, we’ll look at actually doing things with assembly. That one will be full of code, too.

Assembly: welcome to the machine

Well, we have a winner, and it’s the 6502. Sure, it’s simple and limited, and it’s actually the oldest (this year, it turned 40) of our four finalists. But I chose it for a few reasons. First, the availability of an online assembler and emulator at 6502asm.com. Second, because of its wide use in 8-bit computers, especially those I’ve used (you might call this nostalgia, but it’s a practical reason, too). And finally, its simplicity. Yes, it’s limited, but those limitations make it easier to understand the system as a whole.

Before we get started

A couple of things to point out before I get into the details of our learning system. I’m using the online assembler at the link above. You can use that if you like. If you’d rather use something approximating a real 6502, there are plenty of emulators out there. There’s a problem, though. The 6502 was widely used as a base for the home computers of the 70s and 80s, but each system changed things a little. And the derivative processors, such as the Commodore 64’s MOS 6510 or the Ricoh 2A03 used in the NES, each had their own quirks. So, this series will focus on an “ideal” 6502 wherever possible, only entering the real world when necessary.

A second thing to note is the use of hexadecimal (base-16) numbers. They’re extremely common in assembly programming; they might even be used more than decimal numbers. But writing them down poses a problem. The mathematically correct way is to use a subscript: 0400₁₆. That’s awfully hard to do, especially on a computer, so programmers developed alternatives. In modern times, we mostly use the prefix “0x”: 0x0400. In the 8-bit days, however, the convention was a prefixed dollar sign: $0400. Since that’s the tradition for 6502-based systems, that’s what we’ll use here.

The processor

Officially, we’re discussing the MOS Technology 6502 microprocessor. Informally, everybody just calls it the 6502. It’s an 8-bit microprocessor that was originally released in 1975, a full eight years before I was born, in what might as well be the Stone Age in terms of computers. Back then, it was cheap, easy to use, and developer-friendly, exactly what most other processors weren’t. And those qualities gave it its popularity among hobbyists and smaller manufacturers, at a time when most people couldn’t even imagine wanting a computer at home.

Electronically, the 6502 is a microprocessor. Basically, all that really means is that it isn’t meant to do much by itself, but it’s intended to drive other chips. It’s the core of the system, not the whole thing. In the Commodore 64, for example, the 6502 (actually 6510, but close enough) was accompanied by the VIC-II graphics chip, the famous SID chip for sound, and a pair of 6526 microcontrollers for input and output (I/O). Other home computers had their own families of companion chips, and it was almost expected that you’d add peripherals for increased functionality.

Internally, there’s not too much to tell. The 6502 is 8-bit, which means that it works with byte-sized machine words. It’s little-endian, so larger numbers are stored from the lowest byte up. (Example: the decimal number 1024, hexadecimal $0400, would be listed in a memory readout as 00 04.) Most 6502s run at somewhere around 1 MHz, but some were clocked faster, up to 2 MHz.

For an assembly programmer, the processor is very much bare-bones. It can access a 16-bit memory space, which means a total of 65,536 bytes, or 64K. (We’ll ignore the silly “binary” units here.) You’ve got a mere handful of registers, including:

The accumulator (A), which is your only real “working” register,
Two index registers (X and Y), used for indirect memory access,
A stack pointer (SP), which is nominally 16-bit, but the upper byte is hardwired to $01,
The processor status register (P), a set of “flag” bits that are used to determine certain conditions,
A program counter (PC) that keeps track of the address of the currently-executing assembly instruction.

That’s…not a lot. By contrast, the 8086 had 14 registers (4 general purpose, 2 index, 2 stack pointer, 4 segment registers, an instruction pointer, and the processor flags). Today’s x86-64 processors add quite a few more (e.g., 8 more general purpose, 2 more segment registers that are completely useless in 64-bit mode, 8 floating-point registers, and 8 SIMD registers). But, in 1975, it wasn’t easy to make a microprocessor that sold for $25. Or $100, for that matter, so that’s what you had to work with. The whole early history of personal computers, in fact, is an epic tale of cutting corners. (The ZX81, for example, had a “slow” and a “fast” mode. In fast mode, it disabled video output, because that was the only way the processor could run code at full speed!)

Memory

Because of the general lack of registers inside the processor, memory becomes of paramount importance on the 6502. Now, we don’t think of it much today, but there are two main kinds of memory: read-only (ROM) and writable (“random access”, hence RAM). How ROM and RAM were set up was a detail for the computer manufacturer or dedicated hobbyist; the processor itself didn’t really care. The Apple IIe, for example, had 64K of RAM and 16K of ROM; the Commodore 64 gave you the same amount of RAM, but had a total of 24K of ROM.

Astute readers—anyone who can add—will note that I’ve already said the 6502 had a 16-bit memory address space, which caps at 64K. That’s true. However much memory you really had (e.g., 80K total on the Apple IIe), assembly code could only access 64K of it at a time. Different systems had different ways of coping with this, mostly by using bank switching, where a part of address space could be switched to show a different “window” of the larger memory.

One quirk of the 6502’s memory handling needs to be mentioned, because it forms a very important part of assembly programming on the processor. Since the 6502 is an 8-bit processor, it’s natural to divide memory into pages, each page being 256 bytes. In hexadecimal terms, pages start at $xx00 (where xx can be anything) and run to $xxFF. The key thing to notice is that the higher byte stays the same for every address in the same page. Since the computer only works with bytes, the less we have to cross a page “boundary” (from $03FF to $0400, for instance), the better.

The processor itself even acknowledges this. The zero page, the memory located from $0000 to $00FF, can be used in 6502 assembly as one-byte addresses. And because the 6502 wasn’t that fast to begin with, and memory wasn’t that much slower, it’s almost like having an extra 256 registers! (Of course, much of this precious memory space is reserved on actual home computers, meaning that it’s unavailable for us. Even 6502asm uses two bytes of the zero page, $FE and $FF, for its own purposes.)

Video and everything else

Video display depended almost completely on the additional hardware installed alongside a 6502 processor. The online assembler I’ll be using has a very simplified video system: 32×32 pixels, each pixel taking up one byte, running from address $0200 to $05FF, with 16 possible colors. Typically, actual computers gave you much more. Most of them had a text mode (40×24, 80×25, or something like that) that may or may not have offered colors, along with a high-res mode that was either monochrome or very restricted in colors.

Almost any other function you can think of is also dependent on the system involved, rather than being a part of the 6502 itself. Our online version doesn’t have any extra bells and whistles, so I won’t be covering them in the near future. If interest is high enough, however, I might go back and delve deeper into one of the many emulators available.

Coming up

So that’s pretty much it for the basics of the 6502. A bare handful of registers, not even enough memory to hold the stylesheet for this post, and a bunch of peripherals that were wholly dependent upon the manufacturer of the specific computer you used. And it was still the workhorse of a generation. After four decades, it looks primitive to us, because it is. But every journey starts somewhere, and sometimes we need to go back to a simpler time because it was simpler.

That’s the case here. While I could certainly demonstrate assembly by making something in modern, 64-bit x86 code, it wouldn’t have the same impact. Modern assembly, on a larger scale, is often not much more than a waste of developer resources. But older systems didn’t have the luxury of optimizing compilers and high-level functional programming. For most people in the 80s, you used BASIC to learn how to program, then you switched to assembly when you wanted to make something useful. That was okay, because everybody else had the same limitations, and the system itself was so small that you could be productive with assembly.

In the next post of this series, we’ll actually start looking at the 6502 assembly language. We’ll even take a peek under that hood, going down to the ultimate in low-level, machine code. I hope you’ll enjoy reading about it as much as I am writing it.

Assembly: narrowing the field

Okay, I guess I told a little fib. But I’ve got a good reason. I’ve been thinking this over for a while, and I still haven’t come to a decision. I want to do a few posts introducing an assembly language and the “style” of programming it, but I don’t know which assembly language to use. So, in the interest of better informing you (and giving myself an extra week to figure out an answer), I’ve decided to put some of my thought processes into words.

The contenders

I know a handful of assembly variants, and I know of quite a few more, especially if you count different versions of architectures as separate languages. Some of these are more useful for introductory purposes than others, while some are much more useful in actual work. These two subsets, unfortunately, don’t overlap much. But I’ve come up with four finalists for this coveted crown, and here they are, complete with my own justifications.

6502

If you’re older than about 25, then you’ve probably used a 6502-based system before, whether you knew it or not. It was a mainstay for 80s “home computers”, including the Apple II and Commodore 64, and a customized version was used in the NES. It’s still a hobbyist favorite, mostly for nostalgia reasons rather than any technical superiority, and there’s no sign that it will ever become truly extinct. (Bender, after all, has one in his head, which explains a lot.)

Pros:

8-bit processors are about as simple as you can get while still being useful.
There’s a lot of work out there already, in terms of tutorials, programming guides, etc.
An online assembler exists, which makes things much easier.
Plenty of emulators are available, although these are usually for specific 6502 computers.

Cons:

8-bit can be very limiting, mostly because it is so simple.
There aren’t many registers, which slows down a lot of work and means a lot of memory trickery.
Despite what its followers might think, 6502 is pretty much a dead end for development, as it has been for about 20 years.

Early x86

By “early”, I specifically mean the 16-bit x86 processors, the 8086 and 80286. These are the CPUs of the first IBM-compatible personal computers, and the ancestors of the i5 and A10 we use today. You would think that would give it direct relevance, but you’d actually be wrong. Today’s x86 processors, when running in 64-bit mode, actually can’t directly run 16-bit code. But we’d be using an emulator of some sort, anyway, so that’s not a problem that would concern us.

Pros:

Very familiar and widespread, as (a descendent of) x86 is still used in just about every PC today.
16-bit isn’t that much more complicated than 8-bit.
All those old DOS assembly tutorials are out there, somewhere, and most of them contain useful information.
Even though current processors can’t execute 16-bit code directly, you can still use one of the dozens of emulators out there, including DOSBox.

Cons:

The segmented memory architecture is weird and hard to explain.
It’s easy to fall into the trap of using skills and tricks that were relevant here in more modern applications, where they simply don’t carry over.
A lot of people just don’t like x86 and think it’s horrible; I don’t understand this, but I respect it.

AVR

Atmel’s AVR line of microcontrollers is pretty popular in the embedded world. One of them powers the Arduino, for example. Thus, there’s actually a built-in market for AVR assembly, although most programmers now use C. Of course, AVR has its own problems, not the least is its odd way of segregating program and data memory.

Pros:

Very relevant today as an embedded platform.
A ton of support online, including tutorials, forums, etc.
The assembly language and architecture, despite a few warts, is actually nice, in my opinion.
Lots of good tools, including a port of GCC.

Cons:

Emulator quality is hit or miss. (AVR Studio was okay when I used it 4 years ago, but it’s non-free and Windows only, and the free options are mostly beta quality.)
The Harvard architecture (totally separate memory spaces for code and data) is used by almost nothing else today, and it’s cumbersome at best.
AVR is very much a microcontroller platform, not a microprocessor one. It’s intended for use in embedded systems, not PCs.

MIPS

MIPS is a bit of an oddball. Sure, it’s used in a few places out there. There’s a port of Android to it, and MIPS CPUs were used in most of the 90s consoles, like the N64 and PlayStation. There’s not much modern development on it, though, except in the Chinese Loongson, which started out as a knock-off, but then became a true MIPS implementation. But its true value seems to be in its assembly language, which is often recommended as a good way to learn the ropes.

Pros:

Fairly simple assembly language and a sensible architecture.
Tools are widespread, including cross-compilers, emulators, and even native versions of Linux.
RISC, if you like that. After all, “RISC is good”, to quote Hackers.

Cons:

Not quite as relevant as it once was. (If I had written this 20 years ago, the choice would probably be between this and x86.)
A bit more complex than the others, which actually removes some of the need for assembly.
I don’t really know it that well, so I would have to learn it first.

The also-rans

Those aren’t the only assembly language platforms out there. There are quite a few that are popular enough that they could fit here, but I didn’t pick them for whatever reason. Some of them include:

PowerPC: Used in Macs from about a decade ago, as well as the consoles of the 2000s (GameCube, Wii, PS3, XBox 360), but I don’t know much about it, and it’s mostly on servers now.
68000: The 16-bit CPU from the Sega Genesis and the original Macintosh, later developed into a true 32-bit processor. It has its supporters, and it’s not that bad, but again, I don’t know it like I do the others.
Z80: This one was used by a few home computers, like the ZX80, Kaypro, and (my favorite) the TRS-80. It’s a transparent remake of the 8080 (forerunner to the 8086) with just enough detail changed to avoid lawsuits. But I know x86 better.
PIC: One of the most popular architectures in the 8-bit embedded world. I’ve read a little bit about it, and I don’t exactly like what I see. Its assembly requires a few contortions that I think distract from the task at hand.
ARM: The elephant in the room. Technically, ARM is a whole family of architectures, each slightly different, and that is the main problem. The tools are great, the assembly language isn’t half bad (but increasingly less necessary). It’s just that there’s too much choice.
MIX: Donald Knuth invented this fictional assembly language for his series The Art of Computer Programming. Then, after 30 years of work, he scrapped it for the next edition, replacing it with the “modern” MMIX. Neither one of them works here, in my opinion. MMIX has a lack of tool support (since it’s not a “real” machine language) and the best tutorial is Knuth’s book. MIX is even worse, since it’s based on the horrible architectures of the 60s and early 70s.

Better times ahead

Hopefully, I’ll come to a decision soon. Then I can start the hard part: actually making assembly language interesting. Wish me luck.