Assembly: a little bit more

Well, I’m back. Instead of giving you more apologies for missing a couple of weeks of this exciting series (sarcasm alert!), let’s jump right back in and look at some more old-school assembly language. This week, we’ll get to know homebrewed 6502 versions of a couple of C standard library staples, and we can start talking about how you use data structures in assembly.

Memory and data

The simplest, dumbest (for the computer, not the programmer) way to treat data is as raw memory. The problem is, there’s not much you can do with it. You can initialize it, copy it around, and that’s about it. Everything else needs some structure. Copying in assembly language is pretty easy, though, even in 6502-land:

; Arguments:
; $F0-$F1: Source address
; $F2-$F3: Destination address
; Y register: Byte count
    lda ($F0), Y    ; 2 bytes, 5 cycles
    sta ($F2), Y    ; 2 bytes, 6 cycles
    dey             ; 1 byte, 2 cycles
    bne memcpy      ; 2 bytes, 2-3 cycles
    rts             ; 1 byte, 6 cycles

Yep, this is a stripped-down version of memcpy. It has its limitations—it can only copy a page of memory at a time, and it has no error checking—but it’s short and to the point. Note that, instead of a prose description of the subroutine’s arguments and return values and whatnot, I’m just putting that in the comments before the code. I trust that you can understand how to work with that.

Since the code is pretty self-explanatory, the comments for each line show the size and time taken by each instruction. A little bit of addition should show you that the whole subroutine is only 8 bytes; even on modern processors, the core of memcpy isn’t exactly huge.

The timing calculation is a little more complex, but it’s no less important on a slow, underpowered CPU like the 6502. In the case of our subroutine, it depends on how many bytes we’re copying. The core of the loop will take 13 cycles for each iteration. The branch instruction is 3 cycles when the branch is taken, 2 cycles when it’s missed. Altogether, copying n bytes takes 16n+5 cycles, a range of 21 to 4101. (A zero byte count is treated as 256.) In a modern computer, four thousand cycles would be a few microseconds at most. For the 6502, however, it’s more like a few milliseconds, but it’s hard to get faster than what we’ve got here.


The first way we can give structure to our data is with strings. Particularly, we’ll look at C-style strings, series of bytes terminated by a null value, hex $00. One of the first interesting operations is taking the string’s length—the C standard library’s strlen—and this is one implementation of it in 6502 assembly:

; Arguments:
; $F0-$F1: String address
; Returns:
; A: Length of null-terminated string
    ldy #$00        ; 2 bytes, 2 cycles
    clv             ; 1 byte, 2 cycles
    lda ($F0), Y    ; 2 bytes, 5 cycles
    beq slend       ; 2 bytes, 2-3 cycles
    iny             ; 1 byte, 2 cycles
    bvc slloop      ; 2 bytes, 3 cycles
    tya             ; 1 byte, 2 cycles
    rts             ; 1 byte, 6 cycles

All it does is count up through memory, starting at the pointed-to address, until it reaches a zero byte. When it’s done, it gives back the result in the accumulator. Now, this comes with an obvious restriction: our strings can’t be more than 255 bytes, or we get wraparound. For this example, that’s fine, but you need to watch out in real code. Of course, in modern processors, you’ll usually have at least a 32-bit register to work with, and there aren’t too many uses for a single string of a few billion bytes.

Our assembly version of strlen weighs in at 12 bytes. Timing-wise, it’s 12n+20 cycles for a string of length n, which isn’t too bad. The only real trickery is abusing the overflow flag to allow us an unconditional branch, since none of the instructions this subroutine uses will affect it. Using a simple JMP instruction is equivalent in both time and space, but it means we can’t relocate the code once it has been assembled.

Another common operation is comparing strings, so here’s our version of C’s strcmp:

; Arguments:
; $F0-$F1: First string
; $F2-$F3: Second string
; Returns comparison result in A:
; -1: First string is less than second
; 0: Strings are equal
; 1; First string is greater than second
    ldy #$00        ; 2 bytes, 2 cycles
    lda ($F0), Y    ; 2 bytes, 5 cycles
    cmp ($F2), Y    ; 2 bytes, 5 cycles
    bne scdone      ; 2 bytes, 2-3 cycles
    iny             ; 1 byte, 2 cycles
    cmp #$00        ; 2 bytes, 2 cycles
    bne scload      ; 2 bytes, 2-3 cycles
    lda #$00        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles
    bcs scgrtr      ; 2 bytes, 2-3 cycles
    lda #$FF        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles
    lda #$01        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles

Like the its C namesake, our strcmp doesn’t care about alphabetical order, only the values of the bytes themselves. The subroutine uses just 24 bytes, though, so you can’t ask for too much. (Timing for this one is…nontrivial, so I’ll leave it to the more interested reader.)

Other structures

Arrays, in theory, would work almost like strings. Instead of looking for null bytes, you’d have an explicit count, more like newer C functions such as strncmp. On the 6502, the indirect indexed addressing mode (e.g., LDA ($F0), Y) we’ve used in every example so far is your main tool for this. Other architectures have their own variations, like the x86’s displacement mode.

More complex structures (like C structs or C++ classes), are tougher. This is where the assembly programmer needs a good understanding of how high-level compilers implement such things. Issues like layout, padding, and alignment come into play on modern computers, while the 6502 suffers from the slower speed of indirection.

Self-contained structures (those that won’t be interfacing with higher-level components) are really up to you. The most common layout is linear, with each member of the structure placed consecutively in memory. This way, you’re only working with basic offsets.

But there’s a problem with that, as newer systems don’t really like to access any old byte. Rather, they’ll pull in some number of bytes (always a power of 2: 2, 4, 8, 16, etc.) all at once. Unaligned memory accesses, such as loading a 32-bit value stored at memory location 0x01230001 (using x86-style hex notation) will be slower, because the processor will want to load two 32-bit values—0x01230000 and 0x01230004—and then it has to do a little bit of internal shuffling. Some processors won’t even go that far; they’ll give an error at the first sign of an unaligned access.

For both of these reasons, modern languages generate structures with padding. A C struct containing a byte and a 32-bit word (in that order), won’t take up the 5 bytes you’d expect. No, it’ll be at least 8, and a 64-bit system might even make it 16 bytes. It’s a conscious trade-off of size for speed, and it’s a fair trade in these present days of multi-gigabyte memory. It’s not even that bad on embedded systems, as they grow into the space occupied by PCs a generation ago.

Coming up

For now, I think I’m going to put this series on hold, as I’m not sure where I want it to go. I might move on to a bigger architecture, something like the x86 in its 16-bit days. Until then, back to your regularly scheduled programming posts.

Assembly: welcome to the machine

Well, we have a winner, and it’s the 6502. Sure, it’s simple and limited, and it’s actually the oldest (this year, it turned 40) of our four finalists. But I chose it for a few reasons. First, the availability of an online assembler and emulator at Second, because of its wide use in 8-bit computers, especially those I’ve used (you might call this nostalgia, but it’s a practical reason, too). And finally, its simplicity. Yes, it’s limited, but those limitations make it easier to understand the system as a whole.

Before we get started

A couple of things to point out before I get into the details of our learning system. I’m using the online assembler at the link above. You can use that if you like. If you’d rather use something approximating a real 6502, there are plenty of emulators out there. There’s a problem, though. The 6502 was widely used as a base for the home computers of the 70s and 80s, but each system changed things a little. And the derivative processors, such as the Commodore 64’s MOS 6510 or the Ricoh 2A03 used in the NES, each had their own quirks. So, this series will focus on an “ideal” 6502 wherever possible, only entering the real world when necessary.

A second thing to note is the use of hexadecimal (base-16) numbers. They’re extremely common in assembly programming; they might even be used more than decimal numbers. But writing them down poses a problem. The mathematically correct way is to use a subscript: 040016. That’s awfully hard to do, especially on a computer, so programmers developed alternatives. In modern times, we mostly use the prefix “0x”: 0x0400. In the 8-bit days, however, the convention was a prefixed dollar sign: $0400. Since that’s the tradition for 6502-based systems, that’s what we’ll use here.

The processor

Officially, we’re discussing the MOS Technology 6502 microprocessor. Informally, everybody just calls it the 6502. It’s an 8-bit microprocessor that was originally released in 1975, a full eight years before I was born, in what might as well be the Stone Age in terms of computers. Back then, it was cheap, easy to use, and developer-friendly, exactly what most other processors weren’t. And those qualities gave it its popularity among hobbyists and smaller manufacturers, at a time when most people couldn’t even imagine wanting a computer at home.

Electronically, the 6502 is a microprocessor. Basically, all that really means is that it isn’t meant to do much by itself, but it’s intended to drive other chips. It’s the core of the system, not the whole thing. In the Commodore 64, for example, the 6502 (actually 6510, but close enough) was accompanied by the VIC-II graphics chip, the famous SID chip for sound, and a pair of 6526 microcontrollers for input and output (I/O). Other home computers had their own families of companion chips, and it was almost expected that you’d add peripherals for increased functionality.

Internally, there’s not too much to tell. The 6502 is 8-bit, which means that it works with byte-sized machine words. It’s little-endian, so larger numbers are stored from the lowest byte up. (Example: the decimal number 1024, hexadecimal $0400, would be listed in a memory readout as 00 04.) Most 6502s run at somewhere around 1 MHz, but some were clocked faster, up to 2 MHz.

For an assembly programmer, the processor is very much bare-bones. It can access a 16-bit memory space, which means a total of 65,536 bytes, or 64K. (We’ll ignore the silly “binary” units here.) You’ve got a mere handful of registers, including:

  • The accumulator (A), which is your only real “working” register,
  • Two index registers (X and Y), used for indirect memory access,
  • A stack pointer (SP), which is nominally 16-bit, but the upper byte is hardwired to $01,
  • The processor status register (P), a set of “flag” bits that are used to determine certain conditions,
  • A program counter (PC) that keeps track of the address of the currently-executing assembly instruction.

That’s…not a lot. By contrast, the 8086 had 14 registers (4 general purpose, 2 index, 2 stack pointer, 4 segment registers, an instruction pointer, and the processor flags). Today’s x86-64 processors add quite a few more (e.g., 8 more general purpose, 2 more segment registers that are completely useless in 64-bit mode, 8 floating-point registers, and 8 SIMD registers). But, in 1975, it wasn’t easy to make a microprocessor that sold for $25. Or $100, for that matter, so that’s what you had to work with. The whole early history of personal computers, in fact, is an epic tale of cutting corners. (The ZX81, for example, had a “slow” and a “fast” mode. In fast mode, it disabled video output, because that was the only way the processor could run code at full speed!)


Because of the general lack of registers inside the processor, memory becomes of paramount importance on the 6502. Now, we don’t think of it much today, but there are two main kinds of memory: read-only (ROM) and writable (“random access”, hence RAM). How ROM and RAM were set up was a detail for the computer manufacturer or dedicated hobbyist; the processor itself didn’t really care. The Apple IIe, for example, had 64K of RAM and 16K of ROM; the Commodore 64 gave you the same amount of RAM, but had a total of 24K of ROM.

Astute readers—anyone who can add—will note that I’ve already said the 6502 had a 16-bit memory address space, which caps at 64K. That’s true. However much memory you really had (e.g., 80K total on the Apple IIe), assembly code could only access 64K of it at a time. Different systems had different ways of coping with this, mostly by using bank switching, where a part of address space could be switched to show a different “window” of the larger memory.

One quirk of the 6502’s memory handling needs to be mentioned, because it forms a very important part of assembly programming on the processor. Since the 6502 is an 8-bit processor, it’s natural to divide memory into pages, each page being 256 bytes. In hexadecimal terms, pages start at $xx00 (where xx can be anything) and run to $xxFF. The key thing to notice is that the higher byte stays the same for every address in the same page. Since the computer only works with bytes, the less we have to cross a page “boundary” (from $03FF to $0400, for instance), the better.

The processor itself even acknowledges this. The zero page, the memory located from $0000 to $00FF, can be used in 6502 assembly as one-byte addresses. And because the 6502 wasn’t that fast to begin with, and memory wasn’t that much slower, it’s almost like having an extra 256 registers! (Of course, much of this precious memory space is reserved on actual home computers, meaning that it’s unavailable for us. Even 6502asm uses two bytes of the zero page, $FE and $FF, for its own purposes.)

Video and everything else

Video display depended almost completely on the additional hardware installed alongside a 6502 processor. The online assembler I’ll be using has a very simplified video system: 32×32 pixels, each pixel taking up one byte, running from address $0200 to $05FF, with 16 possible colors. Typically, actual computers gave you much more. Most of them had a text mode (40×24, 80×25, or something like that) that may or may not have offered colors, along with a high-res mode that was either monochrome or very restricted in colors.

Almost any other function you can think of is also dependent on the system involved, rather than being a part of the 6502 itself. Our online version doesn’t have any extra bells and whistles, so I won’t be covering them in the near future. If interest is high enough, however, I might go back and delve deeper into one of the many emulators available.

Coming up

So that’s pretty much it for the basics of the 6502. A bare handful of registers, not even enough memory to hold the stylesheet for this post, and a bunch of peripherals that were wholly dependent upon the manufacturer of the specific computer you used. And it was still the workhorse of a generation. After four decades, it looks primitive to us, because it is. But every journey starts somewhere, and sometimes we need to go back to a simpler time because it was simpler.

That’s the case here. While I could certainly demonstrate assembly by making something in modern, 64-bit x86 code, it wouldn’t have the same impact. Modern assembly, on a larger scale, is often not much more than a waste of developer resources. But older systems didn’t have the luxury of optimizing compilers and high-level functional programming. For most people in the 80s, you used BASIC to learn how to program, then you switched to assembly when you wanted to make something useful. That was okay, because everybody else had the same limitations, and the system itself was so small that you could be productive with assembly.

In the next post of this series, we’ll actually start looking at the 6502 assembly language. We’ll even take a peek under that hood, going down to the ultimate in low-level, machine code. I hope you’ll enjoy reading about it as much as I am writing it.