Assembly: a little bit more

Well, I’m back. Instead of giving you more apologies for missing a couple of weeks of this exciting series (sarcasm alert!), let’s jump right back in and look at some more old-school assembly language. This week, we’ll get to know homebrewed 6502 versions of a couple of C standard library staples, and we can start talking about how you use data structures in assembly.

Memory and data

The simplest, dumbest (for the computer, not the programmer) way to treat data is as raw memory. The problem is, there’s not much you can do with it. You can initialize it, copy it around, and that’s about it. Everything else needs some structure. Copying in assembly language is pretty easy, though, even in 6502-land:

; Arguments:
; $F0-$F1: Source address
; $F2-$F3: Destination address
; Y register: Byte count
    lda ($F0), Y    ; 2 bytes, 5 cycles
    sta ($F2), Y    ; 2 bytes, 6 cycles
    dey             ; 1 byte, 2 cycles
    bne memcpy      ; 2 bytes, 2-3 cycles
    rts             ; 1 byte, 6 cycles

Yep, this is a stripped-down version of memcpy. It has its limitations—it can only copy a page of memory at a time, and it has no error checking—but it’s short and to the point. Note that, instead of a prose description of the subroutine’s arguments and return values and whatnot, I’m just putting that in the comments before the code. I trust that you can understand how to work with that.

Since the code is pretty self-explanatory, the comments for each line show the size and time taken by each instruction. A little bit of addition should show you that the whole subroutine is only 8 bytes; even on modern processors, the core of memcpy isn’t exactly huge.

The timing calculation is a little more complex, but it’s no less important on a slow, underpowered CPU like the 6502. In the case of our subroutine, it depends on how many bytes we’re copying. The core of the loop will take 13 cycles for each iteration. The branch instruction is 3 cycles when the branch is taken, 2 cycles when it’s missed. Altogether, copying n bytes takes 16n+5 cycles, a range of 21 to 4101. (A zero byte count is treated as 256.) In a modern computer, four thousand cycles would be a few microseconds at most. For the 6502, however, it’s more like a few milliseconds, but it’s hard to get faster than what we’ve got here.


The first way we can give structure to our data is with strings. Particularly, we’ll look at C-style strings, series of bytes terminated by a null value, hex $00. One of the first interesting operations is taking the string’s length—the C standard library’s strlen—and this is one implementation of it in 6502 assembly:

; Arguments:
; $F0-$F1: String address
; Returns:
; A: Length of null-terminated string
    ldy #$00        ; 2 bytes, 2 cycles
    clv             ; 1 byte, 2 cycles
    lda ($F0), Y    ; 2 bytes, 5 cycles
    beq slend       ; 2 bytes, 2-3 cycles
    iny             ; 1 byte, 2 cycles
    bvc slloop      ; 2 bytes, 3 cycles
    tya             ; 1 byte, 2 cycles
    rts             ; 1 byte, 6 cycles

All it does is count up through memory, starting at the pointed-to address, until it reaches a zero byte. When it’s done, it gives back the result in the accumulator. Now, this comes with an obvious restriction: our strings can’t be more than 255 bytes, or we get wraparound. For this example, that’s fine, but you need to watch out in real code. Of course, in modern processors, you’ll usually have at least a 32-bit register to work with, and there aren’t too many uses for a single string of a few billion bytes.

Our assembly version of strlen weighs in at 12 bytes. Timing-wise, it’s 12n+20 cycles for a string of length n, which isn’t too bad. The only real trickery is abusing the overflow flag to allow us an unconditional branch, since none of the instructions this subroutine uses will affect it. Using a simple JMP instruction is equivalent in both time and space, but it means we can’t relocate the code once it has been assembled.

Another common operation is comparing strings, so here’s our version of C’s strcmp:

; Arguments:
; $F0-$F1: First string
; $F2-$F3: Second string
; Returns comparison result in A:
; -1: First string is less than second
; 0: Strings are equal
; 1; First string is greater than second
    ldy #$00        ; 2 bytes, 2 cycles
    lda ($F0), Y    ; 2 bytes, 5 cycles
    cmp ($F2), Y    ; 2 bytes, 5 cycles
    bne scdone      ; 2 bytes, 2-3 cycles
    iny             ; 1 byte, 2 cycles
    cmp #$00        ; 2 bytes, 2 cycles
    bne scload      ; 2 bytes, 2-3 cycles
    lda #$00        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles
    bcs scgrtr      ; 2 bytes, 2-3 cycles
    lda #$FF        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles
    lda #$01        ; 2 bytes, 2 cycles
    rts             ; 1 byte, 6 cycles

Like the its C namesake, our strcmp doesn’t care about alphabetical order, only the values of the bytes themselves. The subroutine uses just 24 bytes, though, so you can’t ask for too much. (Timing for this one is…nontrivial, so I’ll leave it to the more interested reader.)

Other structures

Arrays, in theory, would work almost like strings. Instead of looking for null bytes, you’d have an explicit count, more like newer C functions such as strncmp. On the 6502, the indirect indexed addressing mode (e.g., LDA ($F0), Y) we’ve used in every example so far is your main tool for this. Other architectures have their own variations, like the x86’s displacement mode.

More complex structures (like C structs or C++ classes), are tougher. This is where the assembly programmer needs a good understanding of how high-level compilers implement such things. Issues like layout, padding, and alignment come into play on modern computers, while the 6502 suffers from the slower speed of indirection.

Self-contained structures (those that won’t be interfacing with higher-level components) are really up to you. The most common layout is linear, with each member of the structure placed consecutively in memory. This way, you’re only working with basic offsets.

But there’s a problem with that, as newer systems don’t really like to access any old byte. Rather, they’ll pull in some number of bytes (always a power of 2: 2, 4, 8, 16, etc.) all at once. Unaligned memory accesses, such as loading a 32-bit value stored at memory location 0x01230001 (using x86-style hex notation) will be slower, because the processor will want to load two 32-bit values—0x01230000 and 0x01230004—and then it has to do a little bit of internal shuffling. Some processors won’t even go that far; they’ll give an error at the first sign of an unaligned access.

For both of these reasons, modern languages generate structures with padding. A C struct containing a byte and a 32-bit word (in that order), won’t take up the 5 bytes you’d expect. No, it’ll be at least 8, and a 64-bit system might even make it 16 bytes. It’s a conscious trade-off of size for speed, and it’s a fair trade in these present days of multi-gigabyte memory. It’s not even that bad on embedded systems, as they grow into the space occupied by PCs a generation ago.

Coming up

For now, I think I’m going to put this series on hold, as I’m not sure where I want it to go. I might move on to a bigger architecture, something like the x86 in its 16-bit days. Until then, back to your regularly scheduled programming posts.

Leave a Reply

Your email address will not be published. Required fields are marked *