Software internals: Arrays

I’m a programmer. I think that should be obvious. Even though most of my coding these days is done at a pretty high level, I still enjoy the intricacies of low-level programs. It’s great that we have ways to hide the complexities inherent in software, but sometimes it’s fun to peel back the layers and look at just how it works. That’s the idea behind this little series of posts. I want to go into that deeper level, closer to the metal. First up, we’ll take a peek at that simplest and dumbest of data structures: the array.

Array, array

At the basic level, an array is nothing more than a sequence of values. Different languages have different rules, especially regarding types, but the idea is always the same. The values in an array are its elements, and they can be identified by an index. In C, for example, a[0] has the index 0, and it refers to the first element of the array named a. (C-style languages start counting from 0. We’ll see why shortly.)

For the purposes of this post, we’ll start with the simplest kind of array, a one-dimensional array whose values are all of the same type—integers, specifically. Later, we can expand on this, but it’s best to start small. Namely, we’ll have an array a with four values: {1, 2, 3, 4}. Also, we’ll mainly be using lower-level languages like C and C++, since they give the best look at how the code really runs.

In memory

One of the main reasons to use something like C++ is because of memory concerns, so let’s look at how such an array is set up in memory. On my PC, using 64-bit Debian Linux and GCC 5.3, it’s about as simple as can be. The compiler knows all the values beforehand, so all it does is put them in a “read-only data” section of the final executable. (In the assembly output, this shows up as .long statements in the .rodata section.) The elements of the array are in contiguous locations; that’s not just required by the C standard, but by the very definition of an array. It also makes them fast, especially when cache comes into play.

In C++, 4 integers in an array take up the space of, well, 4 integers. On a 64-bit system, that’s 32 bytes, half that if you’re still on 32-bit. There’s no overhead, because an array at this level is literally nothing more than a sequence of memory locations.

That contiguous layout makes working with the array trivial. Given an array a or n-byte elements, the first element—index 0—is at the same address as the array itself (&(a[0]) == &a in C parlance). To find any other one, all you have to do is multiply the index by the size of each element: &(a[i]) == &a + i * sizeof(int). Addition is just about the fastest thing a processor does, and word sizes as powers of 2 mean that the multiplication is nothing more than a bit shift, so array indexing is hard to beat.

Copying these arrays is easy, too: copy each element, and you’re done. Want to compare them? Nothing more than going down the line, looking for differences. Sure, that takes linear—O(n)—time, but it’s a great start. Of course, there are downsides, too. Arrays like this are fixed in size, and they all have to be the same type.

Complicating matters

There’s not much more to be said for the humble array, so let’s add some kinks. To start, what do you do if you don’t know all the values to begin with? Then, you need an uninitialized array, or a buffer. Compilers typically use a trick called a BSS segment to make these, while higher-level languages tend to initialize everything to a null value. Either way, all you really get is a block of memory that you’re expected to fill in later.

Changing the other assumptions of the array (fixed size and type) means changing the whole structure. Dynamically-sized arrays, like C++’s vector, need a different way of doing things. Usually, this means something like having an internal array—with a bit of room to grow—and some bookkeeping data. That gets into dynamic memory allocation, another topic for later, but from the programmer’s point of view, they work the same way. In fact, vector is required to be a drop-in replacement for arrays. (If you want arrays where elements can be of different types, like in JavaScript, then you have to abandon the simple mathematical formula and its blazing speed. At that point, you’re better off ditching the “array” concept completely.)

Moving up to higher levels doesn’t really change how an array functions. At its core, it’s still a sequence of values. One of JavaScript’s newer features is the typed array, which is exactly that. It’s intended to be used where speed is of the essence, and it’s little more than a friendlier layer on top of C-style arrays.

Implementation details

Just about every usable language already has something like an array, so there’s almost never a need to make one yourself. Indeed, it’s nearly impossible to do so. But maybe you’re working in assembly language. There, you don’t have the luxury.

Fixed-size arrays are nothing more than blocks of memory. If your array has n elements, and each one is size s, then you need to reserve n * s bytes of memory. That’s it. There’s your array. If you need it initialized, then fill it in as necessary.

Element access uses the formula from above. You need to know the address a of the array, the element size s, and the index i. Then, accessing an element is nothing more than loading the value at a + i * s. Note, though, that this means elements are numbered starting at 0. (And it’s exactly why, for that matter.)

Since arrays are dumb, you can pass them around as blocks of memory, but you always need to know their size. If you’re not careful, you can easily get buffer overflows and other out-of-bounds conditions. That’s the reason why so many “safe” C functions like snprintf take an extra “maximum size” argument. The array-as-memory-block notion, incidentally, is why C lets you treat pointers and arrays as the same thing.

The end

The array, in whatever form, is the most basic of data structures, so it made for a good starting point. I hope it set the tone for this series. Later on, I’ll get into more complicated structures and algorithms, like linked lists, sorting, and so on. It’s all stuff that programmers in something like JavaScript never need to worry about, but it’s fun to peek under the hood, isn’t it?

Leave a Reply

Your email address will not be published. Required fields are marked *