The stack, heap and allocators

2022-09-14
Tags: code, rust, concepts, memory

This is intended as a high-level overview of lower-level memory management, with a focus on how it interacts with Rust.

Segments of your program

When you run a program, it gets loaded into memory. The actual process is beyond the scope of this article, and involves things like dynamic linking, but the end result is that the address space of the program contains the following:

  1. The machine code instructions for your program. This memory is marked as read-only and executable.
  2. Read-only constant data, like string literals. This memory is marked as read-only.
  3. Memory for (constant) initialized global or static variables. If you write
    static FOO: i32 = 42;
    
    then FOO's address would be here. This data is read/write.
  4. Uninitialized global or static variables. These memory pages are all initialized to zero, so their initial values are knowable (unlike uninitialized local function variables), but might not be valid values for their type.
  5. The heap. When you Box something in Rust, call new in C++ or malloc in C, this is where the value will be stored.
  6. The stack. The stack typically lives at the opposite end of the address space from the heap, with the two growing towards each other. The stack is used to store function-local data: variables, but also metadata like return addresses.
  7. The environment: environment variables and command-line arguments.
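
To make this concrete, here's a minimal sketch that prints an address from a few of these regions. The exact values vary from run to run (thanks to address-space layout randomization), but on a typical desktop OS the regions sit visibly far apart:

    static FOO: i32 = 42; // initialized static data (region 3)

    fn main() {
        let local = 1;           // lives on the stack (region 6)
        let boxed = Box::new(2); // lives on the heap (region 5)
        println!("code:   {:p}", main as fn()); // machine code (region 1)
        println!("static: {:p}", &FOO);
        println!("heap:   {:p}", &*boxed);
        println!("stack:  {:p}", &local);
    }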

The Stack

Vital details

The stack is where local variables are stored. When a function is called, the stack grows from the top of memory (higher addresses) downwards (to lower addresses). When a function returns, the stack shrinks again. In a deeply recursive program (or one with a deep enough chain of nested calls), one might run out of stack space: this is a stack overflow.
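
One way to watch this happen (in an unoptimized build, at least) is to print the address of a local at each level of a recursion; each frame's local should land at a lower address than its caller's:

    fn descend(depth: u32) {
        let marker = 0u8; // a stack-resident local
        println!("depth {depth}: {:p}", &marker);
        if depth < 3 {
            descend(depth + 1); // each frame sits below the previous one
        }
    }

    fn main() {
        descend(0);
    }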

The amount of stack space allocated to a program is set by the OS: 1 MiB on Windows, 8 MiB on Linux or macOS. There are ways to increase the size of the stack: on Windows it's typically configured at build time by a compiler or linker flag, while Linux and macOS manage it at runtime through ulimit.

In a multi-threaded program, each thread has its own stack. Sometimes the stack size of a new thread is smaller than the main thread's; other times it's the same. macOS, for example, only gives new threads 512 KiB of stack space.
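
Rust's standard library spawns threads with its own fixed default stack size (2 MiB at the time of writing, overridable with the RUST_MIN_STACK environment variable), and std::thread::Builder lets you choose a size per thread:

    use std::thread;

    fn main() {
        let handle = thread::Builder::new()
            .stack_size(4 * 1024 * 1024) // 4 MiB, instead of the default
            .spawn(|| {
                // room for deeper recursion than a 512 KiB stack would allow
            })
            .unwrap();
        handle.join().unwrap();
    }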

Why the stack is important

There are four main reasons I can think of to try to keep data on the stack, rather than the heap:

You don't need to allocate memory

There's overhead in using the allocator (more on that later), so it's often more efficient if you can avoid having to do so.

You don't need to de-allocate memory

Cleaning up data from the stack is straightforward: just move the stack pointer back up. Depending on the types on the stack, the program may need to run clean-up code (e.g. call destructors in C++, drop in Rust), but there's no need to deallocate the memory.

It's all close to each other

Your program needs to touch the stack enough that its top is almost certainly in the L1 cache. This means that accesses to it will (probably) be fast. Note that this doesn't mean access to values on the heap will be slow! There's nothing special about heap memory or stack memory other than typical access patterns.

It's easier for the compiler to reason about

This isn't universally true, but the compiler can more often be certain that values local to the function can only be accessed by that function, allowing more optimizations. For instance, if there are registers to spare, it may be able to avoid ever storing a value on the stack at all!

The nitty-gritty

The exact usage of the stack depends heavily on the calling conventions for the given CPU architecture, operating system, and even compiler: Windows and Linux follow different, incompatible conventions, and a compiler might use more optimized conventions for functions that aren't part of the program's external API. On x86, there are two registers relating to the stack: EBP, which points to the bottom of the current stack frame, and ESP, which points to the top. EBP is set at the start of a function, while ESP can move more freely. On x86-64, the registers are known as RBP and RSP respectively[1].

A naïve example of how the stack is used when a function is called on x86:

  1. Calling conventions dictate which registers may be used to pass function arguments. If there are more function arguments than available registers, the remaining arguments are stored on the stack.
  2. The caller pushes the return address onto the stack, and jumps to the start of the called function. In x86 assembly, both steps are performed by the call instruction.
  3. The called function records the old value of EBP onto the stack, and sets it to the current value of ESP. This function call's stack frame is now established.
  4. ESP is adjusted to make space for local variables, as well as space for any other registers whose values will be overwritten: calling conventions indicate which registers will not be preserved.
  5. When the function is ready to return, it stores its result in EAX.
  6. The callee restores ESP to the value stored in EBP and recovers the original value of EBP from the stack. This might be done explicitly, or by the leave instruction.
  7. The callee pops the return address off the top of the stack and jumps to it. This is done by the ret instruction.
  8. Some calling conventions require that the caller clean up any function arguments from the stack; others require the callee to do so.

There are various optimizations that can be applied:

One can omit steps 3 and 6, with the compiler accessing stack values relative to ESP instead of EBP and manually adjusting ESP at the end. This requires the compiler to track its adjustments to ESP, but frees up EBP for use as a general-purpose register.

Another optimization (seen on x86-64, not x86) is to allow functions to freely use memory below the current value of RSP, saving the effort of adjusting it. Adjustment is still necessary if the function calls another function, but this can save instructions in leaf functions. The reserved space below RSP is known as "the red zone".

For a more in-depth explanation, see this article on how the stack works on x86, and this related one on the stack on x86-64. The Wikipedia article on x86 calling conventions was also helpful.

Having an in-depth understanding of how the stack works, calling conventions, assembly and ABIs is not necessary, and if you're finding yourself having to dig into it, it's either something you enjoy; you're having a really bad day; or quite likely, both.

[1] The original 16-bit registers were AX, BX, CX and DX, along with special-purpose registers like SP and BP, with the *X registers further subdivided into upper and lower 8-bit halves (AL, AH, etc.). When 32-bit registers were added, an E was prepended to the name (EAX, EBX, ESP, etc.) to indicate that the register was "Extended". With the move to 64-bit, the E prefix was replaced with R, just standing for Register.

The Heap

The heap is where the bulk of your program's data is stored. Unlike the stack, whose size is limited, you can store as much as you want on the heap, until you either run out of available memory or address space.

The heap doesn't start massive, though: your program needs to request that the OS allocate it additional memory. Depending on the OS and how it's configured, this request might be rejected with an out-of-memory error, or the request might succeed, but attempting to use the memory could result in the process getting killed. The latter is due to a policy known as "overcommit": allowing more memory to be allocated to programs than is available, on the assumption that the programs will either not make full use of that memory, or that some of the programs will end before the others need their memory. Sometimes this assumption fails, and the OS will need to kill processes to free up memory.

When you Box a value in Rust, you're asking for it to be stored on the heap. Other smart pointers, like Rc and Arc, also store their values on the heap.
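
As a minimal sketch, each of these places its integer on the heap; only a pointer is held inline (for Rc and Arc, the reference counts live on the heap next to the value):

    use std::rc::Rc;
    use std::sync::Arc;

    fn main() {
        let boxed = Box::new(1);
        let shared = Rc::new(2);
        let threaded = Arc::new(3);
        // All three integers live on the heap; this prints their heap addresses.
        println!("{:p} {:p} {:p}", &*boxed, &*shared, &*threaded);
    }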

Some types store data on the heap for you, the most common being Vec. Given

let list = vec![1, 2, 3, 4];

then list will be three words in size (a pointer to the data, the current length of the Vec and its capacity; 24 bytes on 64-bit systems), while on the heap there would be 16 bytes allocated to store the Vec's contents: 1, 2, 3, 4. Moving a Vec will only move the pointer, length and capacity; the contents will remain where they are on the heap.

Not all container types in Rust do this, but some, like Vec, have an optimization where an empty Vec, as created by Vec::new() or vec![], doesn't allocate at all.
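
A quick check of those claims, assuming a 64-bit target:

    fn main() {
        // The Vec handle itself: pointer + length + capacity = three words.
        assert_eq!(std::mem::size_of::<Vec<i32>>(), 24);

        let list = vec![1, 2, 3, 4];
        // The heap allocation behind it: capacity * size_of::<i32>() bytes.
        println!("{} bytes on the heap", list.capacity() * std::mem::size_of::<i32>());

        // And an empty Vec has no heap allocation at all:
        assert_eq!(Vec::<i32>::new().capacity(), 0);
    }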

A note on growing Vecs

When the capacity of a Vec is exceeded, it makes a new, larger allocation, copies the current contents from the old allocation to the new, and then frees the old allocation. Each time it grows this way, its capacity is doubled.

In the case that the Vec is empty, doubling obviously won't work. Instead, the initial non-zero capacity depends on the size of the element type: 8 if the elements are a single byte, 4 if they're one kibibyte or smaller, and 1 otherwise.

While the push that causes the Vec to exceed its capacity will be expensive, doubling in size each time makes the amortized cost of pushing an element constant. Still, if you know that your Vec will reach a certain size, it can be useful to initialize it with Vec::with_capacity to avoid unnecessary reallocations.
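
You can watch the doubling happen through Vec::capacity; the exact growth sequence is an implementation detail of the standard library, so treat this as a sketch:

    fn main() {
        let mut v = Vec::new();
        let mut last = v.capacity(); // 0: no allocation yet
        for i in 0..100 {
            v.push(i);
            if v.capacity() != last {
                last = v.capacity();
                println!("len {:3} -> capacity {:3}", v.len(), last);
            }
        }

        // If the final size is known, allocate once up front:
        let mut w = Vec::with_capacity(100);
        w.extend(0..100);
        assert!(w.capacity() >= 100); // guaranteed to be at least 100
    }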

Memory allocators

While your program could manage making requests to the OS for memory and deciding what data goes where, more often this is a task delegated to a memory allocator. There is typically a standard allocator provided by the OS (e.g., the malloc in glibc on Linux), but you can use alternate allocators like jemalloc (tu t'allocs, etc.), or even write your own. If you're writing for an embedded system, or WASM, there is no system allocator, so you'll need to provide your own, often as a library.
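
In Rust, swapping the global allocator takes one attribute. This sketch explicitly selects the system allocator; crates like jemallocator expose their allocators through the same mechanism:

    use std::alloc::System;

    // Every Box, Vec, String, etc. in the program now allocates through this.
    #[global_allocator]
    static GLOBAL: System = System;

    fn main() {
        let v = vec![1, 2, 3];
        println!("{v:?}");
    }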

Traditionally, malloc on Unix would use the brk or sbrk syscalls to request memory; these days, mmap is generally used instead (glibc's malloc still grows its main heap with brk, but uses mmap for large allocations). On Windows, the call is VirtualAlloc.

Writing an allocator isn't terribly difficult. The trick is writing a good allocator.

A good allocator should be fast. Most allocators need to track what memory is available and decide which block to use for each request: there is a lot of bookkeeping. A bump allocator, on the other hand, is incredibly simple: track the position of the next possible allocation, and the space remaining. Rust's documentation includes the implementation of a simple bump allocator that works from a fixed pool of memory.
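
Here's a minimal bump-allocator sketch over a fixed buffer; it's illustrative only, and deliberately simpler than the GlobalAlloc-based example in the documentation:

    struct Bump {
        buffer: [u8; 1024], // the fixed pool
        next: usize,        // offset of the first free byte
    }

    impl Bump {
        fn new() -> Self {
            Bump { buffer: [0; 1024], next: 0 }
        }

        // Hand out `size` bytes, or None when the pool is exhausted. There is
        // no `free`: a pure bump allocator never reuses memory.
        fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
            let end = self.next.checked_add(size)?;
            if end > self.buffer.len() {
                return None;
            }
            let slice = &mut self.buffer[self.next..end];
            self.next = end;
            Some(slice)
        }
    }

    fn main() {
        let mut bump = Bump::new();
        let a = bump.alloc(16).unwrap();
        a[0] = 7;
        // Allocation is just a bounds check and an addition:
        let b = bump.alloc(64).unwrap();
        b[0] = 9;
        println!("two allocations from a 1024-byte pool");
    }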

A more sophisticated allocator will make an effort to reuse freed allocations, rather than leaking them. (Note that freed memory is rarely returned to the operating system.) The strategies used to manage freed or not-yet-allocated memory are much of what distinguishes one allocator from another. The toy allocator above could be modified to maintain a list of available allocations: when a request is made, it walks the list for a large-enough block, splits off as much space as was requested, and returns it; when memory is freed, the new space is added to the free list. An even more clever implementation would glue adjacent free allocations back together.
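
A sketch of that modification, managing (offset, length) ranges of a pool with a first-fit strategy (gluing adjacent blocks back together is left out):

    // Each entry is a (start, len) range of free space in the pool.
    struct FreeList {
        free: Vec<(usize, usize)>,
    }

    impl FreeList {
        fn new(pool_size: usize) -> Self {
            FreeList { free: vec![(0, pool_size)] }
        }

        // First fit: find any block big enough and split off the front of it.
        fn alloc(&mut self, size: usize) -> Option<usize> {
            let i = self.free.iter().position(|&(_, len)| len >= size)?;
            let (start, len) = self.free[i];
            if len == size {
                self.free.remove(i);
            } else {
                self.free[i] = (start + size, len - size);
            }
            Some(start)
        }

        // Freed space just goes back on the list; a cleverer version would
        // merge it with adjacent free blocks to fight fragmentation.
        fn free(&mut self, start: usize, size: usize) {
            self.free.push((start, size));
        }
    }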

This modified allocator would work, but would suffer from the problem of fragmentation: multiple allocations are made, and one of the ones in the middle is freed, leaving a hole that can only be filled by an allocation of the same size or smaller. A smaller allocation takes some of that space, and now the original allocation can no longer fit there. Over time, the result can be a process with plenty of free memory, but no contiguous blocks large enough for its needs.

One approach to this problem, used by modern allocators like jemalloc, is to maintain multiple free lists: allocations are divided into separate size classes, with each size class managed by its own free list. By grouping similarly sized allocations together in memory, and rounding allocation sizes up as necessary, you can reduce fragmentation and also simplify free-list management. For example, jemalloc might have size classes at 128 bytes, 160 bytes, 192 bytes, 224 bytes and 256 bytes (as well as larger and smaller size classes, but no others in this range). An allocation of 129 bytes will be rounded up to 160, for a wastage of just under 20%; an allocation of 225 bytes would be rounded to 256 bytes, with 12.1% waste. These are the worst cases!
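
Reproducing that arithmetic, with the size classes hard-coded from the example above (real jemalloc has many more):

    fn size_class(request: usize) -> Option<usize> {
        // Round a request up to the smallest class that fits.
        [128, 160, 192, 224, 256].into_iter().find(|&class| class >= request)
    }

    fn main() {
        for request in [129, 225] {
            let class = size_class(request).unwrap();
            let waste = 100.0 * (class - request) as f64 / class as f64;
            println!("{request} B -> {class} B class ({waste:.1}% waste)");
        }
    }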

Other modern allocators, like Google's tcmalloc and Microsoft's mimalloc, also use size classes, with further optimizations to reduce contention between threads and improve cache locality.

Garbage collection

Garbage collected languages use memory allocators too: after all, they often need to put more values on the heap than languages like Rust! Some, like Python, use reference counting as a first line of garbage collection, with calls to malloc and free on the underlying allocator.

Compacting garbage collectors can make allocation very fast by using a bump allocator: the garbage collector periodically cleans up inaccessible allocations and compacts the survivors, leaving a fresh contiguous region for the bump allocator to work from.