This is intended as a high-level overview of lower-level memory management, with a focus on how it interacts with Rust.
When you run a program, it gets loaded into memory. The actual process is beyond the scope of this article, and involves things like dynamic linking, but the end result is that the address space of the program contains the following:
- Static data: if you write static FOO: i32 = 42; then FOO's address would be here. This data is read/write.
- The heap: Box something in Rust, call new in C++ or malloc in C, and this is where the value will be stored.
- The stack.

The stack is where local variables are stored. When a function is called, the stack grows, from the top of memory (higher addresses), downwards (to lower addresses). When a function returns, the stack shrinks again. In a highly recursive program (or, in theory, one in which a lot of distinct functions are called), one might run out of stack space: this is a stack overflow.
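As a small sketch of that failure mode, a recursive function with no base case will eventually exhaust the stack (at least in a debug build; an optimizing compiler may turn simple recursion like this into a loop):

fn depth(n: u64) -> u64 {
    // Each call needs a new stack frame; with no base case, the program
    // eventually aborts with a stack overflow.
    depth(n + 1) + 1
}

fn main() {
    println!("{}", depth(0));
}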
The amount of stack space allocated to a program is set by the OS: 1 MiB on Windows, 8 MiB on Linux or MacOS. There are ways to increase the size of the stack: on Windows, it's typically configured by a compiler flag, while Linux and MacOS manage it at runtime through ulimit.
In a multi-threaded program, each thread has its own stack. Sometimes the stack size of a new thread is smaller than the main thread's; other times it's the same. MacOS, for example, only gives new threads 512 KiB of stack space.
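If the default isn't right for a particular thread, the standard library lets you request a stack size explicitly when spawning it; a minimal sketch (the 4 MiB figure is arbitrary):

use std::thread;

fn main() {
    // Request a 4 MiB stack for the new thread instead of the platform default.
    let handle = thread::Builder::new()
        .stack_size(4 * 1024 * 1024)
        .spawn(|| {
            println!("running with a 4 MiB stack");
        })
        .expect("failed to spawn thread");
    handle.join().unwrap();
}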
There are four main reasons I can think of to try to keep data on the stack, rather than the heap:
1. There's overhead in using the allocator (more on that later), so it's often more efficient if you can avoid having to do so.
2. Cleaning up data from the stack is straightforward: just move the stack pointer back up. Depending on the types on the stack, the program may need to run clean-up code (e.g. call destructors in C++, or drop in Rust), but there's no need to deallocate the memory.
3. Your program touches the stack often enough that its top is almost certainly in the L1 cache. This means that accesses to it will (probably) be fast. Note that this doesn't mean access to values on the heap will be slow! There's nothing special about heap memory or stack memory other than typical access patterns.
4. This isn't universally true, but the compiler can more often be certain that values local to the function can only be accessed by that function, allowing more optimizations. For instance, if there are registers to spare, it may be able to avoid ever storing a value on the stack at all!
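As a small illustration of the first and last points, here's a minimal sketch: the array lives in the function's stack frame (or possibly entirely in registers), while Box::new asks the allocator for heap memory.

fn main() {
    let on_stack = [0u8; 64];          // no allocator involved
    let on_heap = Box::new([0u8; 64]); // one heap allocation, freed when dropped
    println!("{} {}", on_stack.len(), on_heap.len());
}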
The exact usage of the stack depends heavily on the calling conventions for the given CPU architecture, operating system, and even compiler: Windows and Linux follow different, incompatible conventions, and a compiler might use more optimized conventions for functions that aren't part of the program's external API. On x86, there are two registers relating to the stack: EBP, which points to the bottom of the current stack frame, and ESP, which points to the top. EBP is set at the start of a function, while ESP can move more freely. On x86-64, the registers are known as RBP and RSP respectively.[1]
A naïve example of how the stack is used when a function is called on x86:
1. The caller invokes the function with the call instruction, which pushes the return address onto the stack.
2. The function pushes the caller's EBP onto the stack, and sets it to the current value of ESP. This function call's stack frame is now established.
3. ESP is adjusted to make space for local variables, as well as space for any other registers whose values will be overwritten: calling conventions indicate which registers will not be preserved.
4. The body of the function executes.
5. The return value, if any, is placed in EAX.
6. ESP is reset to the value stored in EBP, and the original value of EBP is recovered from the stack. This might be done explicitly, or by the leave instruction.
7. The ret instruction pops the return address off the stack and jumps to it, returning control to the calling function.

There are various optimizations that can be applied:
- One can omit steps 2 and 6, with the compiler accessing stack values relative to ESP instead of EBP, and manually adjusting ESP at the end. This requires the compiler to track adjustments to ESP.
- Another optimization (seen on x86-64, not x86) is to allow functions to freely use memory below the current value of RSP, saving the effort of adjusting it. Adjustment will still be necessary if the function calls another function, but this can save instructions in leaf functions. The reserved space below RSP is known as "the red zone".
For a more in-depth explanation, see this article on how the stack works on x86, and this related one on the stack on x86-64. The Wikipedia article on x86 calling conventions was also helpful.
Having an in-depth understanding of how the stack works, calling conventions, assembly and ABIs is not necessary, and if you're finding yourself having to dig into it, it's either something you enjoy; you're having a really bad day; or quite likely, both.
[1] The original set of 16-bit registers were AX, BX, CX and DX, along with special-purpose registers like SP and BP, with the *X registers further subdivided into 8-bit halves (AH for the upper byte, AL for the lower, etc.). When 32-bit registers were added, they were indicated by prepending an E to the name: EAX, EBX, ESP, etc., to indicate that the register was "Extended". With the move to 64-bit, the E prefix was replaced with R, just standing for Register.
The heap is where the bulk of your program's data is stored. Unlike the stack, whose size is limited, you can store as much as you want on the heap, until you either run out of available memory or address space.
The heap doesn't start massive, though: your program needs to request that the OS allocate it additional memory. Depending on the OS and how it's configured, this request might be rejected with an out-of-memory error, or the request might succeed, but attempting to use the memory could result in the process getting killed. The latter is due to a policy known as "overcommit": allowing more memory to be allocated to programs than is available, on the assumption that the programs will either not make full use of that memory, or that some of the programs will end before the others need their memory. Sometimes this assumption fails, and the OS will need to kill processes to free up memory.
When you Box a value in Rust, you're asking for it to be stored on the heap. Other smart pointers, like Rc and Arc, also store their values on the heap.
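A minimal sketch of all three; note that cloning an Rc or Arc only bumps a reference count, and doesn't copy the heap data.

use std::rc::Rc;
use std::sync::Arc;

fn main() {
    let boxed = Box::new(42);                    // the i32 lives on the heap
    let shared = Rc::new(String::from("hello")); // heap-allocated, reference-counted
    let threadsafe = Arc::new(vec![1, 2, 3]);    // like Rc, but atomically counted
    let another = Rc::clone(&shared);            // no new heap allocation for the String
    println!("{boxed} {shared} {another} {:?}", threadsafe);
}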
Some types store data on the heap for you, the most common being Vec. Given
let list = vec![1, 2, 3, 4];
then list will be three words in size (a pointer to the data, the current length of the Vec, and its capacity; 24 bytes on 64-bit systems), while on the heap there would be 16 bytes allocated to store the Vec's contents: 1, 2, 3, 4. Moving a Vec will only move the pointer, length and capacity; the contents will remain where they are on the heap.
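A small sketch of both points, using std::mem::size_of and the fact that the data pointer survives a move:

use std::mem::size_of;

fn main() {
    // On a 64-bit target, Vec<i32> is a pointer, a length and a capacity: three 8-byte words.
    assert_eq!(size_of::<Vec<i32>>(), 24);

    let list = vec![1, 2, 3, 4];
    let data_ptr = list.as_ptr();

    // Moving the Vec moves only those three words; the heap contents stay put.
    let moved = list;
    assert_eq!(moved.as_ptr(), data_ptr);
    println!("still pointing at {:p}", moved.as_ptr());
}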
Although not true for all container types in Rust, some like Vec have an optimization where an empty Vec created by Vec::new() or vec![] doesn't allocate.
When the capacity of a Vec is exceeded, it makes a new, larger allocation, copies the current contents from the old allocation to the new, and then frees the old allocation. Each time it grows this way, its capacity is doubled.
In the case that the Vec is empty, doubling obviously won't work, so the initial non-zero capacity depends on the size of the values being stored in the Vec. If they're single-byte, the capacity will be set to 8; 4 if the values are less than or equal to one kibibyte in size; and 1 otherwise.
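A small sketch of this growth pattern for a Vec<i32>; the capacities printed here reflect the current standard library implementation and are not guaranteed:

fn main() {
    let mut v: Vec<i32> = Vec::new();
    assert_eq!(v.capacity(), 0); // an empty Vec hasn't allocated yet

    let mut capacities = Vec::new();
    for i in 0..20 {
        v.push(i);
        if capacities.last() != Some(&v.capacity()) {
            capacities.push(v.capacity());
        }
    }
    // With the current implementation this prints [4, 8, 16, 32]:
    // an initial capacity of 4 for 4-byte elements, then doubling.
    println!("{capacities:?}");
}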
While the push that causes the Vec to exceed its capacity will be expensive, because the capacity doubles each time, the average cost of pushing an element is constant. Still, if you know that your Vec will reach a certain size, it can be useful to initialize it with Vec::with_capacity to avoid unnecessary reallocations.
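For example, if you know you're about to push a thousand elements, pre-sizing the Vec means a single up-front allocation, which we can observe by checking that the data pointer never changes:

fn main() {
    let mut v = Vec::with_capacity(1000);
    let before = v.as_ptr();
    for i in 0..1000 {
        v.push(i);
    }
    // No reallocation was needed, so the buffer never moved.
    assert_eq!(v.as_ptr(), before);
}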
While your program could manage making requests to the OS for memory and deciding what data goes where, more often this is a task delegated to a memory allocator. There is typically a standard allocator provided by the OS (e.g., the malloc in glibc on Linux), but you can use alternate allocators like jemalloc (tu t'allocs, etc.), or even write your own. If you're writing for an embedded system, or WASM, there is no system allocator, so you'll need to provide your own, often as a library.
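In Rust, swapping the allocator for the whole program is a one-liner once the allocator crate is added; a sketch, assuming the tikv-jemallocator crate (the maintained fork of jemallocator):

use tikv_jemallocator::Jemalloc;

// Every Box, Vec, String, etc. in this program now allocates through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    let data = vec![0u8; 1024];
    println!("{} bytes allocated via jemalloc", data.len());
}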
Traditionally, malloc on Unix would use the brk or sbrk syscalls to request memory; these days, mmap is used instead. On Windows, the call is VirtualAlloc.
Writing an allocator isn't terribly difficult. The trick is writing a good allocator.
A good allocator should be fast. Most allocators need to track what memory is available and decide which block to hand out for each request; there is a lot of bookkeeping. On the other hand, a bump allocator is incredibly simple: track the position of the next possible allocation, and the available space. Rust's documentation includes the implementation of a simple bump allocator that works from a fixed pool of memory.
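A condensed sketch along the same lines (adapted freely, not the documentation's exact code): a fixed arena, a counter of remaining bytes, and allocation by bumping that counter downwards.

use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::ptr::null_mut;
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 128 * 1024;

struct BumpAllocator {
    arena: UnsafeCell<[u8; ARENA_SIZE]>,
    remaining: AtomicUsize, // bytes of the arena not yet handed out
}

// The arena is only ever accessed through atomically reserved, non-overlapping ranges.
unsafe impl Sync for BumpAllocator {}

unsafe impl GlobalAlloc for BumpAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let size = layout.size();
        let align = layout.align(); // always a power of two
        let mut offset = 0;
        // Reserve `size` bytes from the end of the arena, rounding down for alignment.
        let reserved = self.remaining.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |mut rem| {
            if size > rem {
                return None; // out of arena space
            }
            rem -= size;
            rem &= !(align - 1);
            offset = rem;
            Some(rem)
        });
        match reserved {
            Ok(_) => self.arena.get().cast::<u8>().add(offset),
            Err(_) => null_mut(),
        }
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // A bump allocator never reuses memory: freed allocations are simply leaked.
    }
}

#[global_allocator]
static ALLOCATOR: BumpAllocator = BumpAllocator {
    arena: UnsafeCell::new([0; ARENA_SIZE]),
    remaining: AtomicUsize::new(ARENA_SIZE),
};

fn main() {
    // Every allocation in this program now comes out of the 128 KiB arena.
    let v = vec![1u32, 2, 3];
    println!("allocated at {:p}", v.as_ptr());
}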
A more sophisticated allocator will make an effort to reuse free'd allocations, rather than leaking them. Note that freed memory is rarely returned to the operating system. The strategies used to manage freed or not-yet-allocated memory are much of what distinguishes one allocator from another. The toy allocator linked above could be modified to maintain a list of available allocations. When a request is made, it could walk the list for a large-enough allocation, split off as much space as was requested, and return it. When memory is freed, the new space is added to the free list. An even more clever implementation would glue adjacent free allocations back together.
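As a sketch of that bookkeeping (made-up names, ignoring alignment): track free ranges as (offset, length) pairs over an arena, allocate first-fit by splitting a range, and glue neighbouring ranges back together on free.

// A toy first-fit free list over a fixed-size arena.
struct FreeList {
    ranges: Vec<(usize, usize)>, // (offset, length), kept sorted by offset
}

impl FreeList {
    fn new(arena_len: usize) -> Self {
        FreeList { ranges: vec![(0, arena_len)] }
    }

    // Walk the list for the first range that's large enough, and split off the front of it.
    fn alloc(&mut self, len: usize) -> Option<usize> {
        let i = self.ranges.iter().position(|&(_, rlen)| rlen >= len)?;
        let (off, rlen) = self.ranges[i];
        if rlen == len {
            self.ranges.remove(i);
        } else {
            self.ranges[i] = (off + len, rlen - len);
        }
        Some(off)
    }

    // Add the freed range back, coalescing with its neighbours where they touch.
    fn dealloc(&mut self, off: usize, len: usize) {
        let i = self.ranges.partition_point(|&(o, _)| o < off);
        self.ranges.insert(i, (off, len));
        if i + 1 < self.ranges.len() && self.ranges[i].0 + self.ranges[i].1 == self.ranges[i + 1].0 {
            self.ranges[i].1 += self.ranges[i + 1].1;
            self.ranges.remove(i + 1);
        }
        if i > 0 && self.ranges[i - 1].0 + self.ranges[i - 1].1 == self.ranges[i].0 {
            self.ranges[i - 1].1 += self.ranges[i].1;
            self.ranges.remove(i);
        }
    }
}

fn main() {
    let mut fl = FreeList::new(1024);
    let a = fl.alloc(100).unwrap();
    let _b = fl.alloc(200).unwrap();
    fl.dealloc(a, 100);            // leaves a 100-byte hole at the start
    let c = fl.alloc(50).unwrap(); // the hole is split; 50 bytes remain in it
    println!("c = {c}, free ranges: {:?}", fl.ranges);
}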
This modified allocator would work, but would suffer from the problem of fragmentation: multiple allocations are made, and one of the ones in the middle is freed, leaving a hole that can only be filled by an allocation of the same size or smaller. A smaller allocation takes some of that space, and now an allocation of the original size can no longer fit there. Over time, the result can be a process with plenty of free memory, but no contiguous blocks large enough for its needs.
One approach to this problem, used by modern allocators like jemalloc, is to maintain multiple free lists. In particular, allocations are divided up into separate size classes, with each size class managed by its own free list. By grouping similarly sized allocations together in memory, and rounding allocation sizes up as necessary, you can reduce fragmentation and also simplify free-list management. For example, jemalloc might have size classes at 128 bytes, 160 bytes, 192 bytes, 224 bytes and 256 bytes (as well as larger and smaller size classes, but no others in this range). An allocation of 129 bytes will be rounded up to 160 bytes, for a wastage of just under 20%. An allocation of 225 bytes would be rounded up to 256 bytes, with 12.1% waste. These are the worst cases!
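To make the rounding arithmetic concrete, here's a small sketch using the hypothetical size classes above (the classes a real jemalloc build uses depend on its configuration):

// Size classes for illustration only.
const CLASSES: [usize; 5] = [128, 160, 192, 224, 256];

fn round_up(request: usize) -> Option<usize> {
    CLASSES.iter().copied().find(|&class| class >= request)
}

fn main() {
    for request in [129usize, 225] {
        let class = round_up(request).unwrap();
        let waste = 100.0 * (class - request) as f64 / class as f64;
        // Prints 19.4% for 129 -> 160 and 12.1% for 225 -> 256.
        println!("{request} bytes -> {class}-byte class ({waste:.1}% waste)");
    }
}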
Other modern allocators, like Google's tcmalloc and Microsoft's mimalloc, also use size classes, with further optimizations to reduce contention between threads and improve cache locality.
Garbage collected languages use memory allocators too: after all, they often need to put more values on the heap than languages like Rust! Some, like Python, use reference counting as a first line of garbage collection, with calls to malloc and free on the underlying allocator.
Compacting garbage collectors can make allocation very fast by using a bump allocator. The garbage collector periodically cleans up inaccessible allocations, and defragments the remainder.