Avatar of Adam Fields

Memory in Computer Programs

| 11 minutes

Systems programming is writing software that closely interacts with the device’s hardware and system services. While you may never write embedded software or operating systems, understanding what happens when you initialize a variable or call a function is a good thing.

Motivation

In my experience as a web developer, memory only came up when the program ran out of it. Long-running Node.js services would gradually consume more memory until PM2 or Kubernetes restarted them. In React, not cleaning up event listeners or intervals when components unmount would cause memory leaks. In most cases, following well-established patterns and best practices is enough to avoid memory issues. This is mainly because many languages have garbage collectors that automatically manage memory.

I wasn’t super focused on memory management until I started building my own image generation tools where large models get swapped in and out of RAM. Not only do you need enough memory, but it has to be contiguous (a single block). This requires a high-level understanding of how Python manages memory and also PyTorch’s CUDA cache.

This post is the result of my research into memory management, illustrated by some examples in C. I’ll also cover garbage collection in a few languages, and introduce Rust and Zig’s approaches to memory safety. The intended audience is developers looking to build intuition about how their programs work under the hood.

Computer Memory

There are two types of memory in a computer: volatile and non-volatile.

RAM (random access memory) is your computer’s main working memory and is volatile. This means that when the computer is turned off, the data in RAM is lost. When you launch a program, it is loaded into RAM from storage (disk) so your computer can access it quickly.

ROM (read-only memory) is non-volatile and retains data even when the computer is turned off. It is used to store permanent instructions like the BIOS (basic input/output system), which is necessary for your computer to turn on. ROM used to be truly immutable after manufacture, but modern computers use flash memory that can be updated.

Software doesn’t actually interact directly with RAM; it uses the virtual address space abstraction.

Virtual Address Space

The virtual address space (VAS) is the range of memory addresses that a process can use. Each process has its own VAS, which is isolated from other processes. The OS maps the VAS to physical memory addresses using a memory management unit (MMU).

Virtual Memory

In modern operating systems, paging or swapping is a technique where the OS moves same-size blocks of memory called pages between RAM and disk. This enables an abstraction known as virtual memory, where the OS can allocate more memory to a process than is available.

Accessing a page requires a lookup to find the physical address. When a process tries to access a memory page that is not in RAM, the MMU raises a page fault. The kernel handles the exception by loading the page from disk into RAM and updating the page table.

When a system spends more time swapping pages than executing instructions, it is said to be thrashing. As the working set grows, the system spends more time handling page faults, until reaching a point where it is overwhelmed.

The working set is the amount of memory a process is using at a given time interval. The concept was introduced by Peter Denning at MIT in The Working Set Model for Program Behavior. The model states that a process can only be in RAM if its working set can fit (all or nothing). To prevent thrashing, the system will swap entire processes in and out of memory rather than try to keep all processes in memory at once and swap individual pages.

Memory Access Patterns

Memory can be accessed in different patterns. Sequential access (reading memory addresses in order) is typically faster than random (jumping around) because modern CPUs and cache systems are optimized for locality of reference.

The locality of reference is the tendency for programs to access the same memory locations repeatedly. There are different types of locality such as:

  • Temporal locality - If a memory location is accessed, it will likely be accessed again soon. For example, variables in a loop.
  • Spatial locality - If a memory location is accessed, nearby memory locations will likely be accessed soon. For example, array elements.

Caches are specifically designed to exploit both types of locality. Recently accessed data is stored in a faster cache layer (temporal), and adjacent memory addresses are pre-fetched in anticipation of future access (spatial).

Caches

Caches are small, fast memory stores that hold frequently accessed data. Data is moved from main memory to cache in fixed-size blocks known as cache lines.

Modern CPUs have multiple levels of cache memory, typically L1, L2, and L3. For perspective, a L1 lookup only takes a few CPU cycles, while a main memory lookup can take hundreds.

Memory Addresses

The memory addresses are hexadecimal, which means base (radix) 16. The Greek héx means 6 and the Latin decem means 10. The letters A-F represent the numbers 10-15. The 0x prefix is a convention from C to denote hexadecimal numbers, but it’s not part of the number itself.

In the x86 architecture, 32-bit addresses have 8 digits, while 64-bit addresses have 16. A single 32-bit address can represent bytes (4gb) of memory. Each digit is known as a hex digit or nibble (half a byte).

Note

For simplicity, I’ll use 32-bit notation.

Address Space Layout Randomization

While not a memory management technique, address space layout randomization (ASLR) is a security technique that randomizes the memory layout of a process. This makes it harder for an attacker to exploit memory vulnerabilities because they can’t predict where things are going to be in memory. This requires the attacker to make a guess about the memory layout, and an incorrect guess will simply crash the program.

For example, ROP (return-oriented programming) is an exploit where the attacker finds gadgets (short sequences of instructions) in the program’s memory and chains them together to execute arbitrary code. Randomizing the addresses makes it harder to find these gadgets.

Memory Layout of a C Program

+------------------+  Highest address (0xFFFFFFFF)
|   Kernel Space   |  <- Reserved for OS
+------------------+
|    Call Stack    |  <- Function calls, local variables (LIFO)
|   [environment]  |     - Environment variables
|    [argv/argc]   |     - Command line arguments and count
|                  |     ↓ (grows downward)
|                  |
|                  |     ↑ (grows upward)
|       Heap       |  <- Dynamic allocation (malloc, free)
+------------------+
|       Data       |  <- Uninitialized variables
|  Segment (BSS)   |
+------------------+
|       Data       |  <- Initialized variables
|     Segment      |
+------------------+
|       Code       |  <- Program code (read-only)
|     Segment      |
+------------------+  Lowest address (0x00000000)

Process Memory

A process is an instance of a program running on a computer. Each process has its own VAS. The code is marked read-only and executable, while the data segments are read-write.

Programs often use shared libraries (.dll on Windows, .so on Linux), which are loaded into memory once and shared across processes. This is known as shared memory. Process memory is organized in different memory segments.

Segments organize program data based on how it’s used and accessed. This organization enables proper management and protection. Each segment has a specific purpose and permissions (read, write, execute). Segments are also used in object files to store different types of data. The below examples are in NASM syntax.

Code Segment

The code segment (also called text segment) contains the program’s executable instructions. In C, this would be the compiled machine code. The code segment is read-only to prevent accidental modification during execution. It is located at the lowest memory address.

Here’s a simple program that exits with code 0:

int main() {
    return 0;
}
section .text
    global _start
_start:
    mov eax, 1  ; syscall number for "exit"
    mov ebx, 0  ; exit code
    int 0x80    ; linux syscall interrupt

Data Segment

Variables initialized at compile time are stored in the data segment. Since these values are known at compile time, they’re stored directly in the executable file. For example:

char message[] = "Hello, World!";
section .data
    message db 'Hello, World!', 0  ; null-terminated string

Each initialized global variable adds to the size of your program on disk.

BSS

The block started by symbol (BSS) segment holds uninitialized variables and variables explicitly initialized to zero. Instead of storing zeros in the executable file, the BSS section just records how much zero-initialized memory is needed. The actual memory is allocated and zeroed when the program starts. This saves space in the executable file.

Whether data goes in the BSS or initialized data section is determined by how the variable is declared and initialized. If a variable is initialized to 0, it will also go in the BSS as a compiler optimization. Local variables in functions are stored on the stack, not in the data segment.

The following C program declares a global variable but doesn’t initialize it:

int count;
section .bss
    count resb 4  ; reserve 4 bytes

Segmentation Fault

A segmentation fault (segfault) occurs when a program tries to access a memory location that it isn’t allowed to access. The MMU detects this and generates a hardware exception. The OS catches the exception and sends a SIGSEGV signal to the program, which terminates it (you wouldn’t want a program to continue running after it tried to access memory it shouldn’t).

Dereferencing (accessing) a NULL pointer and accessing an array index that is out-of-bounds are common causes of segmentation faults.

Beyond segfaults, other types of memory errors include:

  • Memory leaks - Failing to free memory that is no longer needed.
  • Buffer overflows - Trying to write past the end of an array.
  • Dangling pointers - Trying to access memory that has been freed or is out of scope.

Call Stack

The call stack manages function execution in a program. When you call a function, the program needs to remember where to return to and handle local variables. The stack provides an elegant solution: it keeps track of function calls in the order they occur (LIFO). The stack grows downward, meaning that the top of the stack has a lower memory address than the bottom.

Stack Frames

Each function call creates a stack frame containing:

  • Return address (where to continue executing after the function returns)
  • Parameters passed to the function
  • Local variables
  • Saved register values

When a function returns, its stack frame is removed and execution continues at the saved return address. For example:

void greet(const char* name) {
    char message[100];
    sprintf(message, "Hello, %s!", name);
    printf("%s\n", message);
}

int main() {
    greet("Adam");
    return 0;
}

During execution, the call stack might look like this:

+------------------+  <- Stack pointer (ESP)
|  greet()         |  <- Current stack frame
|  message[100]    |     - local variable
|  name            |     - parameter
|  return address  |     - return address to main()
+------------------+  <- Base pointer (EBP)
|  main()          |  <- Caller's frame
|  return address  |     - return address to OS
+------------------+

Stack Overflow

The call stack has a limited size, and if a program uses more stack space than is available, it will cause a stack overflow. This can happen if a program has too many nested function calls (recursion) or if it declares a huge array as a local variable.

Heap

The heap is a region of memory used for dynamic memory allocation. The heap grows upward, meaning that the top of the heap has a higher memory address than the bottom. The heap is manually managed by the programmer.

Heap vs Stack Allocation

Local variables with fixed sizes, including arrays and strings declared within functions, are allocated on the stack. Unlike heap allocations, their size must be known at compile-time. This is fast because moving the stack pointer is simple arithmetic on a CPU register. However, the stack is limited in size and can’t grow dynamically.

void example() {
    int numbers[10];           // fixed-size array on stack
    char message[100];         // fixed-size string on stack
    double matrix[1000][1000]; // too large for stack!
}

When working with data from external sources like a web API, heap allocation is necessary since the response size is unpredictable.

// In reality you'd use libcurl
char* fetch_data() {
    size_t response_size = get_content_length_from_headers(); // hypothetical function
    char* data = malloc(response_size);
    return data;
}

Dynamic Memory Allocation

In C, malloc() is used to allocate memory on the heap, and free() is used to release it:

// Allocate memory for 100 integers
int* numbers = (int*)malloc(100 * sizeof(int));
if (numbers == NULL) {
    return 1; // exit with error
}

// Use the allocated memory
for (int i = 0; i < 100; i++) {
    numbers[i] = i;
}

// Free the allocated memory
free(numbers);

// A good practice is to set the pointer to NULL after freeing it
// This can prevent use-after-free bugs
numbers = NULL;

Unlike stack memory, heap memory is not automatically deallocated and must be manually freed. Failure to do so can lead to memory leaks. It’s a good practice to check if malloc() returns NULL, and always pair malloc() with free().

In CUDA, GPU memory is managed similarly with cudaMalloc() and cudaFree().

Memory Arenas

A memory arena is a technique where a large block of memory is pre-allocated and then divided into smaller blocks for use by the program. Instead of making small allocations from the system allocator, the program allocates from the arena. Go has been experimenting with adding a public Arena API in the standard library.

PyTorch uses a caching allocator to manage GPU memory internally. This is similar to an arena in that it pre-allocates a large block of memory and then divides it into smaller blocks for use by the tensors. When tensors are deleted, the blocks are not immediately released. This is why you might see more memory used than you expect when running nvidia-smi, and also why you sometimes need to call torch.cuda.empty_cache().

Fragmentation

When free memory is scattered in small blocks throughout the heap, it is fragmented. This makes it difficult to allocate large contiguous blocks of memory, leading to increased allocation times. Fragmentation can be internal (unused memory within an allocated block) or external (unused memory between allocated blocks). A common cause of fragmentation is using too many dynamic-sized allocations, where the size of an object is not known at compile time.

Garbage Collection

In languages like JavaScript and Python, the runtime environment includes a garbage collector (GC) that automatically frees memory when objects are no longer referenced or reachable.

Modern garbage collectors use a combination of strategies. For example, both JavaScript (V8) and Python use a generational garbage collector based on the assumption that most objects die young. The heap is divided into generations, and objects are promoted to older generations if they survive a collection cycle. The older generations are collected less frequently.

Python uses reference counting to track object references. When an object’s reference count reaches zero, it is immediately deallocated. However, reference counting can’t handle circular references, so Python also uses a cycle detector to break reference cycles. For example:

class Node:
    def __init__(self):
        self.next = None

# Create a cycle
a = Node()
b = Node()
a.next = b
b.next = a

When a and b go out of scope, reference counting alone won’t free them, since they still reference each other. The cycle detector will detect this and break the cycle.

Java’s JVM also uses a generational garbage collector. The JVM offers several garbage collection algorithms optimized for different scenarios. For example, the default G1 (garbage-first) collector divides the heap into regions and prioritizes collecting regions with the most garbage first. Modern collectors like ZGC and Shenandoah aim to maintain consistent low-latency.

Go runs a concurrent garbage collector in parallel with the program. It uses a tri-color marking algorithm to track object reachability. Sweeping happens lazily to minimize latency.

This HN thread offers perspectives on garbage collection from people who make video games and airplanes. This post from Uber reveals their approach to JVM garbage collection optimization. This post from Discord explains how they had to read Go’s source code to diagnose recurring latency spikes.

Smart Pointers

In C++, smart pointers are a way to manage memory automatically without garbage collection. Smart pointers are objects that wrap raw pointers and automatically free memory when the object is destroyed (goes out of scope).

There are three types of smart pointers in C++:

  • std::unique_ptr - Manages a single object.
  • std::shared_ptr - Several shared pointers can “own” the same object, and the object is deleted when the last shared pointer is destroyed.
  • std::weak_ptr - A non-owning (“weak”) reference to an object managed by a std::shared_ptr.

Ownership and Borrowing

Rust is a memory-safe systems programming language with a distinctive ownership and borrowing system. In a Rust program, each value has a single owner, and when the owner goes out of scope, the value is dropped (freed).

Borrowing allows you to reference data without taking ownership. There are two types of borrows in Rust:

  • Shared (&) - Allows multiple readers but no writers.
  • Mutable (&mut) - Allows a single writer.

Note

Read The Dark Arts of Unsafe Rust to learn about the unsafe keyword to bypass Rust’s safety guarantees.

Allocators

Zig is another systems programming language that takes a bit of a middle-ground between C and Rust. Zig provides an Allocator API in the standard library. There are multiple allocators available, including a general purpose allocator (GPA) that can prevent double-free and use-after-free bugs and detect memory leaks. The standard library also includes an ArenaAllocator struct, which is a memory arena implementation that can allocate memory in a single block and free it all at once.

Benjamin Feng has a great video on memory management and allocators in Zig:

Registers

A register is a small amount of fast memory that is part of the CPU. In the x86 architecture, there are 8 and 16 general-purpose registers (GPR) for x86-32 and x86-64, respectively. Registers are used to store data that is actively being processed.

In C, to access the registers directly, you would use inline assembly. Inline assembly is a way to write assembly code directly in a C program. This is useful for performance-critical code where you need to optimize the code at the instruction level.

Below is a diagram of the x86 (32-bit) registers:

+-------------------+-------------------------------------------+
| EAX (accumulator) | Used for arithmetic and return values     |
+-------------------+-------------------------------------------+
| EBX (base)        | General-purpose base pointer              |
+-------------------+-------------------------------------------+
| ECX (counter)     | Loop counter, string operations           |
+-------------------+-------------------------------------------+
| EDX (data)        | I/O operations, multiplying large values  |
+-------------------+-------------------------------------------+
| ESI (source idx)  | Points to source in string ops            |
+-------------------+-------------------------------------------+
| EDI (dest idx)    | Points to destination in string ops       |
+-------------------+-------------------------------------------+
| ESP (stack ptr)   | Points to top of the current stack frame  |
+-------------------+-------------------------------------------+
| EBP (base ptr)    | Points to base of the current stack frame |
+-------------------+-------------------------------------------+

Register Renaming

In modern CPUs, a logical register is mapped to a set of physical registers. The CPU can re-map logical registers to physical registers as needed at runtime, which is known as register renaming. This enables parallel and out-of-order execution of instructions.

Additional Resources

Aside from the in-article links, I recommend following-up with these:

Conclusion

I learned a TON researching this post.

If you made it this far, follow me on GitHub and Hugging Face. I like to build apps, curate interesting projects, and write about what I learn.

And if you’re curious, my favorite song on Random Access Memories is Giorgio by Moroder.