# Chapter 6: Resource Management & Processes
## 6.0 - Memory & Disk (The Hierarchy of Inertia)
<img src="assets/arch_resource_hunger.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Every piece of data in your database has weight, and moving that weight between physical layers is the central problem of database performance. We call this the **Hierarchy of Inertia**.
Postgres cannot make RAM faster or disk cheaper; it can only decide—aggressively and constantly—how to budget its limited energy to keep the most relevant data as close to the CPU as possible. Performance is not about making things "fast"; it is about minimizing the distance data must travel.
### What You'll Learn
- Why Postgres uses a **Process-Per-Connection** model and the fault isolation trade-off
- How the **Latency Stack** (L1 cache → RAM → SSD → HDD) governs every performance decision
- How `shared_buffers`, `work_mem`, and `maintenance_work_mem` control memory allocation
- Why **Autovacuum** is critical for preventing tuple bloat and XID wraparound
To understand these decisions, you need a clear picture of the **Resource Hierarchy**: the physical stack of processing cores, memory, and storage that every query touches.
### The Latency Stack
Start with the physics of the machine. Hardware latency is not linear. When a worker needs data, the physical distances are enormous, and each step down the hierarchy increases the inertia by orders of magnitude:
| Layer | Typical Latency | Notes |
| :--------------------- | :-------------- | :----------------------------------------- |
| **CPU L1 Cache** | ~1.5 ns | On-chip; 32–64 KB per core |
| **CPU L2 Cache** | ~7 ns | Still on-chip; 256 KB – 1 MB per core |
| **Main Memory (RAM)** | ~100 ns | DIMM; fast, but 100× slower than L1 |
| **NVMe SSD** | ~100 µs | ~1,000× slower than RAM |
| **SATA SSD** | ~500 µs | ~5,000× slower than RAM |
| **Spinning HDD** | ~10 ms | ~100,000× slower than RAM; rotational seek |
The numbers matter more than any analogy. A random HDD read is five orders of magnitude slower than fetching from RAM. The gap between "data is cached" and "data is not cached" isn't a minor inconvenience—it is a change in the physical state of the query.
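The miss tax is easy to underestimate. Here is a back-of-the-envelope sketch in Python, using the RAM and NVMe figures from the table above (the function name and numbers are illustrative, not from Postgres):

```python
# Illustrative latency figures from the table above (in nanoseconds).
RAM_NS = 100          # main-memory fetch
NVME_NS = 100_000     # NVMe SSD random read (~100 µs)

def effective_latency_ns(hit_ratio: float) -> float:
    """Average page-access cost for a given buffer hit ratio."""
    return hit_ratio * RAM_NS + (1 - hit_ratio) * NVME_NS

# Even a respectable-looking 99% hit ratio pays a heavy miss tax:
print(round(effective_latency_ns(0.99)))   # 1099 -> ~11x slower than pure RAM
print(round(effective_latency_ns(0.999)))  # 200
```

A single percentage point of misses dominates the average: at 99% hits, the 1% of disk reads accounts for over 90% of total access time.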
This is why the engine's entire design is a fight against inertia: **keep the working set in RAM**. But to use that RAM, the engine must coordinate how independent processes access it.
### The Process Model (One Core Per Connection)
Unlike many modern runtimes that use lightweight threads, Postgres uses a **Process-Per-Connection** model. This is not a historical accident—it is a deliberate architectural choice for **fault isolation**. If a backend process crashes (due to a buggy extension, an out-of-memory kill, or a corrupted page), it dies alone. The postmaster simply cleans up the shared memory it touched and moves on.
When you connect, the postmaster forks an entirely independent backend process. Every active query is handled end-to-end by this single worker, and a single process executes on only one CPU core at a time. **Throwing a 64-core machine at a single slow query will do absolutely nothing**—one core will hit 100% utilization, and the other 63 will sit idle.
If you have 100 active connections, you have 100 independent processes, each consuming OS memory and competing for CPU scheduler time.
### Parallel Query (Multi-Core Execution)
The only way a single query can use more than one core is if the **Query Planner** explicitly invokes **Parallel Query**.
When the Planner estimates that a table is large enough to justify the overhead, it asks the postmaster to launch **Background Workers** that divide the scan across multiple cores. These workers process their portions of the table simultaneously.
However, the workers must pass their partial results back to the original supervising process through a **Gather** node, using Inter-Process Communication (IPC). If too many workers are assigned, the IPC overhead outweighs the parallelism gains, and the query becomes *slower* than a single-process scan.
> [!TIP]
> **The limits**: The maximum number of parallel workers per query is controlled by `max_parallel_workers_per_gather` (default: 2). The cluster-wide pool is capped by `max_worker_processes` (default: 8), of which at most `max_parallel_workers` (default: 8) may serve parallel queries.
### What Postgres Controls
Postgres can't change the hardware. What it controls is how it allocates and uses the memory it's given:
- **`shared_buffers`**: The shared page cache (the **[[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Page Cache)|shared buffer pool]]**). All backend processes read from and write to this pool. Pages evicted from here must be re-fetched from disk.
- **`work_mem`**: Per-operation scratch space for sorts, hash tables, and bitmap builds. When an operation exceeds this budget, it **spills** sorted runs to temporary files on disk.
- **`maintenance_work_mem`**: A separate, higher budget reserved for `VACUUM`, `CREATE INDEX`, and `CLUSTER`. These bulk operations benefit from larger allocations to avoid repeated disk passes.
When a query outgrows its `work_mem` budget, Postgres doesn't fail—it spills. But spilling re-introduces the latency tax that the entire caching architecture exists to avoid.
---
## 6.1 - Process Family (The Process Family)
<img src="assets/arch_process_family.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
When you start Postgres, you aren't launching a single program. You're launching a **supervisor** that immediately spawns a small family of specialized worker processes. Each one has a narrow, well-defined job. Understanding who they are—and what they do between your queries—is essential to understanding every chapter that follows.
You can see the family yourself:
```sql
SELECT pid, backend_type
FROM pg_stat_activity
WHERE backend_type != 'client backend';
```
```
pid | backend_type
-----+------------------------------
68 | autovacuum launcher
65 | background writer
64 | checkpointer
69 | logical replication launcher
67 | walwriter
```
These five processes (plus the **Postmaster** itself, which doesn't appear in `pg_stat_activity`) are running at all times, even when no client is connected. They are the engine's autonomic nervous system.
### The Postmaster
The **Postmaster** is the parent of every other Postgres process. It is the first process that starts and the last to die.
Its responsibilities are simple and strict:
1. **Listen** for incoming connections on the configured port.
2. **Fork** a new **backend process** for each accepted connection.
3. **Supervise** all child processes. If any child crashes, the Postmaster terminates all remaining children, cleans up shared memory, and restarts the background workers from a clean state.
The Postmaster itself never executes queries. It never touches your data. It exists solely to manage the lifecycle of the processes that do.
> [!IMPORTANT]
> **Why this matters**: The Postmaster's crash-recovery behavior is the reason Postgres uses processes instead of threads. A crashed backend takes down only itself. The Postmaster detects the failure, resets shared memory, and re-spawns the background workers—all without a full server restart.
### Backend Processes (One Per Connection)
When you connect via `psql` or your application's connection pool, the Postmaster forks a dedicated **backend process** for your session. This process:
- Parses your SQL
- Plans the query
- Executes it against **[[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Page Cache)|Shared Buffers]]**
- Returns results to your client
Each backend is an independent OS process with its own memory space (for `work_mem` sorts, hash tables, etc.) but sharing the global `shared_buffers` pool. When you disconnect, the backend exits.
### The Background Workers
The remaining processes run continuously in the background, performing maintenance that keeps the engine healthy between—and during—your queries.
### Checkpointer
Periodically flushes all **dirty pages** from Shared Buffers to disk and writes a **Checkpoint Record** to the WAL. After a checkpoint, Postgres knows it can recover from that point forward without replaying older WAL segments. Controlled by `checkpoint_timeout` (default: 5 minutes) and `max_wal_size`.
→ *Introduced in [[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Durable Ledger)|5.1 The Durable Ledger]]*
### Background Writer
Proactively scans Shared Buffers for dirty pages and flushes small batches to disk in quiet intervals. Its goal is to ensure that when a backend needs a free buffer, one is already available—preventing the backend from stalling to perform an expensive synchronous write. Controlled by `bgwriter_delay` (default: 200ms).
→ *Introduced in [[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Durable Ledger)|5.1 The Durable Ledger]]*
### WAL Writer
Flushes the **WAL Buffers** (in-memory WAL records) to the WAL files on disk. When `synchronous_commit = on`, individual backends perform their own `fsync()` calls, but the WAL Writer handles the background flushing for asynchronous commits and group commit batching. Controlled by `wal_writer_delay` (default: 200ms).
→ *Introduced in [[Manuscript/05 - Durability & Transactions/5.1.1 - Commit Tuning (The Loose Handshake)|5.1.1 The Loose Handshake]]*
### Autovacuum Launcher
The launcher is a scheduler. It monitors table statistics and, when a table accumulates enough dead tuples, spawns **Autovacuum Worker** processes to reclaim space and update visibility maps. The number of concurrent workers is limited by `autovacuum_max_workers` (default: 3).
→ *Covered in depth in [[Manuscript/06 - Resource Management & Processes/6.4 - Vacuum & Freezing (The Housekeepers)|6.4 The Housekeepers]]*
### Logical Replication Launcher
If logical replication is configured, this process manages the lifecycle of **Logical Replication Workers**—the processes responsible for applying changes received from a publisher to the local database.
→ *Referenced in [[Manuscript/05 - Durability & Transactions/5.3 - Logical Replication (The Relay Chain)|5.3 The Relay Chain]]*
### Startup Process
The first child the Postmaster spawns on boot. It replays WAL records to bring the database to a consistent state after an unclean shutdown. On a **replica**, the startup process runs continuously, applying WAL streamed from the primary.
→ *Referenced in [[Manuscript/05 - Durability & Transactions/5.2 - Crash Recovery (The Recovery Parade)|5.2 The Recovery Parade]]*
### Stats Collector
Aggregates runtime statistics from all backends—row counts, block hits, sequential scans, index usage—and exposes them through the `pg_stat_*` views. This data drives the **Autovacuum Launcher's** threshold decisions and your monitoring dashboards. (The **Query Planner's** cost estimates come from `ANALYZE` data stored in `pg_statistic`, a separate mechanism.) Note that since PostgreSQL 15, these cumulative statistics live in shared memory, so the collector no longer runs as a separate process; on older versions it appears as its own process.
### Archiver
Active only when `archive_mode = on`. When a 16MB WAL segment is completed, the Archiver copies it to a designated backup location (local directory, S3, etc.). This is the foundation of **Point-in-Time Recovery (PITR)**—the ability to restore the database to any arbitrary moment in the past.
### Logger
Active when `logging_collector = on`. Captures all server log output (errors, warnings, slow queries) and writes it to log files, handling rotation and retention. Without it, log output goes to `stderr` only.
### WAL Sender / WAL Receiver
Used in **streaming replication**. The **WAL Sender** runs on the primary and streams WAL records over the network to replicas. The **WAL Receiver** runs on the replica and feeds the received records to the **Startup Process** for replay.
→ *Referenced in [[Manuscript/08 - Distributed Scaling & Clouds/8.0 - Distributed Storage (The Elephant in the Clouds)|Chapter 8 - Distributed Storage]]*
### The Full Picture
```
postmaster (PID 1)
├── startup (crash recovery / WAL replay)
├── checkpointer
├── background writer
├── walwriter
├── stats collector
├── autovacuum launcher
│ └── autovacuum worker (spawned on demand)
├── logical replication launcher
│ └── logical replication worker (spawned on demand)
├── archiver (if archive_mode = on)
├── logger (if logging_collector = on)
├── wal sender (if replicas are connected)
├── backend (connection 1)
├── backend (connection 2)
└── backend (connection N)
```
Every process in this tree shares the same `shared_buffers` segment. Every process writes to the same WAL. The Postmaster watches them all, and if any single one fails, it resets the shared state and rebuilds the family from scratch.
---
## 6.1.1 - Connection Mechanics (Long running connections and sessions)
<img src="assets/arch_protocol_handshake.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Now that the Postmaster has `fork()`ed a dedicated backend process for you (as we saw in the [[Manuscript/06 - Resource Management & Processes/6.1 - Process Family (The Process Family)|Process Family]]), how does your application actually talk to it?
If you are coming from the world of web development, you are used to the **Drive-Thru** model of communication (HTTP). You drive up, shout your order into a plastic box, pay at the window, and drive away. If you want a second burger ten minutes later, you start the entire process from the beginning. The restaurant doesn't remember who you are, and it certainly doesn't keep a table reserved for you.
Postgres does not work this way. Postgres is built on the **Table Service** model.
When you connect to Postgres via TCP/IP or a Unix domain socket, you aren't just sending a request; you are establishing a **Session**. You sit down, the waiter (your dedicated backend process) arrives at your table, and they stay with you for the duration of the meal. They remember what you ordered first, they know if you are currently in the middle of a transaction, and they maintain your specific preferences until you decide to leave.
### The Anatomy of a Message
The Postgres Wire Protocol (Version 3.0) is a stateful, message-based protocol. With the exception of the very first startup packet, every message in the protocol follows a strict, predictable physical layout:
1. **Type Byte**: A single ASCII character indicating what the message is (e.g., `Q` for Simple Query, `D` for DataRow, `C` for CommandComplete).
2. **Length**: A 32-bit big-endian integer giving the message length in bytes, counting the length field itself but not the type byte.
3. **Payload**: The actual data (the SQL string, the row values, or the error message).
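The layout above is simple enough to frame by hand. Here is a minimal Python sketch (function name is our own) that frames a Simple Query message—type byte `Q`, a length that counts itself and the payload but not the type byte, then a NUL-terminated SQL string:

```python
import struct

def frame_simple_query(sql: str) -> bytes:
    """Frame a Simple Query ('Q') wire message: 1 type byte,
    then a 32-bit big-endian length (self-inclusive, excludes
    the type byte), then the NUL-terminated SQL string."""
    payload = sql.encode("utf-8") + b"\x00"
    return b"Q" + struct.pack("!I", 4 + len(payload)) + payload

msg = frame_simple_query("SELECT 1")
assert msg[0:1] == b"Q"
# The declared length covers everything after the type byte:
assert struct.unpack("!I", msg[1:5])[0] == len(msg) - 1
```

Every other message in the protocol follows the same skeleton, which is why generic proxies and protocol libraries can parse the stream without understanding the SQL inside it.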
> [!NOTE]
> **The Universal Translator**
> The Postgres wire protocol is so stable and well-documented that it has become a de facto industry standard for database communication. Open-source projects like **`pgwire`** (a Rust library) allow developers to easily build custom databases, proxies, or analytical engines that "speak Postgres." If your custom system implements the protocol, any standard Postgres driver (`psycopg2`, `pgx`, `JDBC`) can connect to it natively.
### The Connection Lifecycle
Because the protocol is stateful, the initial connection is significantly more expensive than opening a stateless HTTP connection:
1. **The Startup Packet**: The client initiates the TCP connection and sends a `StartupMessage` containing the protocol version, the target database name, and the user name.
2. **The Authentication Challenge**: The server responds with an `AuthenticationRequest` (e.g., asking for an MD5 or SCRAM-SHA-256 password hash). The client responds with the hashed password.
3. **The Backend Setup**: Once authenticated, the backend sets up its private memory contexts and registers itself in the global `ProcArray`. (`work_mem` itself is not pre-allocated; it is a budget drawn on demand, per operation.)
4. **Parameter Status**: The backend sends down a flurry of `ParameterStatus` messages (telling the client the server's timezone, character encoding, and version).
5. **ReadyForQuery**: Finally, the backend sends the `ReadyForQuery` message. Only now is the engine ready to accept your first SQL command.
> [!IMPORTANT]
> **The Cost of Connection Setup**: Because setting up a session requires OS-level process forking and cryptographic handshakes, Postgres is optimized for **connection reuse**. Opening a fresh connection for every single query in a high-traffic web environment is the fastest way to paralyze your database. This is why we use **[[Manuscript/08 - Distributed Scaling & Clouds/8.4 - Connection Pooling (The Waiting Room)|Connection Pooling]]** to keep the "tables" occupied even when the clients change.
### The Simple vs. Extended Protocol
Once the connection setup is complete, you can send SQL to the backend in one of two ways.
#### The Simple Query Protocol (The Waiter's Pad)
You send a single `Query` message containing a raw SQL string (e.g., `SELECT * FROM animals`). The backend parses the string, plans it, executes it, sends the `DataRow` messages back, and finishes with a `CommandComplete` message. This is great for one-off commands, but if you send the exact same query 1,000 times, the backend wastes CPU cycles re-parsing and re-planning the exact same text every single time.
#### The Extended Query Protocol (The Pre-Printed Menu)
To avoid this redundant work, modern drivers use the Extended Protocol, which breaks the process into distinct steps:
- **Parse**: You send a query with placeholders (`SELECT * FROM animals WHERE species_id = $1`). The backend parses it, builds the execution plan, and saves it in memory as a **Prepared Statement**.
- **Bind**: You send the actual parameter values (e.g., `$1 = 4`). The backend binds those values to the prepared plan and creates a **Portal**.
- **Execute**: You tell the backend to execute the Portal and return the rows.
The architectural payoff is massive: you can perform one `Parse` step, then loop through a thousand `Bind` and `Execute` steps. Re-parsing is skipped entirely, and once the planner settles on a generic plan (after a few executions, by default), re-planning is skipped as well.
### The Power of Session State
In the "Table Service" model, the connection carries **State**. This persistent state is what enables the core features of a relational database:
- **Transactions**: You can start a transaction in one message (`BEGIN`), send five more messages to modify data, and then commit in a final message (`COMMIT`). The server keeps track of your "In-Progress" status.
- **Session Settings (GUCs)**: You can change settings such as `work_mem` or `statement_timeout` for just your current session (`SET statement_timeout = '5s'`) without affecting anyone else.
- **Temporary Tables**: You can create tables that exist only for the lifetime of your session and vanish the moment you disconnect.
By treating the connection as a long-running conversation rather than a series of isolated shouts, Postgres provides a level of contextual consistency that stateless protocols simply cannot match.
---
## 6.2 - Shared Buffers (The Page Cache)
<img src="assets/arch_shared_buffers_rack.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
To minimize disk I/O, Postgres reserves a shared region of RAM at boot called **Shared Buffers**. This is a global cache that holds the most frequently accessed 8KB **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Pages]]**.
Shared Buffers serves as a high-speed staging area between persistent storage and active queries. Because Postgres uses a process-per-connection architecture, this memory is allocated as shared memory (using System V or POSIX IPC). It is mapped into every backend process's address space, allowing workers to share data without duplication.
When a query needs a page, Postgres first performs a high-speed lookup in the **Buffer Mapping Hash Table**:
- **Buffer Hit**: The hash table points to a slot in RAM. The page is fetched in nanoseconds.
- **Buffer Miss**: The hash table returns nothing. The engine must issue a physical `pread()` to the OS, find a free slot in the buffer pool, and load the page from disk.
> [!NOTE]
> **In PostgreSQL Terms**
> * **Shared Buffers**: The globally shared memory pool caching 8KB pages.
> * **Buffer Hit**: Finding the requested page already in memory.
> * **Buffer Miss**: Forcing the engine to read the page from physical disk.
### The Partitioned Mapping Table
To avoid lock contention, the Buffer Mapping Hash Table is divided into separate partitions (typically 128). Each partition is protected by its own **Lightweight Lock (LWLock)**. This allows hundreds of backends to perform concurrent lookups without bottlenecking on a single global lock.
### Buffer Pinning
When a worker finds a page, it **Pins** the buffer—an atomic reference count. While pinned, a page is protected from eviction and structural relocation. If you observe a **`BufferPin`** wait event, an exclusive operation (like `VACUUM`) is waiting for current readers to finish.
### The Clock Sweep (Eviction)
Because buffers are finite, the engine uses the **Clock Sweep** algorithm to select candidate pages for eviction. A clock hand sweeps across buffer descriptors:
1. **usage_count > 0**: Decrement the count and move to the next descriptor.
2. **usage_count == 0**: If the buffer is unpinned and not "dirty," it is a candidate for eviction.
3. **Heat**: Every access increments the `usage_count` (up to a maximum of 5). This ensures "hot" pages remain resident while "cold" pages are eventually recycled.
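The sweep is mechanical enough to simulate. The toy Python below (our own sketch; real Postgres also coordinates with pins and flushes dirty victims before reuse) shows how a cold page loses the eviction race:

```python
from dataclasses import dataclass

@dataclass
class Buffer:
    page: str
    usage_count: int = 0   # capped at 5 in Postgres
    pinned: bool = False

def clock_sweep(buffers: list[Buffer], hand: int) -> tuple[int, int]:
    """Advance the clock hand until an unpinned buffer with
    usage_count == 0 is found, decrementing counts on the way.
    Returns (victim_index, new_hand_position)."""
    while True:
        buf = buffers[hand]
        if not buf.pinned:
            if buf.usage_count == 0:
                return hand, (hand + 1) % len(buffers)
            buf.usage_count -= 1   # cool the page by one notch
        hand = (hand + 1) % len(buffers)

pool = [Buffer("hot", usage_count=5),
        Buffer("warm", usage_count=1),
        Buffer("cold", usage_count=0)]
victim, _ = clock_sweep(pool, 0)
print(pool[victim].page)  # cold
```

Note that "hot" merely loses one notch of heat per full revolution; it takes five uncontested sweeps before a maximally hot page becomes an eviction candidate.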
```mermaid
stateDiagram-v2
[*] --> Clean : Buffer Miss (Read)
Clean --> Pinned : Access
Pinned --> Clean : Release Pin
Clean --> Dirty : UPDATE / INSERT
Dirty --> Flushing : Checkpoint / BgWriter
Flushing --> Clean : Disk Sync (fsync)
Clean --> Candidate : usage_count == 0
Candidate --> [*] : Evicted
Dirty --> Candidate : usage_count == 0
Candidate --> Flushing : Must Flush Before Evict
```
> [!NOTE]
> **Ring Buffers**: For large sequential scans or `VACUUM` operations, Postgres uses a **Buffer Access Strategy**. Instead of displacing the entire cache, the process uses a small private **Ring Buffer** (typically 32 pages). This allows the scan to recycle its own slots without evicting hot data used by other concurrent queries.
You can inspect the size of the buffer pool using the `SHOW` command:
```sql
SHOW shared_buffers;
```
The **`pg_buffercache`** extension provides visibility into the buffer pool, allowing you to see exactly which pages are currently resident.
### 🧪 Lab Challenge: Inspecting the Buffer Cache
**The Request**: "The database is slow, and we suspect a single massive table is evicting everything else from RAM. Prove which table is hogging the buffer pool."
#### The Investigation
To see the internal state of the Shared Buffers, we use the `pg_buffercache` extension. This provides a detailed view of the buffer pool's contents.
```sql
-- Creating the thermal camera (if not already present)
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
-- Which tables are taking up the most space in Shared Buffers?
SELECT
    c.relname,
    count(*) AS buffers,
    round(count(*) * 8192 / 1024.0 / 1024.0, 2) AS size_mb
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relkind IN ('r', 'i')
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```
#### The Diagnosis
The `supply_deliveries` table is consuming the majority of the shared memory. Because this table is larger than the `shared_buffers` setting, it is evicting smaller, high-frequency tables. This forces other queries to wait for disk I/O.
#### The Lazy Fix
Identify the query performing a sequential scan on the massive table and add an index to reduce its footprint, or increase the size of the cache.
```sql
-- Checking the current size of the buffer pool
SHOW shared_buffers;
```
#### The Reward
By reducing the I/O footprint of the large table, you preserve space in memory for frequently accessed data.
> [!TIP]
> **Double Buffering**: Postgres relies on the **OS Kernel Cache** to handle writes and read-ahead. This means data often exists in both `shared_buffers` and the OS cache. While this seems redundant, it allows Postgres to manage "hot" data explicitly while letting the kernel handle broader I/O optimizations.
---
## 6.3 - Work Mem (Private Working Memory)
<img src="assets/arch_private_desk_spill.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
While **Shared Buffers** are communal, many operations require private, high-speed scratch space. When a backend process performs a **[[Operations/ResultSet/Sort|Sort]]** or builds a **[[Operations/ResultSet/HashJoin|Hash Table]]**, it cannot use the shared buffer pool without displacing resident data and incurring high lock contention.
Instead, every individual operation is dynamically allocated memory controlled by the **`work_mem`** setting.
### The Tuplesort Mechanic
When Postgres executes an `ORDER BY`, it initiates a **Tuplesort**. This operation attempts to organize the data entirely within the bounds of `work_mem`.
- **In-Memory Sort**: If the dataset fits within `work_mem`, Postgres performs an in-memory Quicksort.
- **The Disk Spill**: If the data exceeds `work_mem`, the Tuplesort transitions to an **External Merge Sort**. The engine breaks the data into chunks, writes them to temporary disk files in `base/pgsql_tmp/`, and merges them back together.
Spilling to disk is a performance bottleneck. It forces Postgres to move from nanosecond-scale RAM access to microsecond-scale SSD access. A small spill can turn a sub-second query into a multi-second operation.
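The external merge has a simple shape: cut the input into sorted runs that each fit in memory, then k-way merge them. A toy Python sketch (sorted lists stand in for the temporary files in `base/pgsql_tmp/`):

```python
import heapq

def external_sort(rows, work_mem_rows):
    """Sort `rows` using at most `work_mem_rows` rows in memory
    at once. Oversized inputs are cut into sorted runs (the
    simulated spill), then k-way merged -- the shape of an
    external merge sort."""
    if len(rows) <= work_mem_rows:
        return sorted(rows)            # the in-memory quicksort path
    runs = [sorted(rows[i:i + work_mem_rows])   # each run = one temp file
            for i in range(0, len(rows), work_mem_rows)]
    return list(heapq.merge(*runs))    # final k-way merge pass

print(external_sort([5, 3, 8, 1, 9, 2, 7], work_mem_rows=3))
# [1, 2, 3, 5, 7, 8, 9]
```

The result is identical either way; what changes is that every spilled row is written to disk and read back at least once, which is exactly the `temp read`/`temp write` traffic you will see in `EXPLAIN (ANALYZE, BUFFERS)`.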
### The Hash Join Batches
Joins also rely on this private memory. When executing a Hash Join, the engine reads the smaller table and builds an **In-Memory Hash Table** for O(1) lookups.
If the hash table exceeds `work_mem`, the planner splits the operation into **Batches** and dumps the overflow to temporary files. This multi-pass operation dramatically increases I/O latency.
### The Multiplication Hazard
Unlike `shared_buffers`, which is pre-allocated at startup, `work_mem` is allocated **per operation.**
> [!CAUTION]
> **The Memory Multiplication Hazard**
> A single query plan can contain multiple **Sort** or **Hash Join** nodes, each allowed to consume the full `work_mem` allocation. A query with three such nodes and 100 concurrent connections could demand 300 times the `work_mem` value. Setting `work_mem` to 64MB under high concurrency could easily exhaust available RAM and trigger the **OOM Killer**.
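The arithmetic behind the caution above is worth doing once, with assumed numbers:

```python
# Hypothetical workload: three memory-hungry nodes per plan,
# 100 concurrent connections, work_mem set to 64MB.
work_mem_mb = 64
nodes_per_query = 3        # e.g. two sorts and one hash join
active_connections = 100

worst_case_mb = work_mem_mb * nodes_per_query * active_connections
print(worst_case_mb / 1024)  # 18.75 -> nearly 19 GB of private memory
```

A setting that looks harmless per query can, in the worst case, demand more RAM than the machine has, which is precisely when the OOM Killer starts choosing victims.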
Administrative tasks like `CREATE INDEX` or `VACUUM` use a separate pool called **`maintenance_work_mem`**.
Because these operations are executed by system workers or administrators rather than concurrent web users, this value can be set significantly higher than `work_mem` (e.g., 1GB or 2GB). A large allocation allows `VACUUM` to complete in a single pass without intermediate disk writes.
If a query is slow due to a large sort, you can temporarily increase the memory allocation for that specific session:
```sql
SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
```
If you see **`Sort Method: external merge`** or **`Hash Batches: > 1`** in the output, the allocation was too small and the engine was forced to spill to disk.
### 🧪 Lab Challenge: The Memory Spill (External Merge Sort)
**The Request**: "Our nightly supply report has suddenly become three orders of magnitude slower. We haven't changed the SQL."
#### The Investigation
Check the plan with `BUFFERS` to see where the data is being processed:
```sql
-- Force a small work_mem to simulate the bottleneck
SET work_mem = '64kB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM supply_deliveries ORDER BY quantity_kg;
```
**Result**:
```text
Sort Method: external merge Disk: 1240kB -- ⚠️ The Disk Spill!
Buffers: shared hit=8, temp read=155, temp write=155
```
#### The Diagnosis
The `work_mem` allocation was too small to hold the dataset. The engine was forced to "spill" to temporary files. The `temp read` and `temp write` lines in the output are physical evidence of this I/O.
#### The Lazy Fix
Increase the `work_mem` for this specific session to ensure the sort stays in memory.
```sql
SET work_mem = '4MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM supply_deliveries ORDER BY quantity_kg;
```
#### The Reward
**Result**:
```text
Sort Method: quicksort Memory: 2450kB
Buffers: shared hit=8
```
The `temp` buffers vanish, and the sort method changes to `quicksort`. The query time drops from seconds to milliseconds because the data never left RAM.
> [!WARNING]
> **Don't set `work_mem` globally to 1GB.** While it fixes the spill, it creates an OOM risk. Always set it at the session or query level for specific, heavy operations.
---
## 6.4 - Vacuum & Freezing (The Housekeepers)
<img src="assets/arch_maintenance_crew.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
In Postgres, data is not immediately deleted. Every `UPDATE` and `DELETE` leaves dead row versions that remain in the heap, occupying space until they are reclaimed. If left unmanaged, they accumulate into **[[Manuscript/06 - Resource Management & Processes/6.5 - Tuple Bloat (Garbage Collection)|Bloat]]**. This forces the engine to scan more pages to retrieve live data.
To manage this, Postgres uses **Autovacuum**: a background worker responsible for reclaiming MVCC "dead space" and freezing old rows before the 32-bit transaction counter reaches its limit.
> [!NOTE]
> While other workers like the **Background Writer** and **Checkpointer** manage the flow of data between RAM and Disk, **Autovacuum** is the only process that physically reclaims the space inside those files.
> [!NOTE]
> **In PostgreSQL Terms**
> * **Vacuum**: The background process that reclaims physical space occupied by dead tuples.
> * **Free Space Map (FSM)**: The data structure tracking available space inside pages.
> * **Visibility Map (VM)**: The data structure tracking which pages contain only live tuples.
### The Cleanup Crew: Autovacuum
Autovacuum relies on two physical maps to optimize its sweep:
- **Visibility Map (VM)**: A bitset indicating page status using two bits:
- **Bit 0 (All-Visible)**: Every tuple is visible to everyone; enables **Index-Only Scans**.
- **Bit 1 (All-Frozen)**: Every tuple is permanent history; allows vacuum to skip the page entirely during anti-wraparound runs.
- **Free Space Map (FSM)**: A record of exactly how much reusable space remains on each page, helping future inserts find a home.
#### Shared Lock Coordination (`pg_multixact`)
A tuple header has limited space for transaction metadata. When multiple transactions place shared locks (`SELECT FOR SHARE`) on the same row, Postgres cannot store all their IDs in the tuple header.
Instead, the engine uses **`pg_multixact`**. The tuple's `xmax` field is updated with a "MultiXact ID" that points to a list of lockers in an external file.
> [!WARNING]
> **The Maintenance Tax**: While this enables high-concurrency shared locking, it adds a maintenance burden. Autovacuum must also clean up old `pg_multixact` segments. Heavy shared-row locking can significantly increase autovacuum overhead.
### The Two-Phase Vacuum
Vacuuming a table is a two-phase mechanical operation:
1. **Phase 1: Heap Scan**. Autovacuum scans the heap and collects the physical addresses (**TIDs**) of dead tuples into a buffer.
2. **Phase 2: Index Cleanup**. The engine visits every index and removes pointers referencing the collected TIDs.
Only after the indexes are cleared can the heap space be marked as reusable in the Free Space Map (FSM).
**The rule of thumb:** Vacuum scans the heap first to find dead tuples, then sweeps the indexes to remove pointers to them, then returns to the heap to mark the reclaimed space reusable, and finally updates the Free Space Map so new inserts can find it.
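The ordering constraint—indexes before heap reuse—can be seen in a toy Python sketch (heap, indexes, and FSM modeled as plain dicts and sets; all names are our own):

```python
def vacuum(heap, indexes, fsm):
    """Toy two-phase vacuum.
    heap:    {tid: (row, is_dead)}
    indexes: list of {key: tid} mappings
    fsm:     set of reusable tids (the Free Space Map)"""
    # Phase 1: scan the heap and collect dead tuple addresses (TIDs)
    dead_tids = {tid for tid, (_, is_dead) in heap.items() if is_dead}
    # Phase 2: sweep every index, removing pointers to dead TIDs first,
    # so no index entry can ever point at recycled space
    for index in indexes:
        for key in [k for k, tid in index.items() if tid in dead_tids]:
            del index[key]
    # Only now is the heap space recorded as reusable in the FSM
    for tid in dead_tids:
        del heap[tid]
        fsm.add(tid)
    return dead_tids

heap = {1: ("alpaca", False), 2: ("beaver", True)}
indexes = [{"alpaca": 1, "beaver": 2}]
fsm = set()
vacuum(heap, indexes, fsm)
print(fsm)  # {2}
```

Inverting the order would be catastrophic: if the heap slot were recycled before the index sweep, a surviving index pointer could direct a reader to a freshly inserted, unrelated row.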
> [!IMPORTANT]
> **The Maintenance Budget**: The size of the TID buffer is controlled by `maintenance_work_mem`. If the buffer is too small, autovacuum must perform multiple index passes, which increases the I/O cost and duration of the vacuum.
```sql
-- Is autovacuum keeping up?
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```
A high `n_dead_tup` count means autovacuum is falling behind. By default, it triggers when dead tuples exceed **20%** of live tuples (`autovacuum_vacuum_scale_factor = 0.20`). For high-write tables you typically want to lower that threshold:
```sql
-- Make autovacuum more aggressive on a busy table
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- trigger at 1% dead tuples
autovacuum_vacuum_cost_limit = 1000 -- and give it a bigger I/O budget
);
```
### From Cleanup to Conservation: The Historical Horizon
Reclaiming dead tuples is only part of autovacuum's responsibility. While "cleaning up waste" keeps the table lean, the engine has a more profound duty: ensuring that old facts remain visible even as the system's transaction counter resets. This is where maintenance transitions from housekeeping to **historical integrity**.
### The Freeze Ritual
> [!IMPORTANT]
> **The Existential Purpose of Vacuum**: `VACUUM` is not just for reclaiming disk space. Its primary purpose is to prevent **[[Manuscript/05 - Durability & Transactions/5.6 - Transaction ID Wraparound (The Infinite Calendar)|XID Wraparound]]**. If vacuuming stops, the database will eventually hit the 2.1 billion transaction limit and shut down to prevent data corruption.
Every row is stamped with a 32-bit transaction ID (XID). To prevent wraparound, autovacuum performs a **Freeze** on rows old enough to be permanent history. The tuple's `xmin` is logically replaced with a special `FrozenTransactionId` marker (modern versions set a flag bit in the tuple header rather than overwriting `xmin`), indicating the row is visible to all future transactions.
```sql
-- How close are our tables to the XID wraparound danger zone?
SELECT relname, age(relfrozenxid) AS xid_age, relpages
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
```
The older the `xid_age`, the more urgently the table needs a vacuum pass to freeze its rows. The danger threshold is ~**2 billion transactions**. You can also force a full freeze on any table at any time:
```sql
-- Manual emergency freeze for a specific table
VACUUM FREEZE orders;
```
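Per-table age is the fine-grained view; for a cluster-wide early warning, you can also check each database's oldest unfrozen XID:

```sql
-- Wraparound pressure per database
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```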
> [!WARNING]
> Never disable `autovacuum`. It is a critical system process. Disabling it leads to XID wraparound panics, uncontrolled table growth, and eventual cluster shutdown.
### Throttling and I/O Spikes
To prevent maintenance from overwhelming the system, autovacuum is throttled by a cost-based credit system (`autovacuum_vacuum_cost_limit`). This ensures background cleanup does not starve user queries for I/O.
However, if the Checkpointer is too aggressive, it can flood the I/O subsystem with dirty pages. Spread these writes using `checkpoint_completion_target` (typically 0.9) to minimize latency spikes during checkpoints. Without these background processes, Postgres would eventually succumb to bloat and XID exhaustion.
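Both knobs are visible at runtime. The values below are illustrative, not settings to copy blindly:

```sql
-- Inspect the vacuum throttle and the checkpoint spread
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_cost_limit',
               'autovacuum_vacuum_cost_delay',
               'checkpoint_completion_target');

-- Spread checkpoint writes across 90% of the checkpoint interval
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();
```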
### 🧪 Lab Challenge: The Maintenance Sweep (Reclaiming Space)
**The Request**: "We just deleted 100,000 old records, but the table size on disk didn't shrink. Prove that autovacuum is doing its job."
#### The Investigation
Check the current state of dead tuples on the table:
```sql
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'supply_deliveries';
```
**Result**:
```text
n_live_tup | n_dead_tup | last_autovacuum
-------------+------------+-----------------
500,000 | 100,000 | 2024-03-25...
```
#### The Diagnosis
The `n_dead_tup` counter confirms that 100,000 versions are physically present but logically dead. Postgres hasn't reclaimed the space yet, and until a vacuum pass records it in the Free Space Map, new inserts cannot reuse those slots; the engine must extend the file instead.
#### The Lazy Fix
You can wait for **Autovacuum** to trigger automatically (based on the `scale_factor`), or you can manually summon it for an immediate sweep:
```sql
VACUUM ANALYZE supply_deliveries;
```
#### The Reward
Check the stats again:
```text
n_live_tup | n_dead_tup | last_autovacuum
-------------+------------+-----------------
500,000 | 0 | 2024-03-25...
```
The dead tuples are gone. The space is now marked as "Free" in the **Free Space Map (FSM)**, and future inserts can reuse it instead of forcing the engine to extend the data file on disk.
> [!IMPORTANT]
> **The Disk Myth**: `VACUUM` does NOT return space to the operating system. It marks the space as reusable *within* the existing file. If you need to shrink the physical file size, you need `VACUUM FULL` (which locks the table exclusively).
---
## 6.5 - Tuple Bloat (Garbage Collection)
<img src="assets/arch_vacuum_bloat.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
When updates and deletes happen faster than they can be cleaned, the database enters a state of **Bloat**.
Even though your "live" data remains the same size, the physical data file on disk begins to swell. This is because the "dead" records (the ghost tuples left by MVCC) are taking up space that hasn't been recycled yet.
> [!IMPORTANT]
> **The Survival Clause**
> Garbage Collection is not an "optimization" you can turn off to save CPU. In an MVCC engine like Postgres, space is not automatically reclaimed when a row is deleted. If you disable the cleanup processes, your storage will grow until the disk is consumed, and your indexes will swell until every query stalls. In Postgres, maintenance is a requirement for existence.
---
### The Mechanics: Why Bloat Happens
Under normal circumstances, **Autovacuum** marks dead space as "Free." Future inserts then find these empty holes and fill them, keeping the file size stable.
However, two things can break this cycle:
1. **Long-Running Transactions**: If a single transaction stays open for hours, Postgres *cannot* clean any rows that were deleted after that transaction started. The engine must keep them around in case that long-running process needs to see them.
2. **High-Frequency Updates**: If you update 10,000 rows a second, but your vacuum is only tuned to clean once every minute, the "dead" data will accumulate faster than it can be removed.
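The first failure mode is easy to catch: any session in `pg_stat_activity` with an old `backend_xmin` is pinning the cleanup horizon for the whole cluster. A diagnostic sketch:

```sql
-- Which sessions are holding back vacuum's cleanup horizon?
SELECT pid, state,
       age(backend_xmin)  AS horizon_age,
       now() - xact_start AS xact_duration,
       left(query, 40)    AS query_preview
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
```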
---
### Table Bloat vs. Index Bloat
Bloat doesn't just happen in your tables; it happens in your indexes, too.
- **Table Bloat**: Is generally manageable. Once space is marked free, it will be reused for new rows.
- **Index Bloat**: Is more structural. Because a B-Tree index is a sorted map, it cannot easily "squeeze" out empty space without a massive re-shuffling of the tree. A bloated index makes every search slower because it forces the engine to read more pages.
> [!WARNING]
> **The Vacuum Trap**: Standard `VACUUM` does **not** return disk space to the Operating System. It only makes the space available for future *Postgres* data. To truly shrink a 100GB file back down to 10GB, you must perform a structural reclamation.
---
### Reclaiming the Space
If bloat becomes catastrophic (e.g., your disk is 95% full), you have three options to physically shrink the files:
| Method | Impact | Best For... |
| :--- | :--- | :--- |
| **`REINDEX`** | Rebuilds the index from scratch. | Fixing slow, bloated indexes. |
| **`VACUUM FULL`** | Rewrites the entire table into a new file. | **Emergency only**. Locks the table entirely. |
| **`pg_repack`** | Rebuilds the table in the background. | Professional production environments. |
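Hedged sketches of the first two, reusing the lab's `supply_deliveries` table (the index name here is hypothetical). `REINDEX ... CONCURRENTLY` (Postgres 12+) avoids blocking writes; `VACUUM FULL` holds an `ACCESS EXCLUSIVE` lock for its entire duration:

```sql
-- Rebuild one bloated index without blocking writes (Postgres 12+)
REINDEX INDEX CONCURRENTLY supply_deliveries_created_at_idx;

-- Emergency only: rewrites the whole table under an exclusive lock
VACUUM FULL supply_deliveries;
```

`pg_repack` is a separate extension driven from the command line rather than plain SQL, which is what lets it rebuild the table while normal traffic continues.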
---
### Diagnostics: The Waste Audit
To see if your cleanup crew is falling behind, you can check the ratio of live to dead tuples:
```sql
-- Check the 'dead tuple' count for a specific table
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'supply_deliveries';
```
**If `n_dead_tup` is consistently high, your vacuum is falling behind.**
---
### The Fix: Tuning the Cleanup Thresholds
By default, Postgres only starts cleaning when **20%** of a table is dead. For a billion-row table, that is far too late. You can tell Postgres to be more aggressive on a per-table basis:
```sql
-- Trigger vacuum at 1% dead rows instead of 20%
ALTER TABLE supply_deliveries SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_cost_limit = 1000
);
```
> [!TIP]
> **The Sleeping Janitor**: If you see the wait event `Timeout:VacuumDelay` in your tracer, it means Autovacuum is intentionally sleeping to avoid consuming too much I/O. If bloat is rising, you need to increase the `cost_limit` to let it work longer before taking a break.
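You can also watch a vacuum (manual or autovacuum) in flight via `pg_stat_progress_vacuum`, including which phase it is in and how many index passes it has made:

```sql
-- What is vacuum doing right now?
SELECT p.pid, c.relname, p.phase,
       p.heap_blks_scanned, p.heap_blks_total,
       p.index_vacuum_count
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```

A climbing `index_vacuum_count` within a single run is the telltale sign that the TID buffer (`maintenance_work_mem`) is too small.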
---
### Recap: Managing the Overflow
> [!NOTE]
> - **Bloat is caused by MVCC history** that hasn't been recycled.
> - **Long-running transactions** are the #1 cause of "stuck" bloat.
> - **Index bloat** is harder to fix than table bloat.
> - **Tune per-table** for busy workloads; don't wait for the 20% default.
---
## 6.6 - RAM, CPU & Disk (The Physical Machine)
<img src="assets/arch_hardware_evolution.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
The previous sections described how Postgres *manages* its memory budgets. This section describes what actually happens in the hardware and OS kernel when those budgets are exercised — because Postgres's decisions only make sense once you can picture the machinery below the abstraction.
### RAM: Two Caches, Not One
There is a common misconception about `shared_buffers`: that it *is* the Postgres memory, and that everything else goes to disk. The reality is more layered.
When Postgres reads a page from disk, it calls `pread()` — a standard POSIX syscall. The kernel intercepts this call and routes it through the **Linux Page Cache** (also called the **Buffer Cache**), a kernel-managed pool of recently accessed file pages. The kernel doesn't know anything about Postgres tables or 8KB pages; it just sees file offsets.
So when a page arrives in Postgres, it has actually passed through *two* independent caches:
1. **The OS Page Cache**: Kernel-managed. Contains file pages as the kernel understands them. Lives in ordinary kernel address space.
2. **`shared_buffers`**: Postgres-managed. Contains those same pages mapped into Postgres's shared memory segment, indexed by the Buffer Mapping Hash Table.
This is why the standard guidance caps `shared_buffers` at roughly 25% of RAM. The OS Page Cache is not idle; it is actively caching the same files below Postgres's sight line. Giving Postgres 90% of RAM doesn't eliminate that second cache — it just starves it. On workloads with large sequential scans or heavy checkpoint I/O, the OS Page Cache does meaningful, non-redundant work.
> [!NOTE]
> **The `O_DIRECT` exception**: Some systems configure Postgres to use `O_DIRECT` for WAL writes, bypassing the OS Page Cache entirely for those files. This avoids double-buffering the WAL. It is not the default.
### How `shared_buffers` is Allocated
At startup, Postgres allocates `shared_buffers` as a single large **shared memory segment** using `shmget()` or `mmap()` (depending on the OS and configuration). All backend processes attach to this same segment. This means every backend reading or writing a page in the buffer pool is accessing the exact same physical memory — no copying, no per-process duplication.
This also means that when you set `shared_buffers = 8GB`, Postgres reserves the full 8GB segment at startup rather than growing it on demand. With `huge_pages = on`, the segment is pinned in physical DRAM and the OS cannot page it out; without huge pages, a kernel under memory pressure can still swap parts of it, which is one reason a swapping database server performs so badly.
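You can inspect what is actually resident in that segment with the `pg_buffercache` extension (shipped in contrib); a sketch of the standard per-relation breakdown:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Which relations occupy the most space in shared_buffers?
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192::bigint) AS cached
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```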
### CPU: When I/O Is Not the Bottleneck
Once a page is in `shared_buffers`, the query engine starts doing CPU work:
- **Tuple evaluation**: Walking the rows in a page and evaluating WHERE clause predicates.
- **Hash computation**: Building hash tables for Hash Joins and Hash Aggregations.
- **Sort passes**: Quicksort in-memory for Sort nodes.
- **Expression evaluation**: Running functions, type coercions, and operator calls.
A query whose working set fits entirely in `shared_buffers` is CPU-bound, not I/O-bound. Throwing more RAM at a CPU-bound query won't help. Adding an appropriate index to eliminate the tuple evaluation loop will.
This distinction — **I/O-bound vs. CPU-bound** — is the central diagnostic axis of Chapter 7. Wait events tell you which world your query is living in.
> [!TIP]
> On modern multi-core hardware, Postgres can execute a single query using **parallel workers** (see `max_parallel_workers_per_gather`). Each worker is a separate OS process with its own CPU affinity. Parallel queries help on CPU-bound aggregations and sorts; they are irrelevant if the bottleneck is a single slow disk read.
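A quick way to see whether the planner will parallelize a CPU-bound aggregation. The plan shape depends on table size and settings, so treat this as a sketch using the lab table:

```sql
SHOW max_parallel_workers_per_gather;

EXPLAIN (COSTS OFF)
SELECT count(*) FROM supply_deliveries;
-- A parallel plan shows a "Gather" node with "Workers Planned: N"
```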
### Disk: A History of I/O Assumptions
When a page is not in either cache, the database engine must fetch it from the block device. The kernel translates the `pread()` call into an I/O request submitted to the device driver. What happens next has changed dramatically over the history of database infrastructure — and will continue to change.
Postgres's defaults encode a specific set of assumptions about what "storage" costs. Understanding those assumptions, and when they were made, is the key to tuning correctly for hardware you are actually running.
### Era 1: Spinning HDD (The Physical Constraint)
A spinning hard drive reads data by moving a physical read head across a rotating magnetic platter. The mechanical latency is real and unavoidable: a **rotational seek** takes 4–10 ms, bounded by the physics of moving steel.
The consequences for a database are severe. Reading 10,000 random 8KB pages from an HDD:
```
10,000 seeks × 8 ms average = 80 seconds
```
Sequential reads are dramatically cheaper — the head doesn't need to move far. This asymmetry between random and sequential I/O is the foundational assumption baked into Postgres's query planner.
`random_page_cost` defaults to **4.0**. That number was chosen for spinning disk. It tells the planner: "a random read is four times more expensive than a sequential read." On spinning disk, this is conservative — the real ratio is often 50–100×.
### Era 2: SATA SSD (The First Inversion)
SATA SSDs have no moving parts. Random access latency drops to ~500 µs — roughly 10–20× faster than a seeking HDD. But SATA SSDs still use the SATA interface, which was designed for spinning disks. The interface itself is a bottleneck: SATA caps at ~550 MB/s throughput and can't issue many parallel I/O operations concurrently.
On a SATA SSD, the planning assumptions begin to break down. The `random_page_cost = 4.0` default overstates the penalty for random reads. Many workloads benefit from lowering it to ~2.0.
### Era 3: NVMe (The Interface Disappears)
NVMe SSDs connect over PCIe lanes directly to the CPU. They have no mechanical seek, no SATA bottleneck, and can service tens of thousands of I/O operations in parallel. Random access latency is ~100 µs or lower.
At this point, the distinction between "random" and "sequential" reads has nearly vanished at the hardware level. The `random_page_cost = 4.0` Postgres default is now wrong by a factor of several.
```sql
-- For NVMe: close the planning gap between random and sequential
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```
> [!WARNING]
> Don't go all the way to `random_page_cost = 1.0` (equal to `seq_page_cost`). Even NVMe has a real cost relative to L2 cache (100 µs vs. 7 ns — a 14,000× difference). The planner needs *some* signal that disk is not free.
### Era 4 and Beyond: Persistent Memory (The Category Blurs)
**Persistent Memory (PMEM / Intel's Optane, since discontinued)** sits in DIMM slots alongside DRAM, but survives power loss. Latency is ~300 ns — faster than NVMe by roughly 300×, and only ~3× slower than DRAM. This completely dissolves the traditional RAM/disk boundary.
Storage-class memory architectures like this have already influenced database design. **Neon** and **Amazon Aurora** separate compute from storage entirely, replacing the local disk with network-attached page stores. WAL is shipped over the network instead of written to local disk.
The lesson is not a specific set of numbers to memorize. It is this:
> **Postgres's defaults are calibrated to an era. The hardware has already changed twice since those defaults were written. It will change again.**
Every time you deploy Postgres on a new infrastructure — a new instance type, a new cloud region, a new disk tier — the correct behavior is to benchmark actual I/O latency and re-calibrate `random_page_cost`, `seq_page_cost`, and `effective_cache_size` accordingly. The database engine has no way to detect that it's running on NVMe instead of a 2003-era SCSI disk. That judgment is yours.
```sql
-- effective_cache_size tells the planner how much OS Page Cache is available.
-- It does not allocate memory; it only influences index-vs-seqscan decisions.
SHOW effective_cache_size;
-- A reasonable starting point: about 75% of total system RAM
ALTER SYSTEM SET effective_cache_size = '24GB'; -- for a 32GB server
SELECT pg_reload_conf();
```
> [!NOTE]
> **The Analogy Limit**: A shared buffer pool metaphor accurately describes the *role* of `shared_buffers`, but it implies a static, fixed capacity. Real storage hardware is not static. The speed of retrieval and the cost of physical I/O are parameters of the hardware generation you happen to be running — not constants of the universe.
### Observing Hardware Pressure: `pg_stat_io`
Postgres 16 introduced `pg_stat_io`, a view that exposes per-backend, per-context I/O counters. This is the most direct window into whether your hardware is actually being exercised:
```sql
-- Which backend types are hitting disk the most?
SELECT
backend_type,
object,
context,
reads,
hits,
writes,
evictions,
ROUND(hits::numeric / NULLIF(reads + hits, 0) * 100, 1) AS hit_rate_pct
FROM pg_stat_io
WHERE reads + hits > 0
ORDER BY reads DESC;
```
A `hit_rate_pct` near 100% means the shared buffer pool is doing its job and reads are being served from RAM. If `hit_rate_pct` drops below ~95% on a busy OLTP system, the buffer pool is too small for the working set and physical disk I/O is becoming a significant cost.
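On versions before 16 (no `pg_stat_io`), a coarser cluster-wide hit rate is available from `pg_stat_database`:

```sql
-- Buffer cache hit rate per database (pre-16 fallback)
SELECT datname,
       ROUND(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0) * 100, 1)
         AS hit_rate_pct
FROM pg_stat_database
WHERE blks_hit + blks_read > 0
ORDER BY hit_rate_pct;
```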
```sql
-- Reset to get a fresh baseline after config changes
SELECT pg_stat_reset_shared('io');
```
---
## 6.7 - Summary (Resource Management & Processes)
> The engine is a machine designed to hide the profound latency of physical storage. It fiercely guards its working set in memory, spilling to disk only when starved, bound entirely by the limits of the hardware beneath it.
<div style="page-break-after: always;"></div>