# Chapter 6: Resource Management & Processes

## The Hunger of Resources (CPU, Memory, & Disk)

![[assets/arch_resource_hunger.png|450]]

Query execution has a hard physical cost. Postgres can't make RAM faster, disk cheaper, or cores multiply. It can only decide—aggressively and constantly—which data deserves to live in which layer, and how many workers to assign to a problem. To understand these decisions, you need a clear picture of the **Resource Hierarchy**: the physical stack of processing cores, memory, and storage that every query touches.

### The Process Model (One Core Per Connection)

Unlike many modern runtimes that use lightweight threads, Postgres uses a **Process-Per-Connection** model. This is not a historical accident—it is a deliberate architectural choice for **fault isolation**. If a backend process crashes (due to a buggy extension, an out-of-memory kill, or a corrupted page), the damage is contained: the postmaster detects the failure, resets shared memory, and restarts its children from a clean state. In a threaded model, a single corrupted thread can silently poison shared memory and bring down the entire server.

When you connect, the postmaster forks an entirely independent backend process. Every active query is handled end-to-end by this single worker, and a single process can only execute on a single physical CPU core at a time. **Throwing a 64-core machine at a single slow query will do absolutely nothing**—one core will hit 100% utilization, and the other 63 will sit idle. If you have 100 active connections, you have 100 independent processes, each consuming OS memory and competing for CPU scheduler time.

### Parallel Query (Multi-Core Execution)

The only way a single query can use more than one core is if the **Query Planner** explicitly invokes **Parallel Query**. When the Planner estimates that a table is large enough to justify the overhead, it requests **Background Workers** to divide the scan across multiple cores.
These workers process their portions of the table simultaneously. However, the workers must pass their partial results back to the original supervising process through a **Gather** node, using Inter-Process Communication (IPC). If too many workers are assigned, the IPC overhead outweighs the parallelism gains, and the query becomes *slower* than a single-process scan.

> [!TIP]
> **The limits**: The maximum number of parallel workers per query is controlled by `max_parallel_workers_per_gather` (default: 2). The total number of background workers across the entire database is capped by `max_worker_processes` (default: 8).

### The Latency Stack

Hardware latency is not linear. When a worker needs data, the physical distances are enormous:

| Layer | Typical Latency | Notes |
| :--------------------- | :-------------- | :----------------------------------------- |
| **CPU L1 Cache** | ~1.5 ns | On-chip; 32–64 KB per core |
| **CPU L2 Cache** | ~7 ns | Still on-chip; 256 KB – 1 MB per core |
| **Main Memory (RAM)** | ~100 ns | DIMM; fast, but 200× slower than L1 |
| **NVMe SSD** | ~100 µs | ~1,000× slower than RAM |
| **SATA SSD** | ~500 µs | ~5,000× slower than RAM |
| **Spinning HDD** | ~10 ms | ~100,000× slower than RAM; rotational seek |

The numbers matter more than any analogy. A random HDD read is five orders of magnitude slower than fetching from RAM. The gap between "data is cached" and "data is not cached" isn't a minor inconvenience—it changes the query's character entirely. This is why Postgres builds so much machinery around one goal: **keep the working set in RAM**.

### What Postgres Controls

Postgres can't change the hardware. What it controls is how it allocates and uses the memory it's given:

- **`shared_buffers`**: The shared page cache (the **[[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Warming Rack)|Warming Rack]]**). All backend processes read from and write to this pool.
Pages evicted from here must be re-fetched from disk.
- **`work_mem`**: Per-operation scratch space for sorts, hash tables, and bitmap builds. When an operation exceeds this budget, it **spills** sorted runs to temporary files on disk.
- **`maintenance_work_mem`**: A separate, higher budget reserved for `VACUUM`, `CREATE INDEX`, and `CLUSTER`. These bulk operations benefit from larger allocations to avoid repeated disk passes.

When a query outgrows its `work_mem` budget, Postgres doesn't fail—it spills. But spilling re-introduces the latency tax that the entire caching architecture exists to avoid.

---

## 6.1 The Process Family

![[assets/arch_process_family.png|450]]

When you start Postgres, you aren't launching a single program. You're launching a **supervisor** that immediately spawns a small family of specialized worker processes. Each one has a narrow, well-defined job. Understanding who they are—and what they do between your queries—is essential to understanding every chapter that follows.

You can see the family yourself:

```sql
SELECT pid, backend_type
FROM pg_stat_activity
WHERE backend_type != 'client backend';
```

```
 pid |         backend_type
-----+------------------------------
  68 | autovacuum launcher
  65 | background writer
  64 | checkpointer
  69 | logical replication launcher
  67 | walwriter
```

These five processes (plus the **Postmaster** itself, which doesn't appear in `pg_stat_activity`) are running at all times, even when no client is connected. They are the engine's autonomic nervous system.

### The Postmaster

The **Postmaster** is the parent of every other Postgres process. It is the first process that starts and the last to die. Its responsibilities are simple and strict:

1. **Listen** for incoming connections on the configured port.
2. **Fork** a new **backend process** for each accepted connection.
3. **Supervise** all child processes.
If any child crashes, the Postmaster terminates all remaining children, cleans up shared memory, and restarts the background workers from a clean state.

The Postmaster itself never executes queries. It never touches your data. It exists solely to manage the lifecycle of the processes that do.

> [!IMPORTANT]
> **Why this matters**: The Postmaster's crash-recovery behavior is the reason Postgres uses processes instead of threads. A crashed backend takes down only itself. The Postmaster detects the failure, resets shared memory, and re-spawns the background workers—all without a full server restart.

### Backend Processes (One Per Connection)

When you connect via `psql` or your application's connection pool, the Postmaster forks a dedicated **backend process** for your session. This process:

- Parses your SQL
- Plans the query
- Executes it against **[[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Warming Rack)|Shared Buffers]]**
- Returns results to your client

Each backend is an independent OS process with its own memory space (for `work_mem` sorts, hash tables, etc.) but sharing the global `shared_buffers` pool. When you disconnect, the backend exits.

### The Background Workers

The remaining processes run continuously in the background, performing maintenance that keeps the engine healthy between—and during—your queries.

### Checkpointer

Periodically flushes all **dirty pages** from Shared Buffers to disk and writes a **Checkpoint Record** to the WAL. After a checkpoint, Postgres knows it can recover from that point forward without replaying older WAL segments. Controlled by `checkpoint_timeout` (default: 5 minutes) and `max_wal_size`.

→ *Introduced in [[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Pocket Diary)|5.1 The Pocket Diary]]*

### Background Writer

Proactively scans Shared Buffers for dirty pages and flushes small batches to disk in quiet intervals.
Its goal is to ensure that when a backend needs a free buffer, one is already available—preventing the backend from stalling to perform an expensive synchronous write. Controlled by `bgwriter_delay` (default: 200ms).

→ *Introduced in [[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Pocket Diary)|5.1 The Pocket Diary]]*

### WAL Writer

Flushes the **WAL Buffers** (in-memory WAL records) to the WAL files on disk. When `synchronous_commit = on`, individual backends perform their own `fsync()` calls, but the WAL Writer handles the background flushing for asynchronous commits and group commit batching. Controlled by `wal_writer_delay` (default: 200ms).

→ *Introduced in [[Manuscript/05 - Durability & Transactions/5.1.1 - Commit Tuning (The Loose Handshake)|5.1.1 The Loose Handshake]]*

### Autovacuum Launcher

The launcher is a scheduler. It monitors table statistics and, when a table accumulates enough dead tuples, spawns **Autovacuum Worker** processes to reclaim space and update visibility maps. The number of concurrent workers is limited by `autovacuum_max_workers` (default: 3).

→ *Covered in depth in [[Manuscript/06 - Resource Management & Processes/6.4 - Vacuum & Freezing (The Housekeepers)|6.4 The Housekeepers]]*

### Logical Replication Launcher

If logical replication is configured, this process manages the lifecycle of **Logical Replication Workers**—the processes responsible for applying changes received from a publisher to the local database.

→ *Referenced in [[Manuscript/05 - Durability & Transactions/5.3 - Logical Replication (The Town Crier)|5.3 The Town Crier]]*

### Startup Process

The first child the Postmaster spawns on boot. It replays WAL records to bring the database to a consistent state after an unclean shutdown. On a **replica**, the startup process runs continuously, applying WAL streamed from the primary.
→ *Referenced in [[Manuscript/05 - Durability & Transactions/5.2 - Crash Recovery (The Recovery Parade)|5.2 The Recovery Parade]]*

### Stats Collector

Aggregates runtime statistics from all backends—row counts, block hits, sequential scans, index usage—and exposes them through the `pg_stat_*` views. This data drives the **Autovacuum Launcher's** threshold decisions and your own monitoring queries; without it, autovacuum is flying blind. (Note that the **Query Planner's** cost estimates come from a different source: the per-column statistics gathered by `ANALYZE`. And in Postgres 15 and newer, this separate process was replaced by a shared-memory statistics system, though the views behave the same.)

### Archiver

Active only when `archive_mode = on`. When a 16MB WAL segment is completed, the Archiver copies it to a designated backup location (local directory, S3, etc.). This is the foundation of **Point-in-Time Recovery (PITR)**—the ability to restore the database to any arbitrary moment in the past.

### Logger

Active when `logging_collector = on`. Captures all server log output (errors, warnings, slow queries) and writes it to log files, handling rotation and retention. Without it, log output goes to `stderr` only.

### WAL Sender / WAL Receiver

Used in **streaming replication**. The **WAL Sender** runs on the primary and streams WAL records over the network to replicas. The **WAL Receiver** runs on the replica and feeds the received records to the **Startup Process** for replay.

→ *Referenced in [[Manuscript/08 - Distributed Scaling & Clouds/8.0 - Distributed Storage (The Elephant in the Clouds)|Chapter 8 - Distributed Storage]]*

### The Full Picture

```
postmaster (PID 1)
├── startup (crash recovery / WAL replay)
├── checkpointer
├── background writer
├── walwriter
├── stats collector
├── autovacuum launcher
│   └── autovacuum worker (spawned on demand)
├── logical replication launcher
│   └── logical replication worker (spawned on demand)
├── archiver (if archive_mode = on)
├── logger (if logging_collector = on)
├── wal sender (if replicas are connected)
├── backend (connection 1)
├── backend (connection 2)
└── backend (connection N)
```

Every process in this tree shares the same `shared_buffers` segment.
Every process writes to the same WAL. The Postmaster watches them all, and if any single one fails, it resets the shared state and rebuilds the family from scratch.

---

## 6.2 The Warming Rack (Shared Buffers)

![[assets/arch_shared_buffers_rack.png|450]]

In the Elephant Cafe, the **[[Manuscript/02 - Physical Storage & MVCC/2.4 - Relation (The Table)|Filing Cabinet]]** is the ultimate destination for data, but it is too slow for frequent access. To hide the lethargy of the disk, Postgres uses a massive, shared area of RAM known as the **Shared Buffers**. This is the **Warming Rack**: a high-speed staging area where the most popular 8KB **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Pages]]** are kept hot and ready for the staff.

### The Shared Memory Structure

In Postgres, every active query is executed by a completely separate, isolated worker process. Because these workers are isolated for stability, they cannot naturally see each other's memory. To solve this, Postgres allocates `shared_buffers` at boot time as a massive block of **Inter-Process Communication (IPC) Memory**: a centralized staging area where all isolated workers can access the exact same 8KB **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Pages]]** without constantly copying data back and forth.

When a query needs a page, Postgres doesn't just wander over to the disk; it first performs a high-speed lookup in the **Buffer Mapping Hash Table**.

- **Buffer Hit**: The hash table points to a slot in RAM. The page is fetched in nanoseconds.
- **Buffer Miss**: The hash table returns nothing. The engine must issue a physical `pread()` to the OS, find a free slot on the rack, and load the page from disk.

### The Divided Registry (Mapping Locks)

If hundreds of backends all tried to check the Warming Rack at the same time, they would collide at the registry.
To prevent this, Postgres divides the **Buffer Mapping Hash Table** into **128 separate partitions** (by default). Each partition has its own **Lightweight Lock (LWLock)**. This allows many staff members to check for different pages simultaneously without waiting for a single global lock. It is the architectural difference between a single receptionist and a row of 128 self-check-in kiosks.

### The Bolt on the Plate (Buffer Pinning)

When a staff member finds a page and starts reading it, they don't just hold it in their hands. They **Pin** the buffer. Think of this as **Bolting the plate to the rack**. In technical terms, a Pin is an **Atomic Reference Count**. When a worker pins a buffer, the count goes up by one. While a buffer is pinned (reference count > 0):

1. **Eviction Protection**: The Clock Sweep is physically incapable of kicking this page off the rack to make room for new data.
2. **Structural Safety**: Other workers can still *read* the page (a Shared Pin), but nobody can *move* the physical location of the page or *repurpose* the memory slot.

If you see a query waiting on a **`BufferPin`** wait event, it means an operation (like a `VACUUM` trying to physically clean up a page) has ground to a halt. It is standing next to the plate with a wrench, waiting for all other processes to finish reading and un-bolt their pins so the reference count can drop back to zero.

> [!TIP]
> **Why not 100% RAM?**: We usually set `shared_buffers` to **25% of system RAM**. We leave the rest for the Building Manager (the OS Page Cache). This avoids "Double Buffering"—a situation where both Postgres and the OS are hoarding the exact same page, effectively halving the Cafe's usable memory.

### The Clock Sweep (Victim Buffer Selection)

Because the Warming Rack is finite, the database engine must eventually decide which pages to kick off the rack to make room for new ones. This is enforced by the **Clock Sweep Algorithm**. Imagine a clock hand sweeping across the Warming Rack.
Every time the hand passes a page descriptor, it performs a mechanical check of the `usage_count`:

1. **Check the Temperature**: If the `usage_count` is greater than 0, the hand decrements it by 1 and moves to the next page.
2. **Find a Victim**: If the `usage_count` is 0, the page is a candidate to be evicted (a **Victim Buffer**).
3. **The Flash Heat**: If a staff member accesses a page while it’s on the rack, the database engine instantly bumps its `usage_count` back up (to a maximum of 5).

This ensures that "Hot" pages (those with frequent hits) stay on the rack indefinitely, while "Cold" data slowly loses its heat and is eventually recycled.

> [!NOTE]
> **Strategy Overrides**: For massive operations like a `VACUUM` or a `Sequential Scan` of a multi-terabyte table, Postgres doesn't want to "pollute" the entire Warming Rack. Instead, it uses a **Buffer Access Strategy**—a private, tiny ring of just 32 pages—to avoid kicking out all the useful, hot data that other guests are using.

### The Size of the Rack

You can ask the database engine exactly how large its Warming Rack is with a simple command:

```sql
-- How many pages can we keep warm?
SHOW shared_buffers;

-- Result:
-- shared_buffers
-- ----------------
-- 128MB
```

If you want to see exactly *which* suitcases are currently warm, you can use the **`pg_buffercache`** extension. It’s like a thermal camera for the database!

---

## 6.3 The Private Desk (work_mem)

![[assets/arch_private_desk_spill.png|450]]

While **Shared Buffers** (Section 6.2) are communal, many operations require private, high-speed scratch space. When a Food Runner needs to perform a **[[Operations/ResultSet/Sort|Sort]]** or build a **[[Operations/ResultSet/HashJoin|Hash Join]]**, they cannot use the communal warming rack—it would be too slow and would displace everyone else's hot data. Instead, every individual operation is dynamically allocated a **Private Desk** called **`work_mem`**.
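To make the desk's bargain concrete before diving into the mechanics, here is a toy Python sketch (illustrative only, not Postgres source code; the budget is counted in rows rather than bytes for simplicity). The core decision is the same one Postgres makes: if the data fits inside the budget, sort it in memory; otherwise, sort budget-sized runs, write them out, and merge the runs back together.

```python
import heapq
import random

def tuplesort(rows, work_mem_rows):
    """Toy model of the work_mem bargain: quicksort in memory if the
    input fits the budget, otherwise an external merge of sorted runs."""
    if len(rows) <= work_mem_rows:
        return sorted(rows), "quicksort"  # in-memory path

    # Spill path: cut the input into budget-sized runs and sort each one
    # (Postgres would write these runs to temp files in base/pgsql_tmp/),
    # then stream-merge the sorted runs back into a single ordered output.
    runs = [sorted(rows[i:i + work_mem_rows])
            for i in range(0, len(rows), work_mem_rows)]
    return list(heapq.merge(*runs)), "external merge"

rows = [random.randrange(10_000) for _ in range(1_000)]
fast, method = tuplesort(rows, work_mem_rows=5_000)  # fits: in-memory sort
slow, spilled = tuplesort(rows, work_mem_rows=100)   # overflows: 10 runs merged
assert fast == slow == sorted(rows)
assert (method, spilled) == ("quicksort", "external merge")
```

Both paths produce identical output; the difference the next sections dwell on is purely *where* the intermediate data lives, and therefore how fast each comparison is.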
### The Tuplesort Mechanic

When Postgres executes an `ORDER BY`, it initiates a **Tuplesort**: a specialized sort module, running inside your backend, that attempts to organize the data entirely within the physical bounds of your `work_mem`.

- **In-Memory Sort (Quicksort)**: If the entire data set fits on the Private Desk, Postgres performs a lightning-fast Quicksort entirely in RAM.
- **The Desk Spill (External Merge Sort)**: If the data exceeds `work_mem`, the Tuplesort transitions to an **External Merge Sort**. The meticulous engine breaks the data into chunks, physically writes them to **Logical Tapes** (temporary disk files in `base/pgsql_tmp/`), and merges them back together.

Spilling is disastrous for performance. It forces Postgres to move from ~100 ns RAM access speeds to ~100 µs SSD access speeds—three orders of magnitude slower. A 10MB spill can turn a sub-second query into a multi-second ordeal.

### The Hash Join Batches

Joins rely just as heavily on the Private Desk. When executing a Hash Join, the engine reads the smaller of the two tables and attempts to build an **In-Memory Hash Table** on the desk to perform O(1) lookups. If the hash table exceeds the `work_mem` limit, the executor splits the operation into **Batches**, dumping the overflowing data into temporary spill files on disk. Instead of a single swift lookup, the engine is forced into a multi-pass operation, dramatically increasing I/O latency.

### The Multiplication Hazard

Unlike `shared_buffers`, which is a single block of memory allocated once at startup, `work_mem` is allocated **per operation.**

> [!CAUTION]
> **The Memory Explosion**
> If a query plan contains two **Sort** nodes and one **Hash Join**, that single query can consume **3x** the `work_mem` setting. If you have 100 concurrent connections running that exact query, they will demand **300 units** of `work_mem`.
>
> Setting `work_mem` to 64MB sounds perfectly safe until 100 users try to run a complex dashboard simultaneously, instantly demanding **19.2 GB of RAM**.

If total transient demand exceeds physical RAM, the operating system (the Building Manager) will intervene with the **OOM Killer**—forcefully terminating the most resource-hungry Postgres process to save the server from crashing.

### The Contractor's Workbench (maintenance_work_mem)

There is a separate, special class of memory reserved for administrative tasks like `CREATE INDEX` or `VACUUM`. This is the **`maintenance_work_mem`**. Think of it as the **Contractor's Workbench**. Because these operations are generally executed sequentially by database administrators rather than triggered thousands of times a second by web users, this workbench can safely be set much larger than standard `work_mem` (often 1GB or 2GB). Giving a `VACUUM` worker a massive workbench allows it to clean the entire cafe floor in a single, lightning-fast pass instead of stopping partway to make repeated trips over the indexes.

### Adjusting the Desk

If the meticulous engine is struggling with a messy sort, you can temporarily give it a larger desk for just your active session:

```sql
-- Giving a staff member more room to breathe
SET work_mem = '64MB';

-- Checking the result of a messy sort
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM supply_deliveries ORDER BY quantity_kg;

-- Look for: "Sort Method: quicksort  Memory: 4500kB"
```

If you see **`Sort Method: external merge`** or **`Hash Batches: > 1`** in the EXPLAIN output, the desk was too small and the engine spilled your data to the floor.

---

## 6.4 The Housekeepers (Vacuum & Freezing)

![[assets/arch_maintenance_crew.png|450]]

In the Elephant Cafe, data never truly vanishes—it just fades. Old row versions (dead tuples) linger in their shipping containers like ghostly, spectral suitcases, taking up valuable space.
If left alone, they create **[[Manuscript/06 - Resource Management & Processes/6.5 - Tuple Bloat (The Great Overflow)|Bloat]]**, forcing the engine to work harder to find the live data. Furthermore, every suitcase is stamped with a 32-bit Transaction ID that will eventually expire, leading to the identity crisis of **[[Manuscript/05 - Durability & Transactions/5.6 - Transaction ID Wraparound (The Infinite Calendar)|XID Wraparound]]**. To keep the Depot running smoothly, Postgres employs a tireless **Maintenance Crew** (Background Workers) to sweep out the ghosts and stamp the permanent history.

### The RAM Managers: BGWriter vs. Checkpointer

There are two primary workers responsible for managing the flow of data between the **[[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Warming Rack)|Warming Rack (RAM)]]** and disk. They serve different, essential masters:

#### 1. The Background Writer (Utility)

The **Background Writer** is a proactive cleaner. Its master is **Efficiency**. It scans the Warming Rack for "dirty" pages and gently flushes them to disk *before* anyone actually needs the slot. This ensures that when a new Staff Member arrives with a fresh plate, they don't have to wait for a long, slow disk-write before they can sit down.

#### 2. The Checkpointer (Safety)

The **Checkpointer** is a formal foreman. Its master is **Durability**. At regular intervals, it ensures that every promise written in the **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Pocket Diary)|Pocket Diary (WAL)]]** has been physically etched into the Filing Cabinet. Once finished, it marks the diary so old logs can be recycled.

> [!NOTE]
> **The Difference**: The BGWriter moves data to make **RAM** faster; the Checkpointer moves data to make the **WAL** safe.

### The Cleanup Crew: Autovacuum

The most complex housekeeper is **Autovacuum**.
Its job is to reclaim space from the ghostly "Dead Tuples" left behind by updates and deletions—think of the **Raccoon** sweeping the Depot floor to find space for new arrivals. It relies on two physical maps stored alongside your table:

- **Visibility Map (VM)**: A bitset indicating the status of each page. Unlike a simple true/false toggle, the modern VM uses **Two Bits** per page:
    - **Bit 0 (All-Visible)**: If set, every suitcase in the container is visible to everyone. The bouncer at the door (The Planner) uses this to perform **Index-Only Scans** without opening the table.
    - **Bit 1 (All-Frozen)**: If set, every suitcase in the container has been officially frozen (marked as permanent history). The Housekeepers use this to skip the page entirely during anti-wraparound runs, saving massive amounts of I/O.
- **Free Space Map (FSM)**: A record of exactly how much reusable space remains on each 8KB page (each shipping container), helping future inserts find a home.

### The Shared Lock Crowding (Multixacts)

A tuple header is small; it only has room for a single `xmin` and `xmax`. What happens if ten different patrons all want to place a "Shared Lock" (`SELECT FOR SHARE`) on the same dish? Instead of trying to stuff ten IDs into one 4-byte slot, Postgres moves the coordination to a separate side-file called **`pg_multixact`**. Postgres puts a single "MultiXact ID" in the suitcase header, which acts as a claim check for a list of lockers in the side-file.

> [!WARNING]
> **Maintenance Tax**: While this allows effectively unlimited shared locking, it creates a new maintenance burden. The Housekeepers must now clean up the old `pg_multixact` files as well. If your application uses excessive shared row-locking, your `autovacuum` might spend more time cleaning the "Overflow Pockets" than the actual tables.

### The Two-Phase Vacuum

Vacuuming a table is not a single swipe; it is a mechanical two-phase operation:

1. **Phase 1: Heap Scan**: Autovacuum scans the table's "Heap" (the pages of suitcases). It collects the physical addresses (**TIDs**) of all dead tuples and stores them in memory (`maintenance_work_mem`).
2. **Phase 2: Index Cleanup**: Once the "Master List" of dead TIDs is collected, Autovacuum visits every **[[Manuscript/03 - Access Paths & Indexing/3.1 - B-Tree (The Balanced Bookshelf)|Index (Map)]]** associated with the table. It removes the pointers that point to the dead TIDs. Only after the indexes are clean can the original space in the heap be marked as "Free" in the FSM.

> [!IMPORTANT]
> **Maintenance Budget**: If `maintenance_work_mem` is too small, Autovacuum can't fit enough dead TIDs in memory. It is forced to pause the heap scan, go clean the indexes, and then *come back* to resume the heap scan. This causes a massive performance drag.

```sql
-- Is the vacuum crew keeping up?
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

If you see a high `n_dead_tup` count on any table, the crew is falling behind. By default, autovacuum triggers when dead tuples exceed **20%** of live tuples (`autovacuum_vacuum_scale_factor = 0.2`). For high-write tables, you'll want to lower that threshold:

```sql
-- Tell the vacuum crew to be more aggressive on this busy table
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- trigger at 1% dead tuples
    autovacuum_vacuum_cost_limit = 1000     -- give them a bigger broom
);
```

### The Freeze Ritual (FREEZE)

Beyond mere cleaning, the maintenance crew has a second, critical duty: **Freezing old rows**. As we learned in **[[Manuscript/05 - Durability & Transactions/5.6 - Transaction ID Wraparound (The Infinite Calendar)|Chapter 5.6]]**, every row is stamped with a 32-bit transaction ID (`xmin`).
With only ~4 billion available IDs, a long-lived database will eventually cycle back to the beginning—causing Postgres to confuse "ancient history" with "this afternoon." To prevent this, the crew performs a **Freeze** on rows that are old enough to be considered permanent—think of the **Frog** applying a golden seal to a suitcase in the Depot. A frozen row has its `xmin` replaced with a special `FrozenTransactionId` marker, which tells every future transaction: *"This row has always been visible. Don't question it."*

```sql
-- How close are our tables to the XID wraparound danger zone?
SELECT relname, age(relfrozenxid) AS xid_age, relpages
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
```

The older the `xid_age`, the more urgently the table needs a vacuum pass to freeze its rows. The danger threshold is ~**2 billion transactions**. You can also force a full freeze on any table at any time:

```sql
-- Manual emergency freeze for a specific table
VACUUM FREEZE orders;
```

> [!WARNING]
> Never disable `autovacuum`. It is not optional. Disabling it is a common mistake on "optimized" systems that leads to XID wraparound panics, ballooning table sizes, and eventually a Postgres emergency shutdown. The housekeepers must always be allowed to work.

### The Checkpoint Spike

If the Checkpointer is too aggressive—or if writes are extremely heavy—the moment of every checkpoint can flood the I/O elevator with dirty pages, causing a visible performance spike. This is why `checkpoint_completion_target` (default 0.9) exists: it tells the Checkpointer to *spread* the flush over 90% of the checkpoint interval, rather than trying to write everything at once.

Without these housekeepers, Postgres would eventually be buried in a mountain of dead suitcases, face an identity crisis from XID exhaustion, and bring the entire Cafe to a grinding, catastrophic halt. And that would be a very poor tea party indeed.
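Before leaving the housekeepers, it helps to see how rarely they are summoned by default. The documented trigger formula is `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples` (stock defaults: 50 and 0.2). A short Python sketch makes the scale problem tangible:

```python
def autovacuum_trigger(reltuples,
                       scale_factor=0.2,  # autovacuum_vacuum_scale_factor default
                       threshold=50):     # autovacuum_vacuum_threshold default
    """Dead-tuple count at which autovacuum queues a worker for a table,
    per the documented formula: threshold + scale_factor * reltuples."""
    return threshold + scale_factor * reltuples

# A 10,000-row table gets vacuumed after ~2,050 dead tuples...
assert round(autovacuum_trigger(10_000)) == 2_050
# ...but a 1-billion-row table waits for 200 million dead rows,
assert round(autovacuum_trigger(1_000_000_000)) == 200_000_050
# which is why busy tables get a lower per-table scale_factor:
assert round(autovacuum_trigger(1_000_000_000, scale_factor=0.01)) == 10_000_050
```

Lowering the per-table `scale_factor`, as shown in the `ALTER TABLE` example above, is how you tell the crew to start sweeping long before the mess reaches hundreds of millions of dead rows.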
---

## 6.5 The Great Overflow (Tuple Bloat)

![[assets/arch_vacuum_bloat.png|450]]

When the staff is updating and deleting records faster than the housekeepers can clean, the database enters a state of **Bloat**. This is a physical crisis: even though your "live" data remains the same size, the Filing Cabinet (the data file on disk) begins to swell with "shredded paper"—uncollected dead tuples that have not yet been recycled.

### The Mechanic of Table Bloat

In **[[Manuscript/02 - Physical Storage & MVCC/2.5 - MVCC (The Sharpie Ledger)|MVCC]]**, deletes and updates leave "ghost" records behind. Under normal circumstances, **Autovacuum** sweeps these records and marks the space as "Free" in the **Free Space Map (FSM)**. Future inserts will find these empty holes and fill them. The file size remains stable.

However, if a transaction stays open for too long (a **Long-Running Transaction**), Autovacuum cannot clean the rows that were deleted *after* that transaction started. Postgres must keep those dead rows around just in case the long-running guest needs to see them. The scraps pile up. If new data arrives while the floor is covered in un-sweepable scraps, the database engine has no choice but to add **New Pages** to the end of the file. The database is now physically larger than it needs to be.

### The Librarian's Dilemma: Index Bloat

While `VACUUM` can eventually clean the heap, it struggles with the **[[Manuscript/03 - Access Paths & Indexing/3.1 - B-Tree (The Balanced Bookshelf)|Index (The Librarian's Map)]]**. When a row is deleted, the pointer in the B-Tree doesn't just evaporate. The page in the index might become 90% "empty," but the B-Tree structure remains. Because an index is a sorted tree, the Librarian cannot easily "merge" partially empty pages without a massive, expensive re-shuffling of the entire bookshelf. This is **Index Bloat**.
A bloated index makes every search slower because the Runner has to traverse more levels of the tree and read more pages just to find the same amount of data.

> [!WARNING]
> **The Vacuum Trap**: Standard `VACUUM` does **not** return disk space to the Operating System. It only makes the space available for future Postgres data. To truly shrink a 100GB file back down to 10GB, you must perform a structural reclamation.

### Reclaiming the Space (The Nuclear Options)

If the bloat is so catastrophic that it threatens to fill the entire disk, you have three mechanical choices:

1. **`REINDEX`**: This throws away a bloated index and builds a fresh, compact one from scratch. It is the best way to fix "The Librarian's Dilemma."
2. **`VACUUM FULL`**: The ultimate structural repair. Postgres creates a brand-new, perfectly compact data file and moves every live suitcase into it. This requires an **[[Manuscript/07 - Wait Events & Concurrency/7.5 - Heavyweight Locking (The Narrow Gate)|AccessExclusiveLock]]**—the Cafe is closed to all patrons until the move is finished.
3. **External `pg_repack`**: A community extension that performs the equivalent of a `VACUUM FULL` in the background, without locking the table for long periods. It is the "Professional Mover" of the Postgres world.

### The Diagnostics: Checking the Bins

To find out if your housekeepers are falling behind, you must look at the **Waste Audit**:

```sql
-- How much shredded paper is on the floor?
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'supply_deliveries';
```

If **`n_dead_tup`** (dead tuples) is significantly higher than **`n_live_tup`**, Postgres is drowning in its own trash!

### The Solution: Tuning the Cleaners

The database engine is very conservative. By default, it only starts cleaning when 20% of a table is dead. For a table with a billion rows, that’s 200 million dead rows before the vacuum even turns on!
To maintain a healthy database, you must adjust the **Thresholds of Cleanliness**:

```sql
-- Telling the Housekeeper to be more aggressive on this table
ALTER TABLE supply_deliveries SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_limit = 1000
);
```

By lowering the `scale_factor` to 1% (0.01), we ensure the housekeeper starts sweeping much earlier. By increasing the `cost_limit`, we give them a bigger broom!

> [!TIP]
> **The VacuumDelay Wait Event**: When Autovacuum hits its `cost_limit`, it intentionally goes to sleep so it doesn't consume all your I/O bandwidth. If you use **[[Manuscript/07 - Wait Events & Concurrency/7.1 - Wait Events (The Relay Race)|pg_wait_tracer]]** and see Autovacuum spending 90% of its time in the `Timeout:VacuumDelay` state, it means your housekeeper is sleeping too much and sweeping too little. Increase your `cost_limit` or reduce the `cost_delay`!

### The Nuclear Option (VACUUM FULL)

If the bloat is so bad that you absolutely *must* reclaim the disk space immediately, you can order a **`VACUUM FULL`**. Be warned: this is the **Nuclear Option**. Postgres will lock the entire table heap, move every single suitcase into a brand-new cabinet, and throw the old cabinet in the river. No one can order tea, no one can pay their bill, and the Cafe is effectively closed for the duration of the move.

Great engineers use `VACUUM FULL` only in the direst of emergencies! If you need the space back without closing the doors, reach for `pg_repack` instead: it’s more expensive and requires more "staging area" (disk space), but it keeps the Cafe open while the trash is removed!

### Summary: Postgres's Menu (The Tuning Matrix)

Tuning the database engine's appetite is about balancing speed and safety. Use this matrix to guide your next configuration:

| Setting | Role | The "Lazy" Wisdom |
| :------------------------- | :---------------------- | :-------------------------------------------------------- |
| `shared_buffers` | The Warming Rack | Set to ~25% of RAM. Leave the rest for the Building Manager (OS).
|
| `work_mem` | The Private Desk | Set low (4-16MB) by default. Set high for specific messy sessions. |
| `maintenance_work_mem` | The Housekeeping Budget | Set high (256MB-1GB) so the Vacuum Crew has a bigger broom. |
| `max_wal_size` | The Pocket Size | Larger keeps the checkpointer from panicking; smaller saves disk. |
| `autovacuum_vacuum_scale_factor` | The Mess Threshold | Lower it for busy tables so the cleaners start earlier. |
| `autovacuum_vacuum_cost_limit` | The Cleaning Speed | Increase it to let the cleaners work faster at the cost of I/O. |

By mastering the Warming Rack and the Housekeeper's Budget, you ensure the Elephant Cafe is lean, fast, and always ready for the next lunch rush.

---

Now that Postgres has a full stomach and a clean engine, it needs to worry about the **Crowds**. In the next chapter, we'll explore **The Waiting Game (Workloads & Locking)**, where we see what happens when too many elephants try to use the same suitcase at once!

---

## 6.6 The Physical Machine (RAM, CPU & Disk)

![[assets/arch_hardware_evolution.png|450]]

The previous sections described how Postgres *manages* its memory budgets. This section describes what is actually happening in the hardware and OS kernel when those budgets are exercised — because the decisions Postgres makes only make sense once you can picture what is happening below the abstraction.

### RAM: Two Caches, Not One

There is a common misconception about `shared_buffers`: that it *is* the Postgres memory, and that everything else goes to disk. The reality is more layered.

When Postgres reads a page from disk, it calls `pread()` — a standard POSIX syscall. The kernel services this call through the **Linux Page Cache** (also called the **Buffer Cache**), a kernel-managed pool of recently accessed file pages. The kernel doesn't know anything about Postgres tables or 8KB pages; it just sees file offsets.

So when a page arrives in Postgres, it has actually passed through *two* independent caches:

1. 
**The OS Page Cache**: Kernel-managed. Contains file pages as the kernel understands them. Lives in ordinary kernel address space.
2. **`shared_buffers`**: Postgres-managed. Contains those same pages mapped into Postgres's shared memory segment, indexed by the Buffer Mapping Hash Table.

This is why the conventional advice caps `shared_buffers` at ~25% of RAM. The OS Page Cache is not idle; it is actively caching the same files below Postgres's sight line. Giving Postgres 90% of RAM doesn't eliminate that second cache — it just starves it. On workloads with large sequential scans or heavy checkpoint I/O, the OS Page Cache does meaningful, non-redundant work.

> [!NOTE]
> **The `O_DIRECT` exception**: Some systems configure Postgres to use `O_DIRECT` for WAL writes, bypassing the OS Page Cache entirely for those files. This avoids double-buffering the WAL. It is not the default.

### How `shared_buffers` is Allocated

At startup, Postgres allocates `shared_buffers` as a single large **shared memory segment** using `shmget()` or `mmap()` (depending on the OS and configuration). All backend processes attach to this same segment. This means every backend reading or writing a page in the buffer pool is accessing the exact same physical memory — no copying, no per-process duplication.

This also means that when you set `shared_buffers = 8GB`, Postgres reserves an 8GB shared mapping at startup. With `huge_pages = on`, that memory is pinned in physical DRAM and the OS cannot page it out; without huge pages, the kernel can still swap rarely touched buffers, which is its own performance hazard.

### CPU: When I/O Is Not the Bottleneck

Once a page is in `shared_buffers`, the query engine starts doing CPU work:

- **Tuple evaluation**: Walking the rows in a page and evaluating WHERE clause predicates.
- **Hash computation**: Building hash tables for Hash Joins and Hash Aggregations.
- **Sort passes**: Quicksort in-memory for Sort nodes.
- **Expression evaluation**: Running functions, type coercions, and operator calls.

A query whose working set fits entirely in `shared_buffers` is CPU-bound, not I/O-bound. 
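
You can see which world a given query lives in with `EXPLAIN`'s `BUFFERS` option. A sketch, reusing this chapter's hypothetical `supply_deliveries` table (the `status` column is an illustrative assumption): in the output, `shared hit` counts pages served from `shared_buffers`, while `read` counts pages fetched from below (OS Page Cache or disk).

```sql
-- High "shared hit" with low "read": the query is CPU-bound.
-- High "read": the working set is not in shared_buffers — I/O-bound.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM supply_deliveries
WHERE status = 'delayed';
```
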
Throwing more RAM at a CPU-bound query won't help. Adding an appropriate index to eliminate the tuple evaluation loop will.

This distinction — **I/O-bound vs. CPU-bound** — is the central diagnostic axis of Chapter 7. Wait events tell you which world your query is living in.

> [!TIP]
> On modern multi-core hardware, Postgres can execute a single query using **parallel workers** (see `max_parallel_workers_per_gather`). Each worker is a separate OS process that the kernel can schedule on its own core. Parallel queries help on CPU-bound aggregations and sorts; they are irrelevant if the bottleneck is a single slow disk read.

### Disk: A History of I/O Assumptions

When a page is not in either cache, the database engine must fetch it from the block device. The kernel translates the `pread()` call into an I/O request submitted to the device driver. What happens next has changed dramatically over the history of database infrastructure — and will continue to change.

Postgres's defaults encode a specific set of assumptions about what "storage" costs. Understanding those assumptions, and when they were made, is the key to tuning correctly for the hardware you are actually running.

### Era 1: Spinning HDD (The Physical Constraint)

A spinning hard drive reads data by moving a physical read head across a rotating magnetic platter. The mechanical latency is real and unavoidable: a **rotational seek** takes 4–10 ms, bounded by the physics of moving steel.

The consequences for a database are severe. Reading 10,000 random 8KB pages from an HDD:

```
10,000 seeks × 8 ms average = 80 seconds
```

Sequential reads are dramatically cheaper — the head doesn't need to move far. This asymmetry between random and sequential I/O is the foundational assumption baked into Postgres's query planner. `random_page_cost` defaults to **4.0**. That number was chosen for spinning disk. It tells the planner: "a random read is four times more expensive than a sequential read." 
On spinning disk, this is conservative — the real ratio is often 50–100×.

### Era 2: SATA SSD (The First Inversion)

SATA SSDs have no moving parts. Random access latency drops to ~500 µs — roughly 10–20× faster than a seeking HDD. But SATA SSDs still use the SATA interface, which was designed for spinning disks. The interface itself is a bottleneck: SATA caps at ~550 MB/s throughput and can't issue many parallel I/O operations concurrently.

On a SATA SSD, the planning assumptions begin to break down. The `random_page_cost = 4.0` default overstates the penalty for random reads. Many workloads benefit from lowering it to ~2.0.

### Era 3: NVMe (The Interface Disappears)

NVMe SSDs connect over PCIe lanes directly to the CPU. They have no mechanical seek, no SATA bottleneck, and can service tens of thousands of I/O operations in parallel. Random access latency is ~100 µs or lower.

At this point, the distinction between "random" and "sequential" reads has nearly vanished at the hardware level. The `random_page_cost = 4.0` Postgres default is now wrong by a factor of several.

```sql
-- For NVMe: close the planning gap between random and sequential
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```

> [!WARNING]
> Don't go all the way to `random_page_cost = 1.0`. Even NVMe has a real cost relative to L2 cache (100 µs vs. 7 ns — a 14,000× difference). The planner needs *some* signal that disk is not free.

### Era 4 and Beyond: Persistent Memory (The Category Blurs)

**Persistent Memory (PMEM / Optane)** sits in DIMM slots alongside DRAM, but survives power loss. Latency is ~300 ns — faster than NVMe by roughly 300×, and only ~3× slower than DRAM. This completely dissolves the traditional RAM/disk boundary.

Storage-class memory architectures like this have already influenced database design. **Neon** and **Amazon Aurora** separate compute from storage entirely, replacing the local disk with network-attached page stores. 
WAL is shipped over the network instead of written to local disk.

The lesson is not a specific set of numbers to memorize. It is this:

> **Postgres's defaults are calibrated to an era. The hardware has already changed twice since those defaults were written. It will change again.**

Every time you deploy Postgres on new infrastructure — a new instance type, a new cloud region, a new disk tier — the correct behavior is to benchmark actual I/O latency and re-calibrate `random_page_cost`, `seq_page_cost`, and `effective_cache_size` accordingly. The database engine has no way to detect that it's running on NVMe instead of a 2003-era SCSI disk. That judgment is yours.

```sql
-- effective_cache_size tells the planner how much cache — shared_buffers
-- plus the OS Page Cache — is likely available for a single query.
-- It does not allocate memory; it only influences index-vs-seqscan decisions.
SHOW effective_cache_size;

-- A reasonable starting point: about 75% of total system RAM
ALTER SYSTEM SET effective_cache_size = '24GB'; -- for a 32GB server
SELECT pg_reload_conf();
```

> [!NOTE]
> **The Analogy Limit**: The Warming Rack metaphor accurately describes the *role* of `shared_buffers`, but it implies a static, fixed kitchen. Real storage hardware is not static. The "size" of the room, the "speed" of the pantry, and the "cost" of walking to it are all parameters of the hardware generation you happen to be running — not constants of the universe.

### Observing Hardware Pressure: `pg_stat_io`

Postgres 16 introduced `pg_stat_io`, a view that exposes per-backend, per-context I/O counters. This is the most direct window into whether your hardware is actually being exercised:

```sql
-- Which backend types are hitting disk the most?
SELECT backend_type, object, context,
       reads, hits, writes, evictions,
       ROUND(hits::numeric / NULLIF(reads + hits, 0) * 100, 1) AS hit_rate_pct
FROM pg_stat_io
WHERE reads + hits > 0
ORDER BY reads DESC;
```

A `hit_rate_pct` near 100% means the Warming Rack is doing its job and reads are being served from RAM. 
If `hit_rate_pct` drops below ~95% on a busy OLTP system, the buffer pool is likely too small for the working set, and physical disk I/O is becoming a significant cost.

```sql
-- Reset to get a fresh baseline after config changes
SELECT pg_stat_reset_shared('io');
```
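
For a finer-grained view than `pg_stat_io`, the contrib extension `pg_buffercache` lets you inspect *what* is currently sitting on the Warming Rack, not just how often you hit it. A sketch, assuming the extension can be installed and the standard 8KB page size:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Which relations occupy the most shared_buffers pages right now?
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```

If your hottest table barely appears in this list, the working set you *think* you are caching is not the one Postgres is actually keeping.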