# Chapter 5: Durability & Transactions

## 5.0 - Write-Ahead Log (Safety Without Sweating)

<img src="assets/chap_4_safety.png" width="250" style="float: left; margin: 0 20px 20px 0;" />

Postgres is built on a simple, uncompromising trust: once a transaction is confirmed, the data must survive any system failure. This reliability is the engine's primary currency. If the system loses power or the operating system crashes, the promises made to the user must remain etched in reality.

### What You'll Learn

- Why direct page updates are an architectural catastrophe (Torn Writes, Random I/O, Atomicity failures)
- How the **Write-Ahead Log (WAL)** provides durability through sequential, append-only writes
- How **Transactions** and the WAL together guarantee **Atomicity** — all-or-nothing commit semantics
- How Postgres recovers from a crash by replaying WAL records from the last Checkpoint

To understand how Postgres maintains this trust, we must first look at the most intuitive—and most flawed—way to persist data.

### The Naive Model: Direct Page Updates

In a naive world, every time you update a **Tuple** (a row version), Postgres would immediately find the relevant **8KB [[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Page]]** on disk and overwrite it with the new data. It is an obvious strategy, but in production, it is an architectural catastrophe.

### Why Direct Persistence Fails

There are three fundamental reasons why Postgres cannot update its data files directly for every change:

1. **The Random I/O Speed Trap**: Operating systems and storage hardware are significantly faster at writing one continuous stream of data (**Sequential I/O**) than jumping between isolated locations on disk (**Random I/O**). Erasing and re-writing an entire 8KB page just to change a single 10-byte tuple is like repainting an entire warehouse because of a smudge on a single crate.
2. **The Torn Write Risk**: If the system loses power mid-write, an 8KB page may only be partially updated. The result is a **Torn Write**, where half the page represents the old state and half represents the new. Without a backup, this results in permanent physical corruption.
3. **The Atomic Failure**: Complex operations (like money transfers) often span multiple rows in different pages. If the system crashes after the first page is written but before the second, the database is left in an impossible, inconsistent state.

This fundamental tension — the speed of the sequential log versus the permanence of the random heap — is the most important trade-off in database architecture.

### The Solution: The Write-Ahead Log (WAL)

To solve this, Postgres never writes directly to its data files during a transaction. Instead, it maintains a sequential, append-only **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Durable Ledger)|Write-Ahead Log (WAL)]]**. Rather than rewriting an 8KB page on disk, the engine appends a tiny record to the WAL describing the change: *"At LSN 0/1A2B, in page 88 at offset 24, replace these 40 bytes with these 40 bytes."*

Because the WAL is written sequentially, it is incredibly fast to flush. Once the record is physically on disk (`fsync`'d), Postgres considers the transaction committed. This is the foundation of **Durability**. If the power dies, Postgres wakes up, reads the WAL forward from its last known **Checkpoint**, and re-applies any records that hadn't yet been propagated to the main data files.
The actual 8KB heap pages are updated lazily in the background without blocking the user.

### The Atomicity Guarantee (Transactions)

The WAL also enables the **[[Manuscript/05 - Durability & Transactions/5.4 - Transactions (The Atomic Seal)|Atomic Seal (Transaction)]]**. This is the guarantee of **Atomicity**: either every action in a session is finalized, or none of them are. By using the **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Durable Ledger)|Durable Ledger]]** to track the beginning and end of a related set of changes, the engine can simply "rewind" or ignore any partial promises it finds in the log after a crash.

Postgres ensures that a record in the WAL is as durable as an entry in a master ledger. This allows the engine to be fast while remaining indestructible—giving us the peace of mind to trust our data even in large-scale, distributed environments.

---

## 5.1 - WAL & fsync (The Durable Ledger)

<img src="assets/arch_wal_diary.png" width="250" style="float: left; margin: 0 20px 20px 0;" />

Writing directly to data files for every transaction is inefficient. Random 8KB writes are expensive, and a crash during a write can result in an unrecoverable "torn page."

Postgres ensures durability via the **Write-Ahead Log (WAL)**. The engine records every change in a sequential log *before* modifying the data heap. Sequential log flushes are significantly faster than the random I/O required to update data files across the heap. Every change is encapsulated in an **`XLogRecord`**—a binary header followed by a physical delta.

### Anatomy of an `XLogRecord`

```text
┌────────────────────────────────────────────────────────────┐
│                         XLogRecord                         │
│                                                            │
│  xl_tot_len  │   xl_xid    │   xl_prev   │    xl_rmid      │
├──────────────┴─────────────┴─────────────┴─────────────────┤
│                        PAYLOAD DATA                        │
└────────────────────────────────────────────────────────────┘
```

- **`xl_tot_len`**: Total record size.
- **`xl_xid`**: Transaction ID.
- **`xl_prev`**: Pointer to the previous record's LSN (a contiguous chain).
- **`xl_rmid`**: The **Resource Manager ID** (Heap, B-Tree, etc.) that interprets the payload.
- **Payload**: The raw binary "delta."

### The Log Sequence Number (`LSN`)

Every WAL record is identified by a **Log Sequence Number (LSN)**—a 64-bit coordinate representing its byte-offset in the database's log history. The LSN is a monotonically increasing value that serves as the engine's primary timeline.

```sql
SELECT pg_current_wal_lsn();
-- 0/16A5E88

INSERT INTO ingredients (name, category) VALUES ('Peanut', 'Nut');

SELECT pg_current_wal_lsn();
-- 0/16A5FA0  (It moved!)
```

### Durability and the `fsync()` Syscall

A transaction is only durable once its WAL records are persisted to disk. Postgres uses the **`fsync()`** system call to ensure that data is physically written to the storage media before the transaction is considered committed. The engine waits for the kernel to confirm the persistence of the WAL data.

> [!TIP]
> **Living Dangerously**: If you are willing to risk a fraction of a second of data for a massive throughput boost, you can perform an **Asynchronous Commit** by setting `synchronous_commit = off`. See **[[Manuscript/05 - Durability & Transactions/5.1.1 - Commit Tuning (The Loose Handshake)|5.1.1 Asynchronous Commit]]** for the technical details.

> [!TIP]
> **Group Commit**: To minimize the cost of `fsync()`, Postgres uses **Group Commit**. The WAL writer can aggregate multiple concurrent transactions into a single flush operation. You can tune this behavior using **`commit_delay`** and **`commit_siblings`**.
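Before tuning any of this, it's worth checking where your cluster currently stands. A quick sketch (the values shown are stock defaults; `wal_sync_method` varies by platform):

```sql
SHOW fsync;               -- on
SHOW synchronous_commit;  -- on
SHOW wal_sync_method;     -- fdatasync on most Linux builds
```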
### Persisting the Heap: Checkpoints and Background Writing

While the WAL ensures durability, the actual data files are updated asynchronously via two background processes:

1. **Shared Buffers (Dirty Pages)**: When a page is modified in memory, it becomes "Dirty."
2. **The Checkpointer**: Every few minutes, the Checkpointer flushes all dirty pages to disk and writes a **Checkpoint Record** to the WAL. This mark guarantees that earlier WAL records are no longer required for crash recovery.
3. **The Background Writer**: This process scans for dirty pages and flushes them in small batches to ensure that clean buffers are always available for incoming data.

```mermaid
graph TD
    subgraph "The Durability Boundary (Synchronous)"
        A[UPDATE Statement] -->|"1. Append"| B[(WAL Buffer)]
        B -->|"2. fsync"| C[WAL File on Disk]
        C -.->|"3. Atomic Seal"| A
    end

    A -->|"4. Dirty"| D[Shared Buffers]

    subgraph "Deferred Persistence (Asynchronous)"
        D -.->|"5. Flush"| E[(Data Files)]
    end

    style C fill:#f96,stroke:#333,stroke-width:4px
    style E fill:#e3f2fd,stroke:#1565c0
    linkStyle 0,1,2 stroke:#c62828,stroke-width:2px;
    linkStyle 3,4 stroke:#2e7d32,stroke-width:2px,stroke-dasharray: 5 5;
```

### Full Page Writes (FPW)

There is a danger: a system crash mid-write could leave an 8KB Page half-updated. Because most hardware writes in 512-byte or 4KB chunks, this results in a **Torn Page**—a corrupted block that cannot be recovered by standard means.

Postgres uses **Full Page Writes** to prevent corruption. The first time a page is modified after a checkpoint, Postgres writes the entire 8KB image to the WAL. If a torn page is discovered during recovery, the engine restores the page from the full image in the WAL before applying subsequent deltas.

> [!IMPORTANT]
> **The Truth Hierarchy**:
> 1. **The WAL**: The immediate, absolute Truth.
> 2. **Shared Buffers**: The active, in-memory Truth.
> 3. **Data Files**: The deferred, eventual Truth.

The whole architecture rests on a single principle: a sequential log entry is as durable as the physical data page, provided the log entry is persisted before the data file is updated.

---

## 5.1.1 - Commit Tuning (The Loose Handshake)

<img src="assets/arch_wal_tweak.png" width="250" style="float: left; margin: 0 20px 20px 0;" />

In Postgres, durability is the primary source of commit latency. As we’ve seen, the **`fsync()`** system call blocks execution until the WAL record is persisted to non-volatile storage. For write-heavy workloads, this synchronous wait is the primary bottleneck.

If your application needs to handle high transaction volumes, you must choose a commit strategy that balances durability against throughput.

### 🧪 Lab Challenge: The Speed vs. Safety Trade-off

**The Request**: "We need to ingest 10,000 supply deliveries per second during the morning rush. The disk is screaming, and we're hitting a wall."

#### The Naive Solution

By default, Postgres uses **Full Durability**. Every single order waits for the disk to whisper "Safe!" (`synchronous_commit = on`).

```sql
-- Standard, safe transaction
BEGIN;
INSERT INTO supply_deliveries (...) VALUES (...);
COMMIT; -- The engine stalls here, waiting for the fsync() handshake.
```

#### The Fallout

You hit the physical limit of your disk's I/O latency. Even on a fast NVMe drive, if every transaction waits for a dedicated flush, you might top out at a few hundred transactions per second.
This leads to transaction queuing and reduced throughput as the engine waits for physical persistence.

#### The Lazy Fix

Enable **Asynchronous Commit**. For high-volume, non-critical data, you can instruct the engine to return success before the WAL record has reached the disk.

```sql
BEGIN;
SET LOCAL synchronous_commit = off; -- "Don't wait for the ledger."
INSERT INTO supply_deliveries (...) VALUES (...);
COMMIT; -- Instant return!
```

#### The Reward

Throughput increases significantly. The engine acknowledges the commit as soon as the record is written to memory. The **WAL Writer** flushes the records in an efficient batch operation shortly thereafter.

> [!WARNING]
> **The Data Loss Window**: If the system crashes *between* a COMMIT and the subsequent WAL flush, that data is lost. This trade-off is acceptable for telemetry or logs, but should be avoided for financial or critical transactions.

### The Group Rush: Tuning Group Commit

If you choose to keep safety `on`, you can still optimize performance by helping Postgres aggregate its work. This is managed by two settings:

#### 1. `commit_delay`

This setting instructs the backend to wait `commit_delay` microseconds before performing an `fsync()`. This allows multiple concurrent transactions to "group" their commits into a single disk write.

#### 2. `commit_siblings`

This is a guardrail for `commit_delay`. It specifies the minimum number of concurrent transactions required to trigger the delay. If fewer than `commit_siblings` are active, the engine flushes immediately.

### The Write Buffer: `wal_buffers`

Before WAL records hit the disk, they sit in the **WAL Buffers** (the mental scratchpad). By default, this is set to `-1`, which tells Postgres to automatically allocate 1/32 of your **`shared_buffers`** (up to 16MB). For almost all workloads, this auto-tuning is perfect. However, if the `wal_buffers_full` counter in `pg_stat_wal` keeps climbing, increasing this can prevent the engine from stalling while waiting for the WAL writer to catch up.

> [!IMPORTANT]
> **Summary of Commit Levels**:
> - **`on`**: The default. Wait for local disk flush.
> - **`off`**: Do not wait for disk. Best for performance where minor data loss is acceptable.
> - **`local`**: Used in replication (see Chapter 8). Wait for local flush but not replica flush.
> - **`remote_write`**: Wait for replica to receive data but not necessarily flush it.
> - **`remote_apply`**: Wait for replica to physically apply the change to its own files.

---

## 5.2 - Crash Recovery (The Recovery Parade)

<img src="assets/arch_recovery_parade.png" width="250" style="float: left; margin: 0 20px 20px 0;" />

A power loss in the middle of a write is the worst-case scenario for any database. The CPU stops mid-instruction, RAM is wiped, and the on-disk data files may contain a mixture of old and new bytes. When Postgres comes back up, it has no idea which transactions had committed, which had aborted, and which 8KB heap pages were halfway through being rewritten.

The architectural payoff of the **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Durable Ledger)|Write-Ahead Log]]** is precisely this scenario: every change was recorded sequentially in the WAL *before* it was applied to the heap, so the recovery process is just a matter of replaying the log forward from a known good point. The architectural cost is the bookkeeping required to find that point and to know when it is safe to stop.

### `pg_control`: The Survival Blueprint

Postgres begins every startup — clean or otherwise — by reading a single small file, `PGDATA/global/pg_control`. This is the cluster's control block, and it answers the three questions the engine needs to recover:

| Field | What it tells the engine |
| :------------------- | :-------------------------------------------------------------------------------------- |
| `state` | Was the previous shutdown clean (`DB_SHUTDOWNED`) or interrupted (`DB_IN_PRODUCTION`)? |
| `checkPoint` | The LSN of the last successful **Checkpoint** record in the WAL |
| `minRecoveryPoint` | On replicas: the LSN beyond which the database is guaranteed to be consistent |

If `state` is anything other than `DB_SHUTDOWNED`, the engine knows a crash occurred and immediately begins recovery rather than opening for connections.

> [!IMPORTANT]
> **Why `pg_control` itself can't be torn**: the file's payload fits within a single 512-byte disk sector, so the OS can write it atomically, and Postgres rewrites it only at well-defined moments (checkpoint completion, cluster shutdown). A torn `pg_control` would render the cluster unrecoverable, so the engine treats it as the one byte sequence it absolutely cannot lose.
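You don't have to parse the file by hand; Postgres exposes the control block through SQL. A quick sketch (the LSNs and identifiers will of course differ on your cluster):

```sql
-- The checkpoint bookkeeping the engine will consult after a crash
SELECT checkpoint_lsn, redo_lsn, timeline_id, checkpoint_time
FROM pg_control_checkpoint();

-- Cluster identity and control-file version
SELECT system_identifier, pg_control_version
FROM pg_control_system();
```

There is also a command-line twin, `pg_controldata`, which dumps the same fields, including `state`.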
### The Redo Loop

Recovery starts at the **Checkpoint LSN** stored in `pg_control` and walks the WAL forward, one `XLogRecord` at a time. For each record, the engine:

1. Identifies the affected page using the record's `RelFileNode` and `BlockNumber`.
2. Reads the page from disk into shared buffers.
3. Compares the page's `pd_lsn` (the LSN of the last WAL record already reflected in this page) against the WAL record's LSN.
4. If `pd_lsn < record.lsn`, applies the change and updates `pd_lsn`. Otherwise, the change was already on disk before the crash, and the record is skipped.

This LSN comparison is the entire correctness argument: every page carries a watermark, every WAL record carries a position, and replay only happens when the page is provably behind. Recovery can be re-run any number of times and will produce the same result — the operation is **idempotent**.

```text
WAL: ... [LSN 0/3A12] [LSN 0/3A48] [LSN 0/3A90] [LSN 0/3B04] ...
              ^                                       ^
   pg_control.checkPoint                         end of WAL
              \________________ replay ______________/
```

When the engine reaches the end of the WAL (or `minRecoveryPoint` on a replica), recovery is complete and the cluster opens for connections.

### What Happens to Uncommitted Work

Recovery replays *every* record in the WAL — including changes from transactions that never reached `COMMIT`. That sounds dangerous, but the **[[Manuscript/05 - Durability & Transactions/5.4 - Transactions (The Atomic Seal)|Commit Log (CLOG)]]** is the safety net.

When recovery encounters a tuple stamped with `xmin = 47291`, the engine consults the CLOG. If transaction 47291 never wrote an `XLOG_XACT_COMMIT` record before the crash, its CLOG entry remains `IN_PROGRESS`, and the standard MVCC visibility rules treat the tuple as if it never existed. The bytes are physically present on the page, but they are invisible to every future query — and **[[Manuscript/06 - Resource Management & Processes/6.4 - Vacuum & Freezing (The Housekeepers)|Autovacuum]]** will eventually reclaim them as bloat.

> [!NOTE]
> **No UNDO log required**: many databases (Oracle, MySQL/InnoDB) maintain a separate UNDO log to roll back uncommitted changes during recovery. Postgres avoids this entirely by leaning on MVCC — uncommitted writes don't need to be rolled back, just left invisible. The trade-off is the bloat that accumulates from aborted transactions.
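You can watch this invisibility in action. A minimal sketch, assuming the cafe's `ingredients` table from the earlier examples (the `pg_stat_user_tables` counters are approximate and update asynchronously):

```sql
BEGIN;
INSERT INTO ingredients (name, category) VALUES ('Ghost Pepper', 'Spice');
ROLLBACK;

-- The aborted row is invisible to every query...
SELECT count(*) FROM ingredients WHERE name = 'Ghost Pepper';  -- 0

-- ...but its bytes still occupy the page as a dead tuple until vacuum reclaims them.
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'ingredients';
```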
### Torn Pages and Full Page Writes

There is one scenario the redo loop alone cannot handle: a page that was halfway written when the power died. Half of its 8KB has the new bytes, half has the old, and the page is no longer internally consistent — the line pointers may reference a garbage tuple, the checksum will not match.

Postgres prevents this with **Full Page Writes (FPW)**. The first time a page is modified after a checkpoint, the engine copies the entire 8KB page into the WAL alongside the change record. During recovery, when a torn page is encountered, the engine ignores the on-disk version and restores the clean copy from the WAL. (See **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Durable Ledger)|5.1 WAL & fsync]]** for the FPW mechanics.)

### What You'll See in the Log

When recovery runs, the server log records the journey:

```text
LOG:  database system was interrupted; last known up at 2026-04-27 21:43:18 UTC
LOG:  database system was not shut down cleanly; automatic recovery in progress
LOG:  redo starts at 0/3A000028
LOG:  redo done at 0/3B0479F8 system usage: CPU: user 0.04 s, system 0.01 s, elapsed 0.18 s
LOG:  database system is ready to accept connections
```

These log lines describe exactly what happened: the engine noticed the unclean shutdown, started replaying the WAL at the last checkpoint LSN, finished replaying it, and opened for business. On a healthy cluster with frequent checkpoints, this completes in seconds even after a kernel panic.

The whole architecture is a single bargain: the engine writes its intentions sequentially before it touches the data, and in exchange it gets to defer the messy work of disk reorganization indefinitely. When the lights go back on, the WAL is the only thing that has to be true. Everything else can be reconstructed from it.

---

## 5.3 - Logical Replication (The Relay Chain)

<img src="assets/arch_logical_replication_crier.png" width="250" style="float: left; margin: 0 20px 20px 0;" />

Physical replication ships the **bytes** of every WAL record. The replica is a binary clone of the primary, locked to the same Postgres major version, the same page layout, and the same set of relations. That is wonderful for high-availability replicas — and useless when you need to feed a single table into a downstream warehouse, run a major-version upgrade with zero downtime, or replicate between clusters that don't share a binary footprint.

**Logical Replication** solves these cases by shipping the **meaning** of each change rather than the physical delta. The architectural payoff is that the source and target can differ in version, in physical layout, even in extension set; the architectural cost is decoder CPU and replication-slot disk pressure on the publisher.
### Physical vs. Logical: the architectural difference

| Aspect | Physical Replication | Logical Replication |
| :------------------ | :-------------------------------- | :------------------------------------------- |
| What ships | Raw 8KB page-level WAL records | Row-level change events ("INSERT into `dishes` (...)") |
| Granularity | Entire cluster | Per-table (via publications) |
| Cross-version | No (same major required) | Yes |
| Cross-architecture | No (same byte order) | Yes |
| Target writability | Read-only standby | Independent primary that may also accept writes |
| Network volume | Higher (full pages on FPW) | Lower (just the changed row) |

### Logical Decoding

The engine produces logical change events through a four-stage pipeline running inside the **WAL Sender** process on the publisher:

1. The Sender reads each `XLogRecord` as it is appended to the WAL.
2. A **Decoding Plugin** (`pgoutput` is the in-tree default) translates the binary delta into a row-shaped event keyed by `(schema, table, action, before, after)`.
3. Events are buffered until the originating transaction commits — uncommitted work is never published, so subscribers never see a row that later rolls back.
4. The committed batch is streamed over the replication protocol to each connected subscriber.

> [!NOTE]
> **`pgoutput` and `wal2json`**: most subscribers use `pgoutput` because it speaks the Postgres-native binary protocol. Heterogeneous targets (Kafka, Debezium, Datadog, etc.) typically use `wal2json` or a custom plugin that converts the same event stream into JSON.

### Publications and Subscriptions

Logical replication is configured declaratively. The publisher declares a **Publication** — a named subset of tables to expose:

```sql
-- On the publisher (the London branch)
CREATE PUBLICATION cafe_menu FOR TABLE dishes, dish_ingredients;
```

The subscriber connects with a **Subscription**, which creates a background **Apply Worker** that consumes the change stream and re-executes each event locally:

```sql
-- On the subscriber (the New York branch)
CREATE SUBSCRIPTION ny_branch_sync
    CONNECTION 'host=london-cafe port=5432 user=replicator dbname=elephant_cafe'
    PUBLICATION cafe_menu;
```

Apart from the network hop and the configured publication filter, the subscriber's apply path is indistinguishable from a local client running the same `INSERT`/`UPDATE`/`DELETE`.

### Replication Slots: the durability promise

Without coordination, the publisher would have no way to know which WAL the subscriber has already consumed. WAL segments are normally recycled aggressively to avoid filling the disk — and a recycled segment is gone forever. If the subscriber is offline when its segment is recycled, replication breaks permanently and the only recovery path is a full re-snapshot.

A **Replication Slot** is a small persistent record in `pg_replication_slots` that pins WAL retention to the subscriber's progress. As long as the slot exists, the publisher refuses to recycle any WAL segment newer than the slot's `restart_lsn`, even if the subscriber has been offline for days.

```sql
-- Inspect slot state on the publisher
SELECT slot_name, slot_type, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

The `retained_wal` column is the operational metric to watch. If it grows without bound, the subscriber has stopped acknowledging — and the publisher will eventually run out of disk.

> [!CAUTION]
> **Stale slots are a production hazard**. A subscription that is dropped from the subscriber side without first dropping the slot on the publisher will pin WAL retention indefinitely. The publisher's disk will fill, the cluster will refuse new writes, and recovery requires manually issuing `SELECT pg_drop_replication_slot('...')` on the publisher. Always include slot health in your monitoring dashboard.
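A triage sketch for exactly that hazard: find inactive slots and the WAL they pin, then release any slot you have confirmed abandoned (assuming the slot kept its default name, which is the subscription's name):

```sql
-- Inactive slots, worst offenders first
SELECT slot_name, slot_type,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

-- Release a confirmed-dead slot and let the pinned WAL recycle
SELECT pg_drop_replication_slot('ny_branch_sync');
```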
### The Cost of Decoding

Logical replication is not free; it costs CPU, memory, and I/O on both sides:

- **Publisher CPU**: every committed transaction is decoded into row events. For write-heavy workloads with wide rows, decoder CPU can be a significant fraction of total backend time. The decoding `walsender` backends are visible in `pg_stat_activity`.
- **Publisher Memory**: large transactions accumulate in the `logical_decoding_work_mem` buffer until commit. Setting this too low forces premature spill-to-disk; setting it too high risks OOM under concurrent large transactions.
- **Subscriber CPU & I/O**: the apply worker is a single backend re-executing the change stream. It is bounded by single-core throughput, which means a write-heavy publisher can outrun a single subscriber. PG16+ supports parallel apply workers per subscription to mitigate this.
- **Network**: per-row events are smaller than physical pages, but every change crosses the wire individually. Cross-region replication is dominated by RTT, not bandwidth.

```sql
-- Subscriber-side wait events worth watching
SELECT pid, wait_event_type, wait_event, state
FROM pg_stat_activity
WHERE backend_type LIKE 'logical replication%';
```

If the worker mostly sits in its idle `Activity: LogicalApplyMain` wait, it is starved for data and the bottleneck is the network or the publisher. If you see `IO: DataFileWrite`, the apply worker is disk-bound and you need either faster storage or parallel apply.

### When to Use It

Logical replication is the right tool when you need any of the following:

1. **Major-version upgrades with zero downtime**: replicate from PG 16 to PG 18, cut over the application, drop the slot.
2. **Selective replication**: ship only the `orders` and `dishes` tables to a reporting cluster, leave the rest of the schema behind.
3. **Cross-cloud or cross-region warehouse feeds**: the subscriber is often a Kafka/Debezium relay that lands rows in S3, Snowflake, or BigQuery.
4. **Multi-master patterns**: with care (and conflict resolution at the application layer), two clusters can publish into each other.

It is the wrong tool when you need a strict binary clone (use physical streaming) or when you need synchronous durability (logical apply is asynchronous by design).

---

## 5.4 - Transactions (The Atomic Seal)

<img src="assets/arch_atomic_seal.png" width="250" style="float: left; margin: 0 20px 20px 0;" />

A power surge or logic error could interrupt a process halfway through, leaving data in an inconsistent state. To prevent this, Postgres uses **Transactions**—a logical abstraction that ensures a set of actions is an indivisible "Unit of Work." This is the foundation of **Atomicity**: every action in the set succeeds, or the entire set is discarded.

### 🧪 Lab Challenge: The Failed Banquet (Atomicity)

**The Request**: "Babu the Elephant wants to order a 'Peanut Extravaganza' and 100 sides of Hay. We need to record the order and the items together. If the items fail, we don't want a 'ghost' order sitting in the system."

#### The Naive Solution

Sending separate commands without a transaction block.
```sql
-- Step 1: Create the order
INSERT INTO orders (id, animal_id, order_time, status)
VALUES (999, 1, NOW(), 'Pending');

-- Step 2: Add the items (Wait, the stock is empty!)
INSERT INTO order_items (order_id, dish_id, quantity)
VALUES (999, 50, 100);
-- ERROR: quantity exceeds stock constraint
```

#### The Fallout

If Step 2 fails, Step 1 has already "happened." You now have an `order` record with ID 999 but no items. The waiter thinks an order is coming, the kitchen sees nothing, and Babu is left hungry. Your data is **inconsistent**.

#### The Lazy Fix

Wrap the operations in a transaction block using **`BEGIN`** and **`COMMIT`**.

```sql
BEGIN;
INSERT INTO orders (id, animal_id, order_time, status)
VALUES (1000, 1, NOW(), 'Pending');

-- This will fail and trigger a ROLLBACK
INSERT INTO order_items (order_id, dish_id, quantity)
VALUES (1000, 50, 100);
COMMIT;
```

#### The Reward

Postgres detects the failure in the second statement. Because the commands are in the same transaction, the engine performs a **ROLLBACK**. It updates the CLOG status for the transaction's XID to `10` (Aborted). Even though the first insert succeeded in memory, the engine treats those bytes as invisible. The transaction is fully discarded.

---

While the **WAL** records physical changes, the **Commit Log (CLOG)** records the logical status of every transaction. Stored in the `pg_xact` directory, the CLOG allocates exactly **2 bits** for every Transaction ID (XID):

* **`00`**: In-Progress
* **`01`**: Committed
* **`10`**: Aborted
* **`11`**: Sub-committed

> [!IMPORTANT]
> **The Atomicity Checkpoint**: If you remember one thing about transactions, let it be this: **Postgres commits by flipping exactly two bits.** Whether your transaction updated one row or a billion, the moment those bits in the CLOG flip from `00` to `01`, the entire world sees your changes. Atomicity is not a slow crawl; it is an instantaneous binary switch.

This enables instantaneous finality. The engine does not finalize individual tuples during the `COMMIT` command; it simply updates the status in the `pg_xact` file. All subsequent visibility checks rely on these bits to determine if data is visible.

### The WAL Seal: Making it Real

The flip in the CLOG is the logical finality, but the **WAL Commit Record** (`XLOG_XACT_COMMIT`) is the physical authority. Before Postgres updates the CLOG, it must first write this commit record to the **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Durable Ledger)|Write-Ahead Log]]** and ensure it is physically `fsync`'d to disk. Once that single record is on disk, the transaction is durable. If the power fails before the CLOG is updated, Postgres will find the commit record in the WAL during recovery and finish the job.

Uncommitted data can reach the disk before a transaction completes. Because memory is finite, the **Background Writer** may flush "dirty" pages to disk to free up space in the **Shared Buffer Pool**. This is safe because of the "Write-Ahead" rule. The engine ensures that the WAL records describing a change are persisted to disk *before* the dirty data page. If the transaction rolls back, Postgres does not need to revert those bytes: the CLOG records the abort, and MVCC visibility rules treat the on-disk changes as if they never happened. This allows the engine to handle transactions that exceed the size of available RAM.

Checking the CLOG for every tuple in a large table scan would be a performance bottleneck. To avoid this, Postgres uses a specialized caching mechanism at the tuple level: **Hint Bits**—status flags such as `HEAP_XMIN_COMMITTED`, cached directly in the tuple's `t_infomask`. The first time a process visits a tuple after its parent transaction finishes, it looks up the status in the CLOG and stamps the tuple header with a hint bit: committed or aborted. Subsequent visitors read the status directly from the tuple header, bypassing the CLOG lookup.

> [!NOTE]
> **The Write Footprint**: this is why even a `SELECT` query can occasionally trigger a write to disk. If the engine needs to set a hint bit on a previously clean page, that page becomes Dirty, requiring an eventual flush.
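You can watch hint bits appear with the `pageinspect` extension. A sketch using the cafe's `ingredients` table; the constant `256` is the `HEAP_XMIN_COMMITTED` bit (`0x0100`) from the tuple header layout, used here purely for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Decode each tuple on the table's first heap page
SELECT lp, t_xmin,
       (t_infomask & 256) <> 0 AS xmin_committed_hint
FROM heap_page_items(get_raw_page('ingredients', 0));
```

Insert a row, commit, and run the query: the hint is still unset. `SELECT` the new row once through a normal scan, and the flag flips on; the CLOG answer has been cached in the tuple itself.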
The atomic seal is fundamentally two bits in a file. The difference between a transaction that occurred and one that never existed is the state of these bits in `pg_xact`. The rest of the engine's architecture—WAL records, hint bits, and visibility checks—ensures that these two bits represent the consistent state of the database.

---

## 5.5 - Isolation (The Looking Glass Windows)

<img src="assets/arch_isolation_windows.png" width="250" style="float: left; margin: 0 20px 20px 0;" />

Postgres allows dozens of transactions to execute concurrently on the same tables. The engine must decide, for every query, which modifications from concurrent transactions are visible. This decision is governed by the **Isolation Level**, which is implemented through a structure called a **Snapshot**.

Isolation levels provide the architectural trade-off between concurrency and consistency. The same table, read by two transactions simultaneously, can return different consistent results without compromising data integrity.

### The Anatomy of a Snapshot

A Snapshot defines the "visibility horizon" for a transaction. When the engine takes a snapshot, it captures three critical Transaction IDs:

* **`xmin`**: The lowest XID that was still active when the snapshot was taken. All transactions older than `xmin` are guaranteed to be committed or aborted.
* **`xmax`**: The next XID to be assigned. All transactions with XIDs greater than or equal to `xmax` are invisible to this snapshot.
* **`xip_list`**: The list of XIDs that were active (in-progress) at the moment the snapshot was taken.

By comparing these values to a **Tuple's** header (**`t_xmin`** and **`t_xmax`**), the engine can determine visibility in microseconds.

#### 1. Read Committed (The Default Level)

This is the default isolation level. In this mode, Postgres takes a **new snapshot for every query** within a transaction.

### 🧪 Lab Challenge: The Double-Order Race (Read Committed)

**The Request**: "Two waiters are looking at the price of 'Saffron'. We need to see what happens if one updates it while the other is still looking."

#### The Setup (Two Sessions)

Open two terminal windows (`psql`).

**Session A (Manager)**:

```sql
BEGIN;
-- Step 1: Check the price
SELECT price FROM ingredients WHERE name = 'Saffron';
-- Returns 50.00
```

**Session B (Supplier)**:

```sql
BEGIN;
-- Step 2: Update the price
UPDATE ingredients SET price = 99.00 WHERE name = 'Saffron';
COMMIT;
```

#### The Fallout

Go back to **Session A** and run the query again:

```sql
SELECT price FROM ingredients WHERE name = 'Saffron';
-- Returns 99.00!
```

In **Read Committed** mode, Session A saw the supplier's change mid-transaction because it took a fresh snapshot for the second query. This phenomenon is known as a **Non-repeatable Read**.

---
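Before the next experiment, it helps to look at a snapshot directly. `pg_current_snapshot()` (PostgreSQL 13+) prints the `xmin:xmax:xip_list` triple described above; the values shown here are illustrative:

```sql
BEGIN;
SELECT pg_current_snapshot();
--  pg_current_snapshot
-- ---------------------
--  748:752:748,750        <- xmin : xmax : in-progress XIDs
COMMIT;
```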
### 🧪 Lab Challenge: The Waiter's Paradox (Repeatable Read)

**The Request**: "Ensure that once a manager starts a stock audit, the prices they see remain frozen in time, even if a supplier updates them."

#### The Lazy Fix

Use the **Repeatable Read** isolation level.

**Session A (Manager)**:

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT price FROM ingredients WHERE name = 'Saffron';
-- Returns 50.00
```

**Session B (Supplier)**:

```sql
UPDATE ingredients SET price = 150.00 WHERE name = 'Saffron';
```

#### The Reward

Go back to **Session A** and check again:

```sql
SELECT price FROM ingredients WHERE name = 'Saffron';
-- Returns 50.00
```

Session A is now using a **Consistent Snapshot**. It uses the same snapshot from the start of the transaction, ignoring any modifications committed by concurrent transactions.

#### The Conflict

Now try to update the price in **Session A**:

```sql
UPDATE ingredients SET price = 125.00 WHERE name = 'Saffron';
-- ERROR: could not serialize access due to concurrent update
```

Postgres protects the integrity of the data. Since the record was changed by another session after Session A's snapshot was taken, the engine refuses to let Session A overwrite "invisible" history. It forces a rollback.

---

### 🧪 Lab Challenge: The Chef's Duty (Serializable)

**The Request**: "At least one chef must always be on call. We have two chefs. If two managers try to take one chef off-call at the same time, we must not end up with zero chefs."

#### The Naive Solution (Repeatable Read)

In Repeatable Read, both managers would see `count = 2`. Both would update their respective chef to `on_call = false`. Since they are updating *different* rows, there is no row-level lock conflict. Both would succeed, leaving zero chefs. This is **Write Skew**.

#### The Lazy Fix

Use **Serializable**.

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM staff WHERE on_call;
-- Returns 2
UPDATE staff SET on_call = false WHERE name = 'Chef A';
COMMIT;
```

#### The Reward

If another manager does the same for 'Chef B' at the same time, Postgres tracks the **Predicate Dependency**. The engine realizes that the second transaction's logic was based on data that the first transaction modified. Postgres throws a **Serialization Error** on the second commit to prevent the anomaly.

> [!NOTE]
> **Snapshot Persistence and Vacuum**: Long-running transactions in `REPEATABLE READ` or `SERIALIZABLE` mode prevent `VACUUM` from cleaning up old tuple versions. Because the snapshot must remain consistent, the engine cannot reclaim space until the transaction closes and the "xmin horizon" can advance.

---

### Choosing Your Window

You can tell the engine exactly which window you want at the start of each transaction:

```sql
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
```

Stricter isolation levels require more system resources for state tracking. They also increase the likelihood of serialization failures, which require the application to retry the transaction.

---

## 5.6 - Transaction ID Wraparound (The Infinite Calendar)

<img src="assets/arch_xid_wraparound.png" width="250" style="float: left; margin: 0 20px 20px 0;" />

Postgres is designed to run forever — but the integer it uses to label transactions is not. Every transaction gets a 32-bit **Transaction ID (XID)**, and a 32-bit counter only goes up to about 4.2 billion. A long-lived, high-throughput database will, eventually, run out. The engine's solution is to pretend it has more numbers than it does.

Think of the XID space as an **Infinite Calendar**: a wheel of dates that loops back to January 1 every four billion transactions, where "old" and "new" are defined by which way you walk around the wheel rather than by absolute position. The architectural payoff is that the engine never has to widen the XID type. The architectural cost is that someone, somewhere, has to rewrite very old rows to keep them on the right side of the wheel — and if that maintenance ever stops, the database tears itself apart.

### The Visibility Math

To decide if transaction A came before or after transaction B, Postgres subtracts the two IDs and casts the result to a **signed 32-bit integer**:

- If the result is **negative**, A is in the past.
- If the result is **positive**, A is in the future.

This means Postgres can only distinguish about **2 billion** transactions into the past. Beyond that, the math flips: a transaction from seventy years ago suddenly looks like a transaction from the distant future, and rows that should be visible become invisible (or vice versa). That's the actual failure mode of XID wraparound — it isn't an overflow exception, it's silent corruption of visibility.
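A toy model makes the wheel concrete. This sketch simulates C's signed 32-bit subtraction in plain SQL; `xid_precedes` is an illustrative helper, not an engine API:

```sql
-- "a precedes b" iff the signed 32-bit difference (a - b) is negative
-- (the constants re-center the modulo into the signed 32-bit range)
CREATE FUNCTION xid_precedes(a bigint, b bigint) RETURNS boolean AS $$
  SELECT (((a - b) % 4294967296 + 6442450944) % 4294967296) - 2147483648 < 0
$$ LANGUAGE sql;

SELECT xid_precedes(100, 200);         -- true: 100 is in 200's past
SELECT xid_precedes(4000000000, 50);   -- also true(!): after wrapping, XID 50 is
                                       -- ~295 million transactions "ahead" of 4 billion
```

Swap the arguments and both calls return false: the wheel has no global past or future, only relative position.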
> [!NOTE]
> **Why not just use 64-bit XIDs?** Two reasons. First, XIDs are stored *in every tuple header* (`xmin` and `xmax`), so doubling their width adds 8 bytes to every row in every table. Second, the engine maintains a per-cluster status array (the **CLOG**) that records the commit/abort status of every XID in two bits; a 64-bit XID space would explode the range it has to address. Postgres pays the freeze tax instead. There's ongoing core work on a 64-bit-XID variant (e.g., the `xid8` type for some session APIs, and out-of-tree forks that switch tuple headers entirely), but the in-tree storage format remains 32-bit.

### The FREEZE Ritual

To prevent the wheel from biting itself, the engine performs a ritual called **FREEZE**. When a tuple is old enough that it is guaranteed to be visible to every conceivable future transaction, **Autovacuum** sets the `HEAP_XMIN_FROZEN` flag in the tuple's `t_infomask` and (depending on Postgres version) rewrites `xmin` to the reserved value **`2` (`FrozenTransactionId`)**.

The reserved XID `2` is hard-coded in the source as older than every possible non-reserved XID. Once a row is frozen, it is permanently visible to every future transaction regardless of how many times the calendar has wrapped since.

The engine tracks freezing progress at three levels:

| Level | Catalog | What it records |
| :------- | :---------------------------- | :----------------------------------------------------------- |
| Cluster | `pg_control` → `oldestXid` | Globally oldest unfrozen XID across all databases. |
| Database | `pg_database.datfrozenxid` | Oldest unfrozen XID anywhere in the database. |
| Table | `pg_class.relfrozenxid` | Oldest unfrozen XID in this relation. Drives autovacuum priority. |

Each `VACUUM (FREEZE)` advances `relfrozenxid` for the table; the database's `datfrozenxid` is the minimum across its tables; the cluster's `oldestXid` is the minimum across databases. The wraparound horizon is measured against `oldestXid`.

#### Monitoring

```sql
-- Per-database distance to the wraparound horizon
-- (underscore digit separators in literals require PostgreSQL 16+)
SELECT datname,
       age(datfrozenxid) AS xid_age,
       2_000_000_000 - age(datfrozenxid) AS xids_remaining
FROM pg_database
ORDER BY xid_age DESC;
```

```sql
-- Per-table view: which relations does autovacuum need to chase?
SELECT n.nspname || '.' || c.relname AS relation,
       age(c.relfrozenxid) AS xid_age,
       pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','m','t')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
```
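If a table tops that list, you don't have to wait for autovacuum. A sketch of a targeted manual freeze, using the cafe's `orders` table as a stand-in:

```sql
SELECT age(relfrozenxid) FROM pg_class WHERE relname = 'orders';  -- e.g. 180 million
VACUUM (FREEZE, VERBOSE) orders;
SELECT age(relfrozenxid) FROM pg_class WHERE relname = 'orders';  -- near zero again
```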
#### The Threshold Ladder

Postgres reacts at three escalating thresholds, each tunable via GUC, plus one hard stop:

| Threshold (default) | Engine behavior |
| :---------------------------------------- | :----------------------------------------------------------------------------------------------- |
| `vacuum_freeze_min_age` (50M) | Tuples older than this become eligible for freezing during ordinary autovacuum. |
| `vacuum_freeze_table_age` (150M) | Autovacuum upgrades from "scan only the visibility-map-marked pages" to a full-table scan. |
| `autovacuum_freeze_max_age` (200M) | A wraparound-prevention vacuum is triggered, regardless of dead-tuple thresholds. Cannot be disabled per-table. |
| Hard limit: **age ≥ 2,000,000,000** | Engine refuses new XIDs cluster-wide, returns `database is not accepting commands` to writes, allows superuser-only read-only sessions until manual `VACUUM` advances the horizon. |

---

### The Operational Friction: When Safety Stalls

#### Why Wraparound Vacuums Get Stuck

The single most common reason a healthy autovacuum stops making progress on `relfrozenxid` is a **long-running transaction holding an old snapshot**. A transaction's `xmin` pins the freeze horizon: autovacuum will not freeze any tuple newer than the oldest live snapshot. If a session has been sitting idle in a transaction for hours, every table's `relfrozenxid` is stuck at that session's start. Diagnose with:

```sql
SELECT pid, age(backend_xmin) AS xmin_age, state, query_start, wait_event, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xmin_age DESC
LIMIT 5;
```

The fix is rarely "tune autovacuum harder" — it is to find and terminate the transaction that is holding the snapshot. Common offenders are: idle-in-transaction app connections, abandoned `pg_dump` runs, and replication slots whose subscribers have stopped advancing.

#### Recovering From the Panic Zone

If the cluster crosses the 2-billion threshold, single-user mode is the escape route:

```bash
# Stop the cluster, then run from the postgres OS user:
postgres --single -D /var/lib/postgresql/18/data postgres

# In single-user mode, run a database-wide freeze:
backend> VACUUM (FREEZE, VERBOSE);
```

Single-user mode bypasses the safety check that refuses connections and allows `VACUUM` to advance `datfrozenxid`. Once the database has been frozen below the 2-billion mark, restart normally.

Wraparound is rare in correctly-monitored clusters, but for massive, high-velocity databases it is the ultimate end boss. Keep your autovacuum healthy, watch for long-running transactions and stale replication slots, and the calendar keeps turning quietly in the background.

By mastering WAL configuration and transaction discipline, you ensure the database's durability guarantees hold under any failure scenario.

---

## 5.7 - Summary (Durability & Transactions)

> Memory is fleeting, but the log is forever. The Write-Ahead Log is the single source of truth that survives a crash. If a transaction is not in the log, it did not happen. If it is in the log, the engine will make it real.

<div style="page-break-after: always;"></div>