# Chapter 5: Durability & Transactions
## 5.0 Safety Without Sweating (The Write-Ahead Log)
![[assets/chap_4_safety.png|450]]
Postgres is built on a simple, uncompromising trust: once a transaction is confirmed, the data must survive any system failure. This reliability is the engine's primary currency. If the system loses power or the operating system crashes, the promises made to the user must remain etched in reality.
To understand how Postgres maintains this trust, we must first look at the most intuitive—and most flawed—way to persist data.
### The Naive Model: Direct Page Updates
In a naive world, every time you update a **Tuple** (a record's physical suitcase), Postgres would immediately find the relevant **8KB [[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Page]]** (the shipping container) on disk and overwrite it with the new data.
It is an obvious strategy, but in production, it is an architectural catastrophe.
### Why Direct Persistence Fails
There are three fundamental reasons why Postgres cannot update its data files directly for every change:
1. **The Random I/O Speed Trap**: Operating systems and storage hardware are significantly faster at writing one continuous stream of data (**Sequential I/O**) than jumping between isolated locations on disk (**Random I/O**). Erasing and re-writing an entire 8KB page just to change a single 10-byte tuple is like repainting an entire warehouse because of a smudge on a single crate.
2. **The Torn Write Risk**: If the system loses power mid-write, an 8KB page may only be partially updated. The result is a **Torn Write**, where half the page represents the old state and half represents the new. Without a backup, this results in permanent physical corruption.
3. **The Atomic Failure**: Complex operations (like money transfers) often span multiple rows in different pages. If the system crashes after the first page is written but before the second, the database is left in an impossible, inconsistent state.
### The Solution: The Write-Ahead Log (WAL)
To solve this, Postgres never writes directly to its data files during a transaction. Instead, it maintains a sequential, append-only **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Pocket Diary)|Write-Ahead Log (WAL)]]**, or what we conceptually call the **Pocket Diary**.
Instead of moving containers in the cellar, the engine simply scribbles a tiny, append-only note in its diary: *"At LSN 0/1A2B, cross out Sarah's name and write Axolotl."* Because this log is sequential, it is incredibly fast to write. As soon as this small record is flushed to disk (fsync'd), Postgres considers the transaction committed.
This is the foundation of **Durability**. If the power dies, Postgres wakes up, reads the WAL from its last known **Checkpoint**, and re-applies the records into its memory. The actual 8KB pages in the main data files (the "Frozen Pantry") are updated lazily in the background without blocking the user.
### The Atomicity Guarantee (Transactions)
The WAL also enables the **[[Manuscript/05 - Durability & Transactions/5.4 - Transactions (The Atomic Seal)|Atomic Seal (Transaction)]]**. This is the guarantee of **Atomicity**: either every action in a transaction is finalized, or none of them are. By using the diary to track the beginning and end of a related set of changes, the engine can simply "rewind" or ignore any partial promises it finds in the ledger after a crash.
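Here is the seal in miniature: a sketch, with an illustrative `accounts` table standing in for real schema:
```sql
-- Both updates share one Atomic Seal: they land together, or not at all.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE owner = 'Sarah';
UPDATE accounts SET balance = balance + 100 WHERE owner = 'Axolotl';
COMMIT;  -- a single WAL commit record finalizes both changes
```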
Postgres ensures that a scribble in a diary is as durable as a bolt in a cabinet. This allows the engine to be fast while remaining indestructible—giving us the peace of mind to trust our data to the massive, sometimes chaotic clouds where our modern Cafe lives.
---
## 5.1 The Pocket Diary (WAL & fsync)
![[assets/arch_wal_diary.png|450]]
In the database, the **Disk** (the Filing Cabinet) is the ultimate destination for data. But as we've seen, writing directly to disk is slow and physically hazardous.
Postgres’s primary mechanism for ensuring durability without sacrificing performance is the **[[Manuscript/05 - Durability & Transactions/5.0 - Write-Ahead Log (Safety Without Sweating)|Write-Ahead Log (WAL)]]**.
### The Sequential Scribble
When Postgres processes a change—whether an `INSERT`, `UPDATE`, or `DELETE`—it does not immediately update the primary data files. Instead, it generates a record in the WAL, scribbling down exactly what it is about to do.
Every individual change is encapsulated in an **`XLogRecord`**. This is a structured binary header followed by a payload that describes the physical delta.
### Anatomy of an `XLogRecord`
```text
┌────────────────────────────────────────────────────────────┐
│               XLogRecord (A Single Scribble)               │
├──────────────┬─────────────┬─────────────┬─────────────────┤
│  xl_tot_len  │   xl_xid    │   xl_prev   │     xl_rmid     │
├──────────────┴─────────────┴─────────────┴─────────────────┤
│                        PAYLOAD DATA                        │
└────────────────────────────────────────────────────────────┘
```
- **`xl_tot_len`**: The total size of the record (including the header).
- **`xl_xid`**: The Transaction ID to which this change belongs.
- **`xl_prev`**: A 64-bit pointer back to the previous record's LSN, ensuring the log is a contiguous chain.
- **`xl_rmid`**: The **Resource Manager ID**. This identifies which specialized kernel (Heap, B-Tree, GIN) knows how to interpret the payload.
- **Payload**: The raw binary "delta"—e.g., "In Page 88, at Offset 24, insert these 40 bytes."
### The Log Sequence Number (`LSN`)
Every WAL record is stamped with a **Log Sequence Number (LSN)**. You can think of the LSN as an absolute coordinate system for the Postgres universe.
It is a 64-bit integer representing a specific byte-offset from the beginning of the database's log history. Because Postgres only ever appends to the WAL, the LSN is permanently increasing. It is the heartbeat of the engine.
```sql
-- Check the current heartbeat
SELECT pg_current_wal_lsn(); -- Result: 0/16A5E88
-- Make a tiny promise (Insert a peanut)
INSERT INTO ingredients (name, category) VALUES ('Peanut', 'Nut');
-- Check the heartbeat again
SELECT pg_current_wal_lsn(); -- Result: 0/16A5FA0 (It moved!)
```
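Because LSNs are byte offsets, you can even measure how much ink that promise used. A quick sketch, reusing the illustrative LSNs from above:
```sql
-- Distance between the two heartbeats, in bytes of WAL
SELECT pg_wal_lsn_diff('0/16A5FA0', '0/16A5E88');  -- 280
```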
### The Hardware Handshake (`fsync`)
A record is only safe once it is physically persisted to the disk. If the engine writes the record to the **WAL Buffer** (the mental scratchpad) and the power fails, the change is lost.
To prevent this, Postgres uses **`fsync()`**—the **Hardware Handshake**. Before a transaction is considered "committed," the WAL Writer process issues an `fsync()` command to the **Operating System Kernel**, essentially saying: *"Do not tell me you are done until these bits are physically etched into the storage media."*
The database engine then stops everything and waits for the basement worker (the Operating System) to yell back: *"THE BITS HAVE LANDED!"*
> [!TIP]
> **Living Dangerously**: If you are willing to risk a fraction of a second of data for a massive throughput boost, you can perform a "Loose Handshake" by setting `synchronous_commit = off`. See **[[Manuscript/05 - Durability & Transactions/5.1.1 - Commit Tuning (The Loose Handshake)|5.1.1 The Loose Handshake]]** for the technical details.
> [!TIP]
> **Group Commit**: To avoid the massive overhead of a hardware handshake for every single order, Postgres uses **Group Commit**.
> The WAL writer can wait tiny fractions of a second (`commit_delay`) to aggregate multiple concurrent users' records, flushing them all in a single `fsync()` operation.
### Enacting the Promises: Checkpoints and Dirty Pages
If Postgres only writes to the WAL, when does the actual data file—the **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Page]]**—get updated? This is managed by two distinct background processes.
#### 1. The Immediate Edit (Dirty Pages)
While Postgres is writing to the WAL, it *also* updates the page currently sitting in **[[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Warming Rack)|Shared Buffers]]** (the memory prep table). A page in memory that differs from the one on disk is known as a **Dirty Page**.
The database gives you a "Success" message as soon as the WAL is safe, even if the main data file on disk is still out of date.
#### 2. The Checkpointer
Every few minutes (or after enough WAL data has accumulated), the **Checkpointer** process starts a **Checkpoint**. Its job is to guarantee a specific point in time where the data files are synchronized with the WAL.
1. It identifies all **Dirty Pages** in memory.
2. It flushes them to disk.
3. It writes a special **Checkpoint Record** to the WAL.
Once a Checkpoint is complete, Postgres knows it no longer needs the WAL records older than that point for a standard recovery.
#### 3. The Background Writer
The **Background Writer** is the unsung hero of the engine. While the Checkpointer flushes everything in a massive wave, the Background Writer scans for dirty pages constantly and flushes them to disk in small, quiet batches. Its goal is to ensure there are always "clean" pages available in memory so that when a user needs to load new data, they don't have to wait for an expensive disk write.
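You can watch the division of labor between these processes in the statistics views. A sketch of the check; this column layout assumes Postgres 16 or earlier (Postgres 17 moves the checkpoint counters into `pg_stat_checkpointer`):
```sql
SELECT checkpoints_timed,    -- checkpoints started by the timer
       checkpoints_req,      -- checkpoints forced by WAL volume
       buffers_checkpoint,   -- dirty pages flushed by the Checkpointer
       buffers_clean         -- dirty pages flushed by the Background Writer
FROM pg_stat_bgwriter;
```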
### Full Page Writes: The Polaroid Backup
There is a danger: a system crash mid-write could leave an 8KB Page half-updated. Because most hardware writes in 512-byte or 4KB chunks, this results in a **Torn Page**—a corrupted block that cannot be recovered by standard means.
To avoid this, Postgres uses **Full Page Writes** (the Polaroid). The very first time a page is modified after a Checkpoint, Postgres writes the *entire* 8KB page into the WAL before writing the delta.
If a torn page is discovered during recovery, Postgres ignores the corrupted data file and simply re-applies the "Polaroid" image from the WAL.
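The Polaroid habit is controlled by a single setting, which you should verify rather than tune:
```sql
SHOW full_page_writes;  -- 'on' by default; only disable it if your storage
                        -- guarantees atomic 8KB page writes
```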
> [!IMPORTANT]
> **The Truth Hierarchy**:
> 1. **The WAL**: The immediate, absolute Truth.
> 2. **Shared Buffers**: The active, in-memory Truth.
> 3. **Data Files**: The deferred, eventual Truth.
---
## 5.1.1 The Loose Handshake (Commit Tuning)
![[assets/arch_wal_tweak.png|450]]
In Postgres, safety is the most expensive thing you can buy. As we’ve seen, the **`fsync()`** handshake—waiting for the physical disk to confirm it has etched every record—is the primary bottleneck for write-heavy workloads.
If your application needs to handle thousands of orders per second, you must decide how "tight" your handshake needs to be.
### The Loose Handshake: `synchronous_commit`
The most impactful WAL tweak is the **`synchronous_commit`** setting. By default, it is set to `on`, meaning Postgres will not tell a client "Done" until the WAL record is guaranteed to be on physical storage.
But what if you value speed over absolute durability for specific operations?
### `synchronous_commit = off`
When you set this to `off`, you are performing a **Loose Handshake**. Postgres will commit the transaction in memory, give the user a thumbs up, and continue—but it will *not* wait for the WAL record to be `fsync`'d to disk.
The **WAL Writer** process will eventually flush the data in the background (usually within 3x `wal_writer_delay`, or about 600ms by default).
- **The Reward**: Massive throughput gains. On standard hardware, switching to `off` can often triple the number of transactions per second.
- **The Risk**: If the server crashes or the OS fails before the background flush happens, you can lose the most recent commits—up to about 600ms of them (3× `wal_writer_delay`) at the default settings.
- **The Safety Net**: Crucially, setting this to `off` **cannot cause data corruption**. The database remains physically consistent because the WAL records are still written in the correct order—you only risk losing the most recent history.
> [!TIP]
> You don't have to change this globally. You can set `synchronous_commit = off` for a single session or even a single transaction:
> ```sql
> BEGIN;
> SET LOCAL synchronous_commit = off;
> INSERT INTO supply_deliveries (...) VALUES (...);
> COMMIT;
> ```
### The Group Rush: Tuning Group Commit
If you choose to keep safety `on`, you can still optimize performance by helping Postgres aggregate its work. This is managed by two settings:
#### 1. `commit_delay`
This tells the engine to wait for a specific number of microseconds before performing an `fsync()`.
- **The Logic**: It’s like a waiter waiting an extra 5 seconds at the kitchen door to see if any other orders are ready so they only have to make one trip.
- **Value**: Defaults to `0`. Setting it to something like `1000` (1ms) can improve throughput during high-concurrency bursts.
#### 2. `commit_siblings`
This is a guardrail for `commit_delay`. It specifies how many *other* concurrent transactions must be active before `commit_delay` is invoked.
- **The Logic**: There’s no point in waiting if you’re the only person in the Cafe. If there are fewer than `commit_siblings` active transactions, the engine ignores the delay and flushes immediately.
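Put together, a tuning experiment for a bursty, high-concurrency workload might look like the sketch below; the values are illustrative starting points, not recommendations:
```sql
ALTER SYSTEM SET commit_delay = 1000;  -- wait up to 1ms before the fsync...
ALTER SYSTEM SET commit_siblings = 5;  -- ...but only if 5+ other transactions are active
SELECT pg_reload_conf();               -- apply without a restart
```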
### The Scribble Memory: `wal_buffers`
Before WAL records hit the disk, they sit in the **WAL Buffers** (the mental scratchpad).
By default, this is set to `-1`, which tells Postgres to automatically allocate 1/32 of your **`shared_buffers`** (up to 16MB). For almost all workloads, this auto-tuning is perfect. However, if the `wal_buffers_full` counter in the `pg_stat_wal` view keeps climbing, increasing this can prevent the engine from stalling while it waits for the WAL writer to catch up.
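Since Postgres 14, the `pg_stat_wal` view counts exactly this kind of stall:
```sql
-- How many times a backend had to wait because the scratchpad was full
SELECT wal_buffers_full FROM pg_stat_wal;
```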
> [!IMPORTANT]
> **Summary of Commit Levels**:
> - **`on`**: The default. Wait for local disk flush.
> - **`off`**: Do not wait for disk. Best for performance where minor data loss is acceptable.
> - **`local`**: Used in replication (see Chapter 8). Wait for local flush but not replica flush.
> - **`remote_write`**: Wait for replica to receive data but not necessarily flush it.
> - **`remote_apply`**: Wait for replica to physically apply the change to its own files.
---
## 5.2 The Recovery Parade (Crash Recovery)
![[assets/arch_recovery_parade.png|450]]
Imagine a sudden, violent thunderclap. The power to the database flickers and dies. Postgres—the meticulous database engine, in the middle of three different tea-parties—falls asleep instantly.
When the power returns and Postgres wakes up, it has **amnesia**. It doesn't remember what was in its head, what was on its desk (Memory), or which suitcases it was moving. *"Where am I? Who am I? And where is my jam?"* it wonders.
### The Morning After
The Filing Cabinet (the Frozen Pantry) is now in a "dirty" state. Some suitcases are halfway moved, some are missing, and the whole database state is a mess. But Postgres doesn't panic. It reaches into its pocket and pulls out the absolute ground-truth of the database: the **`pg_control`** file.
### The Survival Blueprint (`pg_control`)
The `pg_control` file is the very first thing the engine reads upon waking up. It is a tiny, high-stakes binary file that contains the **Control Data**. It tells Postgres:
- **`state`**: Was the database shut down gracefully, or in a panic?
- **`checkPoint`**: What is the **LSN** of the last successful checkpoint?
- **`minRecoveryPoint`**: How far into the Logbook must I read before the database is consistent?
If `state` is anything other than `DB_SHUTDOWNED`, the database knows a crash occurred and immediately triggers the **Recovery Parade**.
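While the server is running, you can peek at the same control data through SQL (when it is down, the standalone `pg_controldata` utility reads the file directly):
```sql
-- A glimpse inside pg_control (Postgres 9.6+)
SELECT checkpoint_lsn, redo_lsn, timeline_id
FROM pg_control_checkpoint();
```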
The diary survived the storm because of the `fsync` handshake we covered in **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Pocket Diary)|The Pocket Diary]]**.
### The Redo Log
Dinner service must continue, the show must go on! Postgres begins the **Recovery Parade**. It starts at the very last "safe point"—the last **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Pocket Diary)#Enacting the Promises: Checkpoints and Dirty Pages|Checkpoint]]**—and reads every single scribble in the diary from that point forward.
Because every scribble carries a unique **Log Sequence Number (LSN)**, the engine never gets confused about the order of events. Postgres pulls the first `XLogRecord` out of the 16MB segment file, checks its LSN, and walks over to the matching Shipping Container in the Frozen Pantry basement.
When it opens the container, it checks the `pd_lsn` stamped on the page header. If the container's LSN is _older_ than the scribble's LSN, it knows the cabinet missed that update when the power went out!
_"I changed Row 5 to Red..."_ -> Postgres checks Row 5. If it's still Blue, it makes it Red. The meticulous database engine then updates the container's `pd_lsn` to match the scribble, proving it did the work. _"I moved Suitcase 10 to Shelf B..."_ -> Postgres moves the suitcase. It's like a playback of a very confusing movie!
Postgres "redoes" every single successful action that was confirmed before the power went out. As for the tasks that were **not** finished? The database engine simply ignores them!
Because of the **[[Manuscript/05 - Durability & Transactions/5.4 - Transactions (The Atomic Seal)|Commit Log (CLOG)]]**, any data from a crashed session is marked as "In-Progress" or "Aborted." Even if the records physically exist on disk, the engine's visibility rules (MVCC) will treat them as if they never happened. The database returns to a perfectly consistent state, as if the uncommitted actions never existed.
### The Miracle of Consistency
This parade is why Postgres is so famously reliable. You can pull the plug, kick the server, or suffer a catastrophic kernel crash, and as long as the Pocket Diary is intact, Postgres will eventually rebuild the world exactly as it was promised to be.
This recovery process happens automatically every time Postgres starts up. If you see a log message saying "database system was not properly shut down; automatic recovery in progress," don't worry—that's just Postgres finishing its morning reading.
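In the server log, the morning reading looks something like this (LSNs illustrative):
```text
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/16A5E88
LOG:  redo done at 0/16B3D80
LOG:  database system is ready to accept connections
```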
---
## 5.3 The Town Crier (Logical Replication)
![[assets/arch_logical_replication_crier.png|450]]
So far, we’ve talked about a single Postgres instance in a single database. But what if one instance isn’t enough? What if you need a Postgres instance in New York to know exactly what the Postgres instance at the main branch in London is doing?
### Shipping Recipes, Not Bricks
Postgres has a way of sharing its Pocket Diary with the world. This is called **Logical Replication**.
To understand why this is special, you must contrast it with **Physical Replication**.
- **Physical Replication (Streaming)**: This is like **shipping the literal bricks of the building**. You send the exact 8KB shipping containers (Pages) via the **WAL Stream**. It’s incredibly fast, but both databases must be identical binary clones.
- **Logical Replication**: This is like **shipping the recipe**. You don't send the bricks; you send the *intent* of the row change. "In the `dishes` table, change the price of 'Nut Bread' to 15.00."
### The Crier's Scroll: Logical Decoding
To turn binary scribbles into human-readable news, Postgres uses **Logical Decoding**.
1. The **Walsender** process reads a record from the WAL.
2. It passes that record to a **Decoding Plugin** (like `pgoutput`).
3. The plugin translates the internal binary delta into a **Logical Change Stream** that can be sent over the network.
Because you are only shipping the _meaning_ of the changes, the other database can have a completely different architecture or even a newer version of Postgres.
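On the London side, you declare which news gets shouted by creating a **Publication**. A sketch, assuming `wal_level = logical` is already set and reusing the illustrative table names:
```sql
-- London branch publishing changes to its menu tables
CREATE PUBLICATION cafe_news FOR TABLE dishes, ingredients;
```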
### The Listener's Notebook (Subscription)
In New York, another Postgres instance—acting as a **Subscriber**—is ready to listen for news from across the ocean. He has a dedicated worker sitting on his roof with a telescope, listening for the London Crier.
```sql
-- New York branch subscribing to the London menu
CREATE SUBSCRIPTION ny_branch_sync
CONNECTION 'host=london-cafe port=5432 user=replicator password=peanuts dbname=elephant_cafe'
PUBLICATION cafe_news;
```
Every time the worker hears a change—_"London just added a new customer!"_—he writes it down and hands it to the New York instance, which immediately performs the same action in its own database. It’s like a worldwide game of Telephone, but one where everyone actually listens!
This focus on _meaning_ over _bricks_ allows New York to have a completely different filing cabinet layout, or even run a newer version of Postgres, while still keeping up with the news from London. It’s like translating a recipe from English to French; as long as the cake comes out the same, the kitchen layout doesn't matter!
Postgres is very considerate. He doesn't want the New York instance to miss any news! So, he uses a **Replication Slot**—basically a **Heavy Metal Bookmark** stored in the `pg_replslot` directory.
The slot tracks the **`restart_lsn`**: the exact point in the Pocket Diary that New York has finished reading.
As long as a slot is active, the London instance—holding the publication—is legally forbidden from deleting old diary pages (WAL segments) that are newer than that LSN. Even if New York is offline for days, Postgres will keep stacking its old diary volumes in the pantry until the disk is literally overflowing, waiting for the subscriber to catch up.
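You can measure how much diary each bookmark is pinning with a query like this sketch:
```sql
-- WAL retained on the publisher's disk for each bookmark
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```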
### The Cost of Shouting (Physical Resources)
Even the Town Crier needs to be paid in peanuts! Logical replication isn't "free" labor for the database engine; it consumes real resources in both databases.
### On the Source (Publication)
* **CPU (Decoding Sweat)**: The London publisher has to hire a **Logical Decoder** worker. This worker spends all day reading the Pocket Diary and translating "binary scribbles" into "human news." This takes extra CPU cycles every time you write data.
* **Memory**: The Decoder needs a small "Holding Pen" (Buffer) in memory to organize changes before shouting them. If your transactions are huge, this pen might overflow to the Frozen Pantry!
* **Disk (The Storage Hazard)**: As we saw with the **Heavy Metal Bookmark** (Replication Slot), if the news isn't delivered, the database engine's pockets fill with WAL files. Because Postgres is legally obligated to keep the notes until they're read, a "Stale Slot" can eventually crash the entire database by filling up the disk!
### On the Target (Subscription)
* **CPU & I/O (The Busy Scribe)**: The New York subscriber has to actually *perform* the work London told him about. He’s re-running the inserts and updates, sweating just as much as the London instance did.
* **Network (The Messenger)**: Shouting across the ocean takes bandwidth. If you're replicating a billion-row table, your "telescope worker" is going to be very, very busy.
> [!TIP]
> Always monitor your **[[Manuscript/07 - Wait Events & Concurrency/7.0 - Workloads & Locking (The Waiting Game)|Wait Events]]** on the subscriber! If the apply worker spends most of its time waiting on the network rather than applying changes, your connection might be too slow for the news.
> [!WARNING]
> If the New York branch stays offline too long, the database engine's pocket will get stuffed with millions of old diary pages! A "Stale Slot" can eventually fill up the entire database's storage (the Frozen Pantry). Remember to check your bookmarks:
> ```sql
> -- Checking who is currently reading the news
> SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
> ```
This "Logical" connection is the secret sauce for moving data between different systems, upgrading your database without downtime, and eventually building the massive, interconnected networks we'll explore in **[[Manuscript/08 - Distributed Scaling & Clouds/8.0 - Distributed Storage (The Elephant in the Clouds)|Chapter 8: The Cloud Scales]]**.
---
## 5.4 Transactions (The Atomic Seal)
![[assets/arch_atomic_seal.png|450]]
In the database, errors are inevitable. A network failure, a power surge, or a logic error can interrupt a process halfway through. Without a guarantee of integrity, these failures would leave your data in a mangled, inconsistent state—a digital void where money vanishes and orders exist without customers.
To prevent this, Postgres uses **Transactions**—a formal **Atomic Seal** guaranteeing that a set of actions forms one single, indivisible "Unit of Work." This is the foundation of **Atomicity**: either every action in the set succeeds, or none of them do.
### The Atomic Flip: The Commit Log (CLOG)
While the **WAL** records the physical details of *what* changed, the **Commit Log (CLOG)** records the logical status of every transaction. Physically located in the `pg_xact` directory, the CLOG allocates exactly **2 bits** for every Transaction ID (XID):
* **`00`**: In-Progress
* **`01`**: Committed
* **`10`**: Aborted
* **`11`**: Sub-committed
This is how Postgres achieves instantaneous atomicity. Even if a transaction updated millions of **Tuples** across thousands of **Pages**, the engine doesn't finalize them one by one. Instead, it simply flips the 2 bits in the CLOG from `00` to `01`. At that exact millisecond, every change made by that XID becomes logically visible to the entire world.
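You can interrogate the CLOG yourself (Postgres 13+ function names; the XID shown is illustrative):
```sql
SELECT pg_current_xact_id();         -- e.g. 771 (assigns this session an XID)
SELECT pg_xact_status('771'::xid8);  -- 'in progress', 'committed', or 'aborted'
```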
### The WAL Seal: Making it Real
The flip in the CLOG is the logical finality, but the **WAL Commit Record** (`XLOG_XACT_COMMIT`) is the physical authority. Before Postgres updates the CLOG, it must first write this commit record to the **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Pocket Diary)|Write-Ahead Log]]** and ensure it is physically `fsync`'d to disk.
Once that single record is on disk, the transaction is durable. If the power fails before the CLOG is updated, Postgres will find the commit record in the WAL during recovery and finish the job.
### Early Persistence: Moving Data to the Cellar
A common misconception is that uncommitted data never reaches the disk. In reality, Postgres must manage its memory (the **Shared Buffer Pool**) carefully.
Because memory is finite, the **Background Writer** is allowed to move a "dirty" page containing uncommitted data down to the **Disk** (the cellar) to make room for new guest orders.
* **Why it's safe**: Because of the "Write-Ahead" rule, the engine ensures that the WAL records describing the change are physically on disk *before* the dirty page itself is flushed. And if the transaction eventually rolls back, nothing needs to be physically undone: the CLOG simply marks the XID as Aborted, and the MVCC visibility rules hide those tuples as if they never existed. This separation allows the engine to handle transactions that are significantly larger than the available RAM.
### The Physical Stamp: Hint Bits
Checking the CLOG for every single tuple in a 10-million-row scan would be a performance disaster. To avoid this, Postgres uses **Hint Bits** (flags like `HEAP_XMIN_COMMITTED` in the tuple header).
The first time any process visits a tuple after its parent transaction has finished, it looks up the status in the CLOG and immediately **stamps** the tuple's header with a hint bit: `COMMITTED` or `ABORTED`.
Subsequent visitors don't need to check the CLOG; they simply look at the stamp on the suitcase. This "caching" of the transaction status is what allows Postgres to maintain high performance even with massive histories.
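If you're curious, the `pageinspect` extension lets you see the stamps directly; bit `0x0100` of `t_infomask` is the "xmin committed" hint (table name illustrative):
```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- Which tuples on page 0 already carry the COMMITTED stamp?
SELECT lp, t_xmin, (t_infomask & 256) <> 0 AS xmin_committed  -- 256 = 0x0100
FROM heap_page_items(get_raw_page('ingredients', 0));
```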
> [!NOTE]
> **The Lock Footprint**: This is why even a `SELECT` query can occasionally trigger a write to disk. If the engine needs to "stamp" a hint bit on a page that was previously clean, it marks the page as **Dirty**, requiring an eventual flush to Disk.
---
## 5.5 The Looking Glass Windows (Isolation Levels)
![[assets/arch_isolation_windows.png|450]]
Now that we understand the physical mechanism of the [[Manuscript/05 - Durability & Transactions/5.4 - Transactions (The Atomic Seal)|Atomic Seal]], we need to discuss how Postgres chooses which changes are visible to which users.
When you enter a transaction, Postgres must decide exactly how much of the table heap you are allowed to see. This is the **Isolation Level**, and it is implemented using a technical structure called a **Snapshot**.
### The Anatomy of a Snapshot
A Snapshot is the engine's way of defining "The Present." When a snapshot is taken, it captures three critical numbers:
* **`xmin`**: The lowest XID that was still active when the snapshot was taken. Anything older than this is guaranteed to be finished (committed or aborted).
* **`xmax`**: The next XID to be assigned. Anything newer than this is guaranteed to be invisible.
* **`xip_list`**: A list of specific XIDs that were active (in-progress) at the moment the snapshot was taken.
By comparing these three numbers to a **Tuple's** header (**`t_xmin`** and **`t_xmax`**), the engine can decide in microseconds if a suitcase is visible to you.
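You can look at your own snapshot at any moment (Postgres 13+; older versions call it `txid_current_snapshot()`):
```sql
-- Output format is xmin:xmax:xip_list
SELECT pg_current_snapshot();  -- e.g. 748:751:748,750
```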
#### 1. Read Committed (The Standard Window)
This is the default isolation level. In this mode, Postgres takes a **new snapshot for every single query** you run.
##### Example: The Blink Test
Imagine you are a manager checking the stock of Saffron. While you are looking, another process updates the price.
| Time | Session A (Manager) | Session B (Supplier) |
| :--- | :--- | :--- |
| T1 | `BEGIN;` | |
| T2 | `SELECT price FROM ingredients WHERE id = 5;` <br> -- Returns 50.00 | |
| T3 | | `UPDATE ingredients SET price = 99.00 WHERE id = 5;` |
| T4 | | `COMMIT;` |
| T5 | `SELECT price FROM ingredients WHERE id = 5;` <br> **-- Returns 99.00** | |
| T6 | `COMMIT;` | |
In **Read Committed**, Session A "blinked" and the world changed between its first and second query. Because T5 took a fresh snapshot, it saw Session B's committed work. This is the **Non-repeatable Read**.
#### 2. Repeatable Read (The Magic Painting)
If you need consistency across multiple queries, you ask for **Repeatable Read**. In this mode, the engine takes only **one snapshot** at the start of your transaction and reuses it until you `COMMIT`.
##### Example: The Stable View
Let's run the same "Blink Test" in Repeatable Read:
| Time | Session A (Manager) | Session B (Supplier) |
| :--- | :--- | :--- |
| T1 | `BEGIN ISOLATION LEVEL REPEATABLE READ;` | |
| T2 | `SELECT price FROM ingredients WHERE id = 5;` <br> -- Returns 50.00 | |
| T3 | | `UPDATE ingredients SET price = 99.00 WHERE id = 5;` |
| T4 | | `COMMIT;` |
| T5 | `SELECT price FROM ingredients WHERE id = 5;` <br> **-- Returns 50.00** | |
| T6 | `COMMIT;` | |
Session A’s view is now a "Magic Painting." Even though the supplier committed a change at T4, Session A continues to use its T1 snapshot. It won't see the new price until it starts a fresh transaction.
> [!CAUTION]
> **The Serialization Error**: If Session A tried to `UPDATE` that Saffron record at T5, Postgres would raise an error: `could not serialize access due to concurrent update`. Since the engine can't reconcile Session A's old view with Session B's new reality, it forces a rollback.
#### 3. Serializable (The Single-File Door)
The most rigorous window is **Serializable**. It guarantees that the result of concurrent work is identical to what it would have been if every transaction had walked through the door one at a time. This level prevents the **Write Skew** anomaly.
##### Example: The Chef's Duty
The Cafe has a rule: "At least one chef must always be on call." Currently, two chefs (A and B) are active.
| Time | Session Manager 1 | Session Manager 2 |
| :--- | :--- | :--- |
| T1 | `BEGIN ISOLATION LEVEL SERIALIZABLE;` | `BEGIN ISOLATION LEVEL SERIALIZABLE;` |
| T2 | `SELECT count(*) FROM staff WHERE on_call;` <br> -- Returns 2 | |
| T3 | | `SELECT count(*) FROM staff WHERE on_call;` <br> -- Returns 2 |
| T4 | `UPDATE staff SET on_call = false WHERE id = 'Chef A';` | |
| T5 | | `UPDATE staff SET on_call = false WHERE id = 'Chef B';` |
| T6 | `COMMIT;` -- Succeeds | |
| T7 | | **`COMMIT;` -- ERROR!** |
In **Repeatable Read**, both would succeed, leaving 0 chefs on call—breaking the business rule.
In **Serializable**, the engine tracks that both managers relied on the same premise ("There are 2 chefs"). When Manager 1 commits, Manager 2’s premise is invalidated. Postgres detects this **Write Skew** and refuses to commit Session 2, forcing it to retry.
### Choosing Your Window
You can tell the engine exactly which window you want for your next transaction:
```sql
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
```
Remember: the clearer the window, the more work the engine has to do to keep it clean!
---
## 5.6 The Infinite Calendar (Transaction ID Wraparound)
![[assets/arch_xid_wraparound.png|450]]
Postgres is designed to live forever. But even the meticulous database engine's memory has a physical limit. In this section, we explore the database equivalent of the "Millennium Bug": **Transaction ID Wraparound**.
### The Box of Name Tags
Every time a new transaction starts, Postgres gives it a **Name Tag (Transaction ID)**. These tags are represented by a `uint32` number. As we've seen, these tags are crucial for **[[Manuscript/02 - Physical Storage & MVCC/2.5 - MVCC (The Sharpie Ledger)|MVCC]]**—they determine whether a suitcase is "visible" to you.
But a 32-bit number only goes up to roughly **4 billion**.
Imagine a cafe that has been open for a hundred years. Eventually, Postgres reaches Name Tag #4,294,967,295. It reaches into the box for the next tag, and... there are no more!
To save itself from an identity crisis, Postgres uses a **Circular Timeline**. It treats the 32-bit integer as a modular space.
**The Visibility Logic**: To decide if a transaction is in the past or the future, Postgres subtracts the two IDs and casts the result to a **signed 32-bit integer**.
- If the result is negative, the transaction is in the past.
- If positive, it is in the future.
This means that Postgres can only "remember" **2 billion** transactions into the past. Beyond that, the math flips, and a transaction from seventy years ago suddenly looks like a transaction from the distant future.
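Here is that signed-difference trick as a plain SQL sketch, emulating the 32-bit cast with modular arithmetic (the XIDs are illustrative):
```sql
-- Is XID 4294967290 (just before the wrap point) in the past of XID 5 (just after it)?
SELECT CASE WHEN d >= 2147483648 THEN d - 4294967296 ELSE d END AS signed_diff
FROM (SELECT (4294967290 - 5 + 4294967296) % 4294967296 AS d) AS wrap;
-- signed_diff = -11 -> negative -> yes, the numerically "bigger" XID is in the past
```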
To save us from this doom, the database performs a ritual known as **FREEZE**.
When a suitcase is old enough and we know it will never be modified again, the **Autovacuum** worker (the Housekeeper) visits the page. It doesn't just "clean"—it stamps the record as frozen, conceptually replacing the Transaction ID in the record header with a special reserved value: **`2` (FrozenTransactionId)**. (Modern versions set a "frozen" flag bit in the tuple header rather than overwriting the XID, but the effect is identical.)
**The Magic Number**: In the Postgres source code, XID `2` is hard-coded to be "older" than every other possible ID. Once a record is stamped with `2`, it becomes part of the timeless furniture of the Cafe. It will always be visible to every transaction, regardless of how many billions of years have passed on the circular calendar.
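You can also perform the ritual by hand; normally the Housekeeper does this on its own schedule:
```sql
-- Aggressively freeze every eligible tuple in a table (name illustrative)
VACUUM (FREEZE, VERBOSE) ingredients;
```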
To avoid the apocalypse, you must monitor the **Age of the Database**:
```sql
-- Checking the timeline distance to the End of the World
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```
- **`autovacuum_freeze_max_age`** (default: 200 million): When any table reaches this age, Postgres triggers a "wraparound prevention" vacuum. It will prioritize this work above almost everything else.
- **The Panic Zone (Age > 2 Billion)**: If the Housekeepers are blocked (perhaps by a long-running transaction from an elephant who fell asleep in the Looking Glass Room), the database will trigger a **Panic Shutdown**.
In panic mode, the meticulous database engine shuts down and refuses all new connections except for a single superuser session. It enters a read-only state until a manual `VACUUM` is performed to reset the calendar.
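Long before panic mode, a per-table variant of the age check will point to the culprits:
```sql
-- The five oldest tables in the current database
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 5;
```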
Transaction Wraparound is rare, but for massive, high-velocity databases, it is the ultimate "End Boss." By ensuring your housekeepers are healthy and your columns are regularly frozen, you ensure the Elephant Cafe can keep its doors open for a thousand years.
### Summary: The Reliability Checklist
Before you can truly sleep at night, ensure your database meets the **Postgres Reliability Standard**:
1. **Trust the Diary (`wal_level`)**: Is your Logbook detailed enough? Ensure it's set to `replica` (or `logical` if you're shouting news).
2. **Dry the Ink (`fsync`)**: Never turn off `fsync` unless you explicitly don't care about the cafe burning down.
3. **Mind the Space (`max_wal_size`)**: Ensure your pocket is big enough to hold the diary between checkpoints!
4. **Confirm the News (`Logical Replication`)**: Check your Heavy Metal Bookmarks (`pg_replication_slots`) for stale news that might fill up the pantry.
5. **Watch the Calendar (`datfrozenxid`)**: Monitor your Database Age to ensure the Big Bang isn't coming.
6. **The Atomic Seal (`BEGIN/COMMIT`)**: Keep your transactions short! Long-running sessions prevent the Housekeeper from cleaning and freezing.
By mastering the diary and the atomic seal, you've turned a wobbly pile of suitcases into a rock-solid monument of data.
---
In the next chapter, we're going to dive into the **Database Infrastructure**. We'll meet **The Hunger of Resources (Memory & Disk)** and see exactly how Postgres eats, breathes, and sweats CPU cycles.