# Write-Ahead Logging (WAL) **Write-Ahead Logging (WAL)** is the standard method for ensuring data integrity and durability in PostgreSQL. It ensures that changes to data files (tables and indexes) are logged before the changes are actually applied to the data files themselves. For the narrative explanation of this concept, see [[Chapter 4/4.1 - The Pocket Diary (WAL & fsync)|4.1 Write-Ahead Logging (The Pocket Diary)]]. ## Core Principles 1. **Durability**: By logging changes to a durable log file first, Postgres can recover from a crash by replaying the log (redo) to bring the data files to a consistent state. 2. **Sequential I/O**: Writing to the WAL is a sequential operation, which is significantly faster than the random I/O required to update various pages across large data files. 3. **Atomic Commits**: A transaction is considered "committed" only after its WAL records have been successfully flushed to the Container Depot (via `fsync`). ## The Anatomy of the Log To truly understand how Postgres achieves its legendary durability and replication speed, we must look at the physical structure of the Write-Ahead Log. ### 1. The Log Sequence Number (LSN) The **LSN (Log Sequence Number)** is the absolute coordinate system of the Postgres universe. It is a 64-bit integer (`uint64`) that represents a specific byte-offset within the infinite stream of WAL data generated since the database cluster was initialized. - Because Postgres only ever appends to the WAL, the LSN is permanently increasing. It never goes backward. - Every single change to the database is stamped with an LSN. - Data pages (Shipping Containers) store the LSN of the last WAL record that modified them in their header (`pd_lsn`). This allows the system to compare the physical Page's LSN against the WAL's LSN during crash recovery to determine if a replay is necessary, guaranteeing **Idempotent Recovery**. ### 2. The XLogRecord Every individual change (a "scribble") is written as an **`XLogRecord`** C-struct. This is the fundamental unit of the WAL. An `XLogRecord` contains: - **`xl_tot_len`**: The total length of the record. - **`xl_xid`**: The Transaction ID that generated the record. - **`xl_prev`**: A pointer back to the previous WAL record for this transaction, allowing Postgres to walk backward during transaction aborts. - **`xl_info`**: Metadata identifying exactly which resource manager (e.g., Heap, B-Tree, GIN) is responsible for applying the change. - **Payload**: The actual binary block references and data payloads representing the change. ### 3. WAL Segments (`pg_wal`) The infinite stream of WAL data must be stored physically on the host OS. Postgres chops this stream into manageable **16MB Segment Files**, stored in the `pg_wal` (formerly `pg_xlog`) directory. - The filenames appear as 24-character hex strings (e.g., `000000010000000000000001`), which uniquely encode the Timeline ID and the logical segment identifier. - When a 16MB file fills up, Postgres immediately opens the next one. - During a **Checkpoint**, the background writer ensures that all data corresponding to old WAL segments is safely fsync'd into the main table files. Once verified, those old 16MB segments are either deleted or recycled (renamed for future use) to prevent the `pg_wal` directory from filling up the disk. ## Maintenance and Operations - **Checkpoint**: The process of flushing all "dirty" data pages from memory to the Container Depot and recording a safe point in the WAL. This allows old WAL segments to be recycled. - **Full Page Writes (FPW)**: To prevent data corruption from **partial page writes** (torn pages), the first change to a page after a checkpoint requires the entire 8KB page to be logged in the WAL. - **The Risk**: Filesystems (like ext4 or XFS) typically write in 4KB blocks. If the power fails between the first and second 4KB write of an 8KB Postgres page, the data file is left with a corrupted, "half-old, half-new" page. - **The Solution**: During recovery, if a page is torn, Postgres overwrites it with the full-page image from the WAL before replaying the remaining scribbles. - **Performance**: High `full_page_writes` activity ("WAL bloat") occurs immediately after a checkpoint. Tuning the `checkpoint_timeout` and `max_wal_size` is critical for managing this I/O spike. - **Archiving**: WAL segments can be shipped to a remote location for Point-In-Time Recovery (PITR) and replication. ## See Also - [[Architecture/Transactions|Transactions]] - [[Architecture/MVCC|MVCC]] - [[Resources/Disk IO|Disk I/O and Performance]]