02 - Physical Storage & MVCC

# Chapter 2: Physical Storage & MVCC ## 2.0 - Storage Foundations (The Building Blocks of Storage) <img src="assets/chap_1_storage_hierarchy.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> In the clean, abstract world of application logic, a query is just information. But to Postgres, data is a physical reality. It has a location on a disk, a specific byte size, and an associated cost to retrieve. ### What You'll Learn - How Postgres physically organizes data into **Tuples**, **Pages**, and **Relations** - Why the engine uses an append-only architecture instead of in-place updates - How **MVCC** enables concurrent reads without locking - What **TOAST** does when data exceeds the 8KB page limit Because writing and reading from persistent storage is one of the most computationally expensive operations a database performs, Postgres avoids unnecessary disk I/O at all costs. To achieve this, it relies on a strict physical hierarchy, organizing its data into standardized 8KB blocks known as **Pages**. ### The Logical vs. The Physical A common misconception is that the database operates directly on "rows" and "columns," as if managing a giant in-memory spreadsheet. In reality, a "Row" is a logical abstraction—a clean, version-less representation of data used by the application. But deep in the storage layer, Postgres manages **Tuples**. A Tuple is the physical realization of that data on disk, wrapped in headers that track its visibility and age. > [!IMPORTANT] > **The Storage Checkpoint**: If you remember one thing from this chapter, let it be this: **A Row is not what is stored.** A single logical row (the fact that "Babu is a Capybara") may be represented internally by dozens of physical tuples, scattered across different pages. ### The Append-Only Engine Postgres uses an **Append-Only** architecture for its primary storage. In the abstract world, a database "updates" a row. In the physical world, Postgres never updates anything in place. > [!NOTE] > **The Heap**: In database architecture, "The Heap" is the formal name for the primary storage file where table data lives. Unlike a sorted list, a Heap is fundamentally unordered—tuples are simply tossed into the first available gap. When you update a record, Postgres marks the old data as "dead" (obsolete) and writes a brand new version to a fresh location on disk. This is the foundation of **MVCC** (Multi-Version Concurrency Control). It allows one user to read the old version while another is writing the new one, ensuring that readers never block writers and writers never block readers. This system is the engine's primary strategy for avoiding the coordination cost of locking. ### The Storage Hierarchy The engine organizes data into a disciplined hierarchy to minimize the cost of fetching it: - **[[Manuscript/02 - Physical Storage & MVCC/2.1 - Data Types (The Ingredient Labels)|Data Types]]** — The bits and alignment of individual facts. - **[[Manuscript/02 - Physical Storage & MVCC/2.2 - Tuple (The Suitcase)|Tuples]]** — The physical realization of a row, wrapped in headers. - **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Pages]]** — The 8KB "shipping containers" that move between disk and memory. - **[[Manuscript/02 - Physical Storage & MVCC/2.4 - Relation (The Table)|Tables]]** — The collection of pages that form a logical entity. - **[[Manuscript/02 - Physical Storage & MVCC/2.5 - MVCC (The Sharpie Ledger)|MVCC]]** — The visibility rules that govern concurrent access. - **[[Manuscript/02 - Physical Storage & MVCC/2.6 - TOAST (The Separate Trailer)|TOAST]]** — The overflow valve for data that cannot fit in a page. ### The Result Set: A Logical Departure When you submit a query, you are asking Postgres to look through this messy physical heap and produce a temporary, highly filtered projection called a **[[Structures/Result Set|Result Set]]**. This is the point where the physical suitcase is finally unpacked and presented back to you as a weightless, logical row. The **Result Set** is decoupled from the physical reality on disk—it is a snapshot of truth, frozen in time and stripped of its bureaucratic headers. The core mechanism governing this behavior is **[[Architecture/MVCC|MVCC (Multi-Version Concurrency Control)]]**. By treating storage as a series of immutable versions rather than a mutable state, Postgres avoids the massive coordination overhead and locking required for in-place mutation. It simply lets the old data drift into the background (as "dead tuples"), leaving the cleanup to a dedicated background process called the **Autovacuum**. --- ## 2.1 - Data Types (Knicks, knacks, bits, and bobs) <img src="assets/bits_types_collection.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> Postgres refuses to store abstract concepts. It needs to know exactly what size "box" to put things in before it can start packing. In Postgres, every field is strictly defined by a **Data Type**. A type is essentially a very practical architectural contract that specifies two things: 1. **Storage Footprint**: How many bytes Postgres must set aside for this value in the physical record. 2. **Valid Operations**: What mathematical or logical actions can be performed on the data (for example, you can't multiply a `date` by `saffron`). --- ### The Architectural Contract Think of data types as physical boundaries. If you try to force an 8-byte payload into a 2-byte slot, Postgres will reject it. If you put a 1-byte payload into an 8-byte slot, Postgres will shake its head at the waste. Imagine storing a single `boolean` (1 byte) in a `bigint` column (8 bytes). This isn't just a waste of space—it significantly increases the I/O cost of every query. Because Postgres is "Strongly Typed," once a column is defined as an `integer`, it strictly rejects any input that does not conform to that contract. This structure allows the engine to optimize data layout and ensure high-speed retrieval. To maintain performance, you must be aware of the storage overhead. If you frequently specify 8-byte types for data that fits in 2 bytes, you are unnecessarily bloating your records and reducing the number of records that fit within a single **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Data Page]]**. --- ### Standard Inventory | Category | Technical Name | Physical Size | | :--- | :--- | :--- | | **Logic** | `boolean` | 1 byte | | **Numeric** | `smallint` | 2 bytes | | **Numeric** | `integer` | 4 bytes | | **Numeric** | `bigint` | 8 bytes | > [!NOTE] > **The `text` vs. `varchar` implementation**: In Postgres, `text` and `varchar` use the same underlying storage representation (**`varlena`**). Both use a dynamic structure preceded by a small header (the "tag") that tells Postgres the content length. The only difference is that `varchar(n)` forces Postgres to perform an additional length check before storing; `text` skips this check. --- ### Alignment Padding Physical records aren't packed perfectly end-to-end. To ensure high-speed memory access, the CPU requires that certain data types start at specific byte boundaries (e.g., an 8-byte `bigint` must start on a multiple of 8). If you pack a 1-byte `boolean` followed by an 8-byte `bigint`, Postgres inserts **7 bytes of alignment padding** so the `bigint` can align comfortably on its boundary. You can observe this overhead directly using `pg_column_size()`: ```sql -- A single boolean and a bigint: 1 + 7 (padding) + 8 = 16 bytes SELECT pg_column_size(row(true, 1::bigint)); -- Two bigints and two booleans: 8 + 8 + 1 + 1 = 18 bytes (no padding between columns) -- Note: the total row size will still be aligned to the MAX alignment of the types. SELECT pg_column_size(row(1::bigint, 2::bigint, true, false)); ``` #### Optimal Column Ordering: Optimizing for Space Because of padding, the **order** of your columns determines the final physical weight of your data. | Order Style | Example Sequence | Final Size | | :--- | :--- | :--- | | **Bad Packing** | `bool, bigint, bool, bigint` | 32 bytes | | **Good Packing** | `bigint, bigint, bool, bool` | 24 bytes | By grouping your heaviest types together at the start of your table, you can shrink your storage requirements by **25%** without losing a single bit of information. --- ### OIDs: Postgres's Secret Labels Postgres doesn't identify types by their names internally. To the engine, "integer" or "timestamp" are just human-readable aliases. Internally, every object in the database is assigned a unique, 4-byte numeric label called an **OID (Object Identifier)**. Think of an OID as the **Internal Serial Number** for code objects. Strings are heavy and expensive to compare in C; integers are light and constant. When the query planner needs to pass a type into a function, it doesn't pass the word "integer"—it passes the number **23**. #### The Unified Catalog Almost everything in Postgres has an OID: - **Tables** (Relations) in `pg_class`. - **Functions** in `pg_proc`. - **Indexes** in `pg_index`. - **Types** in `pg_type`. By labeling every concept with a consistent numeric ID, Postgres can use the same indexing machinery to manage its own internal blueprints as it uses to manage your actual data. > [!NOTE] > **The Limit of the ID**: OIDs are "Cluster-Global" 32-bit integers, meaning they wrap around at ~4.2 billion. While this seems infinite, the risk of "Wraparound" is why Postgres retired OIDs for identifying user records, reserving them safely for the internal catalog blueprints. --- --- ## 2.2 - Tuple (The Suitcase) <img src="assets/arch_tuple_suitcase.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> A **Tuple** is the smallest physical unit of storage in Postgres. When you insert a logical "row" via SQL, Postgres physically writes a tuple to the disk. Most heap tuples are stored as fixed, immutable hunks of bytes. They contain your user data, plus a metadata header. Think of it as a **Suitcase**. The engine packs your data into it, slaps a routing label on the outside, and stores it in the system. ### The Immutable Update What happens when you update a row? Postgres does not modify the data in place. It utilizes an **Append-Only** architecture. When you run an `UPDATE`: 1. The engine marks the original tuple as obsolete. 2. It writes a completely new tuple with the new data. 3. It appends the new tuple to a fresh location in the **Heap** (the main table storage). Because the old tuple is still physically there, other active queries can continue reading it. They do not have to wait for your update to finish. This design is the foundation of **Multi-Version Concurrency Control (MVCC)**. > [!NOTE] > **In PostgreSQL Terms** > * **Tuple**: The physical realization of a row. > * **Heap**: The unordered physical storage area where tuples are appended. > * **MVCC**: The visibility model that allows concurrent reading of old versions. ### Tuple Anatomy Standard heap tuples follow a fixed physical layout. They are divided into two sections: the **Header** (metadata) and the **User Data** (your columns). #### 1. The Header (The Metadata) Standard heap tuples begin with a fixed-size header (**23 bytes**). This header contains the status bits and transaction IDs that allow Postgres to manage concurrency without locking. #### 2. The User Data (The Payload) This is where your actual data lives. It follows the header and is laid out in the order defined in your `CREATE TABLE` statement. To understand how Postgres makes these decisions at microsecond scale, we must look at the physical anatomy of the bytes on the wire: ```text +---------------------------------+ <-- Tuple Start | HEADER (23 bytes) | (xmin, xmax, ctid, flags) +---------------------------------+ | NULL BITMAP (Optional) | (Null Presence map) +---------------------------------+ | [ ALIGNMENT PADDING ] | (Memory Alignment) +---------------------------------+ | | | USER DATA | (Payload) | [Slot 1] [Slot 2] ... | | | +---------------------------------+ <-- Tuple End ``` ### Lab: Inspecting the Physical Reality To see the physical reality of a tuple, we can peek at the "system columns" that Postgres attaches to every row. Let's welcome **Cashew the Capybara** to the cafe: ```sql INSERT INTO animals (name, species_id) VALUES ('Cashew', (SELECT id FROM species WHERE name = 'Capybara')); SELECT ctid, xmin, xmax, name FROM animals WHERE name = 'Cashew'; -- ctid | xmin | xmax | name -- ---------+------+------+-------- -- (73,75) | 850 | 0 | Cashew ``` #### Decoding the Identity - **`ctid (73,75)`**: This is the physical address. Cashew lives on **Page 73**, at **Slot 75**. - **`xmin 850`**: This tuple was created and committed by **Transaction 850**. - **`xmax 0`**: No one has updated or deleted Cashew yet. By inspecting these headers, the engine determines precisely where the tuple is located and whether it is "visible" to your current transaction. This allows Postgres to manage history without erasers, though it eventually requires a "Vacuum" to reclaim space from obsolete versions. --- --- ### The Tuple Lifecycle Because a tuple is the physical realization of a row, its lifecycle is governed by two major architectural disciplines: 1. **[[Manuscript/02 - Physical Storage & MVCC/2.2.1 - Visibility & System Columns|Visibility & System Columns]]**: How the engine uses hidden headers (`xmin`, `xmax`, `ctid`) to decide which transactions can see which version of a row. 2. **[[Manuscript/02 - Physical Storage & MVCC/2.2.2 - Storage Optimizations|Storage Optimizations]]**: how the engine packs these tuples efficiently using Null Bitmaps, Alignment, and specialized techniques like **HOT** and **TOAST**. --- ## 2.2.1 - Visibility & System Columns <img src="assets/arch_visibility_passport.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> Every tuple in the **Heap** carries a set of hidden markers that determine if a specific query is allowed to see it. These markers are the engine's primary tool for managing concurrency without locking. In the Elephant Cafe, we call these the **tuple attributes**. They tell the engine where the tuple came from, who owns it, and whether it has been logically retired. ### The System Columns There are three critical "System Columns" embedded in every tuple's header. You cannot see them in a `SELECT *`, but they are always there, guiding the engine's visibility logic. | Field | Size | Role | | :--- | :--- | :--- | | **`xmin`** | 4 bytes | The Transaction ID (XID) that **inserted** this tuple. | | **`xmax`** | 4 bytes | The Transaction ID that **deleted** or updated this tuple. | | **`ctid`** | 6 bytes | The physical location **(Page, Offset)** of this tuple. | To truly understand visibility, you must see it in action. Let's look at the physical identity of Babu the Elephant. Even though he looks like a weightless row in your application, he has a very specific physical coordinate and history. ```sql SELECT ctid, xmin, xmax, name FROM animals WHERE name = 'Babu'; ``` | ctid | xmin | xmax | name | | :---- | :--- | :--- | :--- | | (0,1) | 501 | 0 | Babu | #### Decoding the Attributes - **`ctid (0,1)`**: This tuple lives on **Page 0**, at **Slot 1**. The `ctid` is the physical address the engine uses to jump directly to the data. - **`xmin 501`**: This version of Babu was created by **Transaction 501**. - **`xmax 0`**: The `xmax` is empty. This means no transaction has deleted or updated this tuple yet. It is currently "Alive." ### The Multi-Version Reality Because Postgres uses **Append-Only** storage, an `UPDATE` is actually a two-step physical dance: 1. The engine sets the `xmax` of the old tuple to your current Transaction ID (marking it as dead). 2. The engine writes a brand-new tuple with a new `ctid` and sets its `xmin` to your current Transaction ID. > [!IMPORTANT] > **The Visibility Checkpoint**: Every query you run carries a **Snapshot**—a list of which transactions are finished and which are still in progress. Postgres compares your snapshot to the `xmin` and `xmax` of every tuple. If a tuple's `xmin` is from a finished transaction and its `xmax` is empty (or from a transaction that hasn't finished), you see it. Otherwise, it is invisible to you. This simple integer comparison is what allows one patron to see the "old" price of Saffron while another patron is in the middle of updating it. No locks are required; the engine simply looks at the tuple attributes and decides what's real for you. --- ## 2.2.2 - Storage Optimizations <img src="assets/arch_storage_opts.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> Once you understand the anatomy of a tuple and its visibility attributes, the question becomes one of efficiency. How does Postgres pack these suitcases as tightly as possible, and what happens when the data is simply too large to fit? ### Efficiency: Nulls and Alignment Postgres is designed to minimize storage overhead within each tuple. To do this, it uses two primary techniques: the **Null Bitmap** and **Memory Alignment**. #### The Null Bitmap (Free Space) If a column is NULL, Postgres doesn't waste any space in the User Data section of the tuple. Instead, it flips a single bit in a specialized map in the header called the **Null Bitmap**. - **The Payoff**: NULLs are virtually free. A table with 100 columns, 90 of which are NULL, takes up almost the same space as a table with only 10 columns. #### Memory Alignment (The 8-Byte Rhythm) The CPU is most efficient when reading data that starts on a "Natural Boundary" (usually 8 bytes). If a column ends on an odd byte, Postgres adds **Padding Bytes** to ensure the next column starts at the correct interval. - **The Optimization**: By grouping fixed-width columns of the same size together (e.g., placing all `bigint` and `timestamp` columns at the start of the table), you can reduce the amount of padding required, potentially saving megabytes of space on a large table. --- ### Advanced Optimizations For the extreme edges of performance, Postgres employs two specialized techniques to handle updates and large data. #### 1. Heap-Only Tuples (HOT) High-frequency updates can lead to "Index Bloat" as the engine creates new physical versions of rows. To mitigate this, Postgres uses **HOT Updates** to link new versions together on the same page without updating every index. We will explore the mechanics and performance implications of this in **[[Manuscript/06 - Resource Management & Processes/6.5 - Tuple Bloat (Garbage Collection)|Chapter 6.5 - Tuple Bloat (Garbage Collection)]]**. #### 2. TOAST (The Separate Trailer) Postgres has a hard limit: a tuple cannot exceed the size of a single **8KB Page**. But what if you need to store a 10MB JSON document or a high-resolution olfactory map? The engine uses **TOAST** (The Oversized-Attribute Storage Technique). If a value is too large (>2KB), Postgres "TOASTs" it: 1. The large value is compressed and moved to a separate, hidden "TOAST table." 2. The original tuple is left with an **18-byte "Claim Check"** (a pointer) in its place. 3. When you query the row, the engine automatically fetches the data from the trailer and presents it to you as if it were right there in the suitcase. > [!TIP] > This is why Postgres can handle massive blobs of data without slowing down sequential scans. The engine only "unpacks" the large TOAST data if you explicitly ask for that specific column. --- ## 2.3 - The Page (The Shipping Container) <img src="assets/arch_page_container.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> If Postgres requested tuples individually from disk, the I/O overhead would be catastrophic. Physical disks excel at moving large contiguous blocks, but they struggle with tiny, random requests. To solve this, the engine organizes storage into fixed-size **Pages** (or Blocks). By default, every page is **8KB**. Think of it as a **Shipping Container**. The engine moves containers, not individual items. Postgres reads and writes these 8KB chunks regardless of whether they hold one record or a thousand. The engine does not fetch individual rows from disk; it fundamentally fetches the 8KB page that contains the row. > [!NOTE] > **In PostgreSQL Terms** > * **Tuple**: The physical row record. > * **Page**: The 8KB physical block that holds multiple tuples. > * **I/O Unit**: Postgres fundamentally reads and writes in discrete 8KB blocks. ### Why 8KB? The Goldilocks Compromise Postgres chooses 8KB as the optimal balance for modern hardware: 1. **Hardware Alignment**: Most SSDs use 4KB physical sectors. An 8KB page fits cleanly into two sectors, avoiding slow and risky partial-sector writes. 2. **OS Harmony**: Linux manages memory in 4KB pages; an 8KB rhythm keeps the buffer pool and storage stack in lockstep. 3. **B-Tree Efficiency**: 8KB is large enough for high "fan-out," keeping search trees shallow—often just three or four hops to find a row in a billion-row table. In short: most trips to the disk are sized to align with the hardware's optimal I/O unit. The internal structure of a Postgres page is designed to maximize contiguous free space. This allows for efficient on-page defragmentation. The page uses a dual-directional growth strategy. **Item Identifiers** grow from the front of the page toward the back, while **Tuples** are written from the back toward the front. They grow toward each other into a shared gap of **Free Space**. By maintaining a single, contiguous gap in the middle, Postgres can easily accommodate new records while optimizing for memory alignment. ### The Schematic ```text +---------------------------------+ <-- Page Start (Offset 0) | PageHeaderData (24 bytes) | (Control metadata) +---------------------------------+ <-- pd_lower | Item Identifiers (4 bytes ea)| (Line pointers) | [1] [2] [3] [4] ... | (Grow DOWN ↓) +---------------------------------+ | | | FREE SPACE | (Contiguous space) | | +---------------------------------+ <-- pd_upper | | | TUPLES (Data) | (Physical records) | ... [4] [3] [2] [1] | (Grow UP ↑) | | +---------------------------------+ | Special Space (Optional) | (Index internal pointers) +---------------------------------+ <-- Page End (8,192 bytes) ``` ### The Parts of the Page 1. **The Page Header (`PageHeaderData`):** At the beginning of a standard page is a fixed 24-byte header. It records the page version, a checksum, and the memory offsets (`pd_lower` and `pd_upper`) that define the boundaries of the free space. The header also contains a marker tying the physical page to the database's recovery history. This helps the engine determine precisely which WAL records have already been applied to this specific page. 2. **The Item Identifiers (`ItemIdData`):** These act as an indirection layer. Each 4-byte identifier points to the physical offset of a tuple within the page. As new tuples are added, identifiers grow forward, moving the `pd_lower` pointer deeper into the page. > [!NOTE] > **Why the Indirection?**: This layer of indirection is vital for performance. It allows Postgres to "defragment" the page (moving tuples around to reclaim space) without changing the tuple's logical ID (the Item Identifier index). The outside world (indexes and table scans) continues to reference the Item ID, oblivious to the tuple's physical migration within the page. 3. **The Tuples:** The actual data records are persisted starting from the end of the page and moving toward the front. As new tuples arrive, they are placed in the next available space, shifting the `pd_upper` pointer backward. 4. **The Special Space:** At the end of the page is an optional area used primarily by **Index Pages** to store metadata like sibling pointers (to the left and right pages in a B-Tree), enabling fast transversal without returning to the index root. 5. **The Free Space:** This is the contiguous gap between `pd_lower` and `pd_upper`. When these two pointers meet, the page is considered full. No further tuples can be inserted until the **Autovacuum** worker reclaims space by removing tuples marked as dead. **The rule of thumb:** The Item Identifiers (pointers) and the Tuples (data) grow toward each other from opposite ends of the page. The page is "full" when they finally meet in the middle. ### Page Checksums To detect silent data corruption — bit-rot in the storage layer, incomplete writes, cosmic rays — Postgres can maintain a checksum on its pages. When a page is written to disk, the engine computes a CRC over its contents and stores it in `pd_checksum`. When the page is read back, the engine recomputes the checksum and compares. A mismatch raises a data corruption error rather than silently returning garbage to the user. If the power dies while the OS is in the middle of writing an 8KB page, the result is a **Torn Page**. This occurs because the underlying hardware often writes in smaller sectors (e.g., 512 bytes or 4KB). A crash leaves the page in a physically corrupted state, half-new and half-old. Postgres prevents this with **Full Page Writes (FPW)**. The first time a page is modified after a **[[Manuscript/05 - Durability & Transactions/5.2 - Crash Recovery (The Recovery Parade)|Checkpoint]]**, Postgres copies the *entire 8KB block* into the WAL. > [!TIP] > While FPW prevents corruption, it causes significant **Write Amplification**. A 1-byte change to a tuple can trigger an 8KB write to the WAL. Tuning the `checkpoint_timeout` is the primary way to manage this overhead. ### Inspecting a Page (`pageinspect`) You can peek at the actual physical layout of a page using the `pageinspect` extension: ```sql CREATE EXTENSION IF NOT EXISTS pageinspect; SELECT * FROM page_header(get_raw_page('animals', 0)); -- lsn | checksum | flags | lower | upper | special | pagesize -- ------------+----------+-------+-------+-------+---------+---------- -- 0/16A5E88 | 0 | 0 | 28 | 8160 | 8192 | 8192 ``` The output exposes the `lsn` (the last WAL record that touched this page), the `pagesize` (typically 8192 bytes), and the `lower`/`upper` pointers that bound the free space. This is the architecture of the page made visible from a SQL session. --- ## 2.4 - Relation (The Table) <img src="assets/arch_table_depot.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> While the **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Page]]** is Postgres's physical I/O unit, the **Table** (or Relation) provides the logical organization and schema mapping. ### The Heap: A Study in Disorder In the Elephant Cafe, a table is not an ordered list. It is a **Heap**—a physical file where data is stored with no inherent logical order. When you insert a new animal, Postgres does not try to find "the right spot" alphabetically; it simply looks for the first available gap in its shipping containers and tosses the tuple inside. This "toss it in" approach is why writing to a table is so fast, but it’s also why searching it is so slow. Because tuples are scattered based on arrival time and available space (managed by the **Free Space Map**), the engine can never "guess" where a specific record lives. ### The Logical Blueprint A table is defined by its **Schema**. This is the architectural blueprint that specifies which types of data are permitted in each column. Every tuple that enters the table must strictly adhere to this blueprint, or Postgres will reject it at the gate. This structure is what allows Postgres to be efficient. Because the database knows the exact size and type of every column, it can calculate physical offsets with mathematical precision, jumping directly to the data it needs without having to "read" every byte in between. ```sql -- Defining the schema for the species table CREATE TABLE species ( id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, name TEXT UNIQUE NOT NULL, diet_type diet_category NOT NULL ); ``` Once the schema is defined, Postgres allocates the initial data files required to persist incoming 8KB pages onto the disk. ### Physical Segmentation (Segments) A table is physically represented as a series of 8KB pages stored in a file. However, most modern filesystems get grumpy at extreme file sizes. To manage this, Postgres imposes a strict **1GB threshold** for any single physical file. If a table grows beyond 1GB, Postgres simply starts a new **Segment** file directly next door (e.g., `16385`, `16385.1`, `16385.2`). To the end-user, the table is a single logical entity; at the storage layer, it is a multi-file sequence of 1GB segments. You can inspect the literal file path of a relation on disk: ```sql -- Where is the 'animals' relation located on my disk? SELECT pg_relation_filepath('animals'); -- Output: base/13593/16385 ``` ### Table Forks: The Side Rooms A relation is not merely a single list of data pages. For performance and management, Postgres separates different functional requirements into distinct files called **Forks**, or specialized **Side Rooms** in the table structure. The primary data is stored in the **Main Fork**. However, Postgres maintains supplementary structures to optimize access: - **Free Space Map (`_fsm`)**: A quick sketch showing which pages have empty space, allowing Postgres to quickly locate a target for new tuples. - **Visibility Map (`_vm`)**: A bitmask indicating which pages hold ONLY fully alive, completely visible records. This is critical for **Index-Only Scans**; if Postgres knows the whole page is safe, it doesn't even have to read the table data to check visibility rules! > [!TIP] > **Segments vs. Forks**: Segments are continuations of the same file sequence (more space), while Forks are entirely different files serving specific functional requirements (like visibility tracking). ### Access Patterns: The Sequential Scan In the absence of an index, Postgres must perform a **Sequential Scan** (Full Table Scan). It begins reading at the first page of the relation and proceeds linearly through every segment until the predicate is satisfied or the end of the file is reached. For large relations, this results in significant I/O latency and CPU overhead as every page must be loaded into memory and processed. ### Storage Reclamation (The Autovacuum) Because data is abandoned rather than erased, tables naturally accumulate "dead" tuples over time—a state known as **Bloat**. To reclaim this space, the **Autovacuum** background process scans pages to identify dead tuples and mark their space as reusable in the **Free Space Map**. This prevents the physical file size from growing indefinitely. --- ## 2.5 - MVCC (The Sharpie Ledger) <img src="assets/arch_mvcc_sharpie_v3.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> Postgres utilizes **MVCC** (Multi-Version Concurrency Control). When a record is updated, the engine does not overwrite existing data. Instead, it marks the original version as obsolete and appends a new version to the heap. Think of it as a **Sharpie Ledger**: changes are never erased, only crossed out. Postgres enforces this using two transaction-ID markers in the tuple header: `xmin` (the creator) and `xmax` (the eraser). By appending new versions instead of updating in-place, the engine sidesteps the need for heavy table-level locks. This ensures that readers never block writers, and writers never block readers. > [!TIP] > **The Key Benefit**: In Postgres, **readers never block writers, and writers never block readers.** A reader holds onto its consistent snapshot of the old version while a writer is busy committing a new one. > [!NOTE] > **In PostgreSQL Terms** > * **Tuple**: The physical row version. > * **Heap**: The unordered physical table storage. > * **MVCC**: Multi-Version Concurrency Control (the visibility model). --- ### The Mechanics of Versioning When you execute an `UPDATE`, Postgres performs a three-step logical swap: 1. **Expire the current version**: The physical **Tuple** remains in place, but its header is modified. Postgres stamps `xmax` with the current Transaction ID. This signals that future transactions should treat the row as deleted. 2. **Write the new version**: A fresh tuple is appended to the next available space in the heap. Its `xmin` is set to the current Transaction ID. 3. **Physical Linkage**: The `ctid` (physical address) of the old tuple is updated to point to the new one. This creates a version chain that the engine can walk to find the latest record. ```mermaid sequenceDiagram participant W as Transaction 847<br/>(Writer) participant DB as Postgres Heap participant R as Transaction 848<br/>(Reader) Note over DB: [Tuple 1] xmin: 800, xmax: 0 W->>DB: UPDATE animals SET name = 'Gilly' Note over DB: [Tuple 1] xmax: 847 (Marked) Note over DB: [Tuple 2] xmin: 847 (New) R->>DB: SELECT * FROM animals Note right of R: Snapshot: [Active: 847] DB-->>R: Returns Tuple 1 Note right of R: Visibility: xmax (847) is in Active Set<br/>So Tuple 1 is still visible. W->>DB: COMMIT Note over R: Snapshot persists.<br/>Further reads still see Tuple 1. ``` Concurrent transactions see whichever version of the row matches their own visibility snapshot. They skip anything with an `xmin` greater than their own ID or an `xmax` that has already committed. --- ### 🧪 Lab Challenge: The Ghost in the Machine (Update Mechanics) **The Request**: "Prove that Postgres doesn't update rows in place. Show me the 'Ghost' of the old version." #### The Investigation First, find a specific animal and note its physical address (`ctid`) and creator (`xmin`). ```sql SELECT ctid, xmin, name FROM animals WHERE name = 'Glowy'; ``` **Result**: ```text ctid | xmin | name ---------+------+------- (73,73) | 846 | Glowy ``` Now, perform an `UPDATE` and check the same columns: ```sql UPDATE animals SET name = 'Glowing Gilly' WHERE name = 'Glowy'; SELECT ctid, xmin, name FROM animals WHERE name = 'Glowing Gilly'; ``` **Result**: ```text ctid | xmin | name ---------+------+--------------- (73,74) | 847 | Glowing Gilly ``` #### The Diagnosis The `ctid` moved from `(73,73)` to `(73,74)`. The engine didn't change the data at the old address; it wrote a **brand new tuple** at a new address. #### The Reward To see the "Ghost" left behind, we can peek at the hidden `xmax` of the old record (requires a specific query to bypass the normal visibility filters): ```sql -- Peeking at the expired version SELECT ctid, xmin, xmax, name FROM animals WHERE ctid = '(73,73)'; ``` **Result**: ```text ctid | xmin | xmax | name ---------+------+------+------- (73,73) | 846 | 847 | Glowy ``` The old version still exists on disk! Its `xmax` is now set to **847** (the ID of your update transaction). To your transaction, this row is a "Ghost"—it occupies physical space but is logically dead. > [!NOTE] > **Recap**: `UPDATE` = `INSERT` + `DELETE`. Every update creates a new physical record. The old one remains as "Bloat" until the **Autovacuum** reclaims the space. --- ### The Ghost of a ROLLBACK Because Postgres is append-only, it writes the physical tuple to disk *before* the transaction officially commits. If a transaction creates a new row and then encounters a `ROLLBACK`, that physical data still lives in the heap. | ctid | xmin | xmax | id | animal_id | status | Visibility | | :--- | :--- | :--- | :--- | :--- | :--- | :--- | | **(0,3)** | **102** | 0 | 2 | 8 | Salad | **Invisible (XID 102 aborted)** | No future transaction will ever see this record—the engine's visibility map knows that XID 102 aborted—but the bytes still occupy physical space in the 8KB page. These dead records are known as **Bloat**. This design prioritizes write speed today by deferring the cost of cleanup. However, it leaves a performance debt that must be settled later. We cover the process of reclaiming this space in detail in **[[Manuscript/06 - Resource Management & Processes/6.4 - Vacuum & Freezing (The Housekeepers)|Chapter 6]]**. --- --- ## 2.6 - TOAST (The Separate Trailer) <img src="assets/toast_dinosaur.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> Postgres is stubborn about its 8KB page limit. However, it must also handle large data types like 10MB JSON blobs. Instead of breaking the laws of storage physics, the engine delegates the problem. That delegation is called **TOAST** (*The Oversized-Attribute Storage Technique*). Think of it as a **Separate Trailer** parked behind the table. When an attribute is too large for the 8KB main page, Postgres compresses it and moves it out of the main heap. It leaves behind an 18-byte pointer—a "claim check"—in the original tuple. This design keeps the main heap compact and fast to scan. The trade-off is that retrieving the oversized column requires a second lookup. When an attribute exceeds roughly **2KB**, the engine works through a delegation strategy: 1. **Compression**: Postgres first tries to compress the value using a built-in algorithm (PGLZ or LZ4). If the compressed value fits in the page, it stays inline. 2. **External Storage**: if compression isn't enough, the value is split into chunks and stored in a dedicated *TOAST table* associated with the parent table. 3. **Indirection**: the original tuple is updated with an external pointer that references the OID of the TOAST table and the chunk identifiers. When you query the column, Postgres reassembles the chunks transparently. You see one logical record, while the engine fetches the chunks from external storage and reassembles them in memory. ### The Four Storage Strategies Each column declares a TOAST strategy that controls how the engine handles oversized values: | Strategy | Behavior | | :----------- | :------------------------------------------------------------------------------------------------ | | **PLAIN** | TOAST disabled entirely. If the row doesn't fit in 8KB, Postgres throws an error. | | **EXTENDED** | Attempts compression first; falls back to external storage if still too large. **Default** for variable-length types. | | **EXTERNAL** | Skips compression entirely and moves the value to external storage. Useful when the data is already compressed (e.g. JPEG, gzipped JSON). | | **MAIN** | Attempts compression but keeps the value inline as long as it possibly fits. | You can inspect (or override) the strategy per column with `ALTER TABLE ... SET STORAGE`. ### 🧪 Lab Challenge: TOASTing a Large Record **The Request**: "Store a massive amount of text in a single column. Prove that Postgres moves it to external storage when it gets too large." #### The Investigation We can use `pg_column_size()` to see the physical size of a value as it is stored in the tuple. ```sql -- 1. A small name fits comfortably inline SELECT pg_column_size('Cashew'::text) AS size; -- Result: 7 bytes ``` Now, let's create a massive biography. By default, Postgres uses the **EXTENDED** strategy, which attempts to compress the data first. ```sql -- 2. A 5,000 character biography is highly compressible SELECT pg_column_size(repeat('A', 5000)) AS size; -- Result: 69 bytes ``` #### The Diagnosis Wait—why is the 5,000-character string only 69 bytes? Postgres used its default compression (PGLZ). Because the compressed result (69 bytes) fits easily in the 8KB page, it stays **Inline**. To force the data into **external storage**, we must either exceed the compression limit or disable compression entirely. #### The Lazy Fix Disable compression for the column to force external storage: ```sql ALTER TABLE toast_test ALTER COLUMN biography SET STORAGE EXTERNAL; INSERT INTO toast_test (biography) VALUES (repeat('B', 5000)); -- Check the size now SELECT pg_column_size(biography) AS size FROM toast_test; ``` **Result**: ```text size ------- 5000 ``` #### The Reward At 5,000 bytes (uncompressed), this attribute has outgrown the ~2KB threshold. It has been physically moved to a **TOAST Table** (found in the `pg_toast` schema). The main table now only contains an 18-byte pointer. > [!WARNING] > **The `SELECT *` Penalty**: Because TOASTed values live out-of-line, every `SELECT *` on a large table forces the engine to fetch and reassemble chunks—even if you never read that column. This "TOAST Tax" can significantly degrade performance on tables with many large text or JSONB fields.