# Chapter 2: Physical Storage & MVCC ## Chapter 2: The Building Blocks of Storage ![[assets/chap_1_storage_hierarchy.png|450]] In the clean, abstract world of application logic, a query is just information. But to Postgres, data is a physical reality. It has a location on a disk, a specific byte size, and an associated cost to retrieve. Because writing and reading from persistent storage is one of the most computationally expensive operations a database performs, Postgres operates with a philosophy of **Meticulous Laziness**. It avoids unnecessary disk I/O at all costs. To achieve this, it relies on a strict physical hierarchy, organizing its data into standardized 8KB blocks known as **Pages** (or the Cafe's **Shipping Containers**). ### The Logical vs. The Physical A common misconception is that the database operates directly on "rows" and "columns," as if managing a giant in-memory spreadsheet. In reality, a "Row" is a logical abstraction. When you execute a query, you are interacting with a **Row**—a clean, version-less representation of data. But deep in the storage layer, Postgres manages **Tuples**. A Tuple is the physical realization of that row on disk. Think of the Tuple as an **immutable suitcase** packed into the engine's internal **Heap**. Because Postgres uses an **Append-Only** architecture, it treats these physical suitcases as fixed blocks. It does not overwrite the contents of a suitcase when a record is updated. Instead, it marks the old tuple as dead and writes a brand new version to a fresh location. Consequently, a single logical "Row" requested by a user may be represented internally by multiple physical "Tuples," only one of which is currently visible to your transaction. ### The Storage Hierarchy Postgres organizes the **Elephant Cafe** into a disciplined hierarchy designed to minimize the cost of fetching data: 1. 
**[[Manuscript/02 - Physical Storage & MVCC/2.1 - Data Types (Knicks, knacks, bits, and bobs)|Data Types]] (The Bits & Bobs):** The smallest unit. Every fact has a specific byte alignment and a label. 2. **[[Manuscript/02 - Physical Storage & MVCC/2.2 - Tuple (The Physical Suitcase)|Tuples]] (The Suitcases):** The actual record, wrapped in header metadata that tracks its visibility and transaction ID. 3. **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Pages]] (The Shipping Containers):** The standard unit of I/O. Postgres reads and writes data in strict 8KB pages. It cannot fetch a single tuple without loading the entire 8KB container into memory. 4. **[[Manuscript/02 - Physical Storage & MVCC/2.4 - Relation (The Table)|Tables]] (The Relation):** A logical collection of these 8KB pages spanning physical files on the disk. 5. **[[Manuscript/02 - Physical Storage & MVCC/2.5 - MVCC (The Sharpie Ledger)|The Sharpie Ledger]] (MVCC):** The engine's system for managing concurrent access by never erasing old tuples, relying instead on visibility rules. 6. **[[Manuscript/02 - Physical Storage & MVCC/2.6 - TOAST (The Separate Trailer)|TOAST]] (The Trailer):** A specialized storage system for data that is too large to fit inside a single 8KB page. ### The Result Set: A Logical Departure When you submit a query, you are asking Postgres to look through this messy physical heap and produce a temporary, highly filtered projection called a **[[Structures/Result Set|Result Set]]**. This is the point where the physical suitcase is finally unpacked and presented back to you as a weightless, logical row. The **Result Set** is decoupled from the physical reality on disk—it is a snapshot of truth, frozen in time and stripped of its bureaucratic headers. The core mechanism governing this behavior is **[[Architecture/MVCC|MVCC (Multi-Version Concurrency Control)]]**. 
By treating storage as a series of immutable versions rather than a mutable state, Postgres avoids the massive coordination overhead and locking required for in-place mutation. It simply lets the old data drift into the background (as "dead tuples"), leaving the cleanup to a dedicated background process called the **Autovacuum**. --- ## 2.1 Data Types (Storage Footprints & Contracts) ![[assets/bits_types_collection.png|450]] Postgres refuses to store abstract concepts. It needs to know exactly what size box to put things in. In Postgres, every field is strictly defined by a **Data Type**. A type is essentially a very practical architectural contract that specifies two things: 1. **Storage Footprint**: How many bytes Postgres must set aside in the physical suitcase (the Tuple). 2. **Valid Operations**: What mathematical or logical actions can be performed on the data (you can't multiply a "date" by "saffron"). Think of data types as **Architectural Contracts**. Every piece of data needs a specific amount of physical space. If you try to force an 8-byte payload into a 2-byte slot, Postgres will reject it; if you put a 1-byte payload into an 8-byte slot, Postgres will shake its head at the waste. Imagine storing a single `boolean` (1 byte) in a `bigint` column (8 bytes). This isn't just a waste of space—it significantly increases the I/O cost of every sequential scan. Because Postgres is a "Strongly Typed" system, once a column is defined as an `integer`, it strictly rejects any input that does not conform to that type's storage contract. This rigidity is what allows Postgres to optimize data layout and ensure high-speed retrieval. To maintain high performance, you must be aware of the storage overhead of your chosen types. 
If you frequently specify 8-byte types for data that fits in 2 bytes, you are unnecessarily bloating the Tuple and reducing the number of records that fit within a single **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Data Page]]**. Here are the standard items Postgres is used to packing: | Category | Technical Name | Physical Size | | :--- | :--- | :--- | | **Logic** | `boolean` | 1 byte | | **Numeric** | `smallint` | 2 bytes | | **Numeric** | `integer` | 4 bytes | | **Numeric** | `bigint` | 8 bytes | > [!NOTE] > **The `text` vs. `varchar` implementation**: In Postgres, `text` and `varchar` use the same underlying storage representation (**`varlena`**). Both use a dynamic structure that expands to fit the content, preceded by a small header (the "tag") that tells Postgres exactly how many bytes follow. The only functional difference is that `varchar(n)` forces Postgres to perform an additional length check before storing the value; `text` skips this check. ### The Invisible Air: Alignment Padding One of the most surprising facts about physical Tuples is that they aren't packed perfectly end-to-end. To ensure high-speed memory access, the CPU requires that certain data types start at specific byte boundaries (e.g., an 8-byte `bigint` must start on a multiple of 8). If you pack a 1-byte `boolean` followed by an 8-byte `bigint`, Postgres doesn't just squeeze them together. It inserts **7 bytes of "Invisible Air"** (Alignment Padding) so the `bigint` can align comfortably on its 8-byte boundary. > [!IMPORTANT] > **The Column Tetris**: Because of padding, the **order** of your columns determines the final physical weight of your Tuple. > - **Bad Packing**: `bool, bigint, bool, bigint` = 32 bytes of data. > - **Good Packing**: `bigint, bigint, bool, bool` = 24 bytes of data. > By simply grouping your heaviest types together at the start, you can shrink your storage requirements by 25% without losing a single bit of information. 
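The Column Tetris arithmetic is easy to verify yourself. The sketch below is illustrative — the `bad_packing` / `good_packing` tables are invented for the demonstration — and uses `pg_column_size` on a whole row, which reports the tuple's size including its header:

```sql
-- Two hypothetical layouts of the same four columns
CREATE TABLE bad_packing  (a boolean, b bigint, c boolean, d bigint);
CREATE TABLE good_packing (b bigint, d bigint, a boolean, c boolean);

INSERT INTO bad_packing  (a, b, c, d) VALUES (true, 1, true, 2);
INSERT INTO good_packing (b, d, a, c) VALUES (1, 2, true, true);

-- Whole-row size = tuple header + payload (+ alignment padding).
-- bad_packing carries 14 bytes of "Invisible Air" (7 after each boolean);
-- good_packing carries none between its columns.
SELECT pg_column_size(bp) AS bad_bytes  FROM bad_packing  bp;
SELECT pg_column_size(gp) AS good_bytes FROM good_packing gp;
```

The two queries should differ by exactly the padding the column order forced Postgres to insert.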
#### Auditing the Air

You can see the literal bytes of a suitcase using the `pg_column_size` function. Let's look at the `animals` table:

```sql
-- Check the padding in our current suitcase
SELECT
    id,
    pg_column_size(id) AS id_bytes,
    pg_column_size(name) AS name_bytes,
    pg_column_size(species_id) AS spec_bytes,
    pg_column_size(created_at) AS date_bytes,
    pg_column_size(a) AS total_tuple_bytes
FROM animals a
LIMIT 1;
```

---

### OIDs: Postgres's Secret Labels

Postgres does not identify data types by their names. To the engine, "integer" or "timestamp with time zone" are just human-readable aliases. Internally, every object in the database is assigned a unique, 4-byte numeric label called an **OID (Object Identifier)**.

Think of an OID as the **Internal Serial Number** for code objects.

#### Why not just use strings?

Because Postgres is written in **C**, efficiency is paramount. Strings are heavy, variable-length, and expensive to compare. An integer, however, is light and constant. When the query planner or the executor needs to pass a data type into a function, it doesn't pass the word "integer"—it passes the number **23**. This drastically reduces the overhead of internal operations.

### The Unified Catalog

The power of the OID lies in its **Unification**. It isn't just for data types; almost everything in the Cafe has an OID.

- **Tables** (Relations) have OIDs in the `pg_class` catalog.
- **Functions** have OIDs in the `pg_proc` catalog.
- **Indexes**, being relations themselves, also have OIDs in `pg_class`; their index-specific details live in the `pg_index` catalog.
- **Types** have OIDs in the `pg_type` catalog.

By labeling every diverse concept in the database with a consistent numeric ID, Postgres can use the same indexing and retrieval machinery to manage its own internal blueprints as it uses to manage your animal records.

> [!NOTE]
> **The Limit of the ID**: OIDs are "Cluster-Global," meaning they are unique across all databases in a single Postgres instance (the Cafe). They are not "Universe-Global" like a UUID.
Because they are 32-bit integers, the internal counter wraps around at roughly 4.3 billion. While 4 billion tables or data types might seem infinite, the risk of "Wraparound" is why Postgres retired OIDs as the primary way to identify user records, reserving them safely for the catalog blueprints.

You can inspect the secret OID for any type using a cast:

```sql
-- What is Postgres's secret name for an integer?
SELECT 'integer'::regtype::oid;

-- Literal Output: 23
```

---

### Composite Types: Groups of Bits

If standard types aren't enough, you can group them together into a **Composite Type**.

```sql
CREATE TYPE contact_info AS (
    phone TEXT,
    email TEXT
);
```

In fact, Postgres is so consistent in its type system that **every table you create is automatically treated as a composite type**. When you create the `animals` table, you also create a new data type label that Postgres can use to pass an entire "animal" record around as a single value.

By understanding these "knicks and knacks," we can begin to see how Postgres builds the massive, complex world of the database from strict, well-labeled byte contracts.

---

## 2.2 Tuple (The Physical Suitcase)

![[assets/arch_tuple_suitcase.png|450]]

When dealing with the physical storage of Postgres, the concept of a "row" is an abstraction. Down in the storage layer, there are only **Tuples**. A Tuple is the smallest physical unit of storage in the heap.

Think of it as a **Physical Suitcase**—a rigid, immutable hunk of bytes packed with your record and its accompanying metadata header. Once a tuple is committed, Postgres treats it as a permanent record; it cannot be unlatched to change a single byte.

Because Postgres utilizes an append-only architecture, it treats these suitcases as immutable. If a value changes, Postgres doesn't unpack the old suitcase; it simply leaves it where it sits, marks it as obsolete, and packs a brand new suitcase in a fresh location.
To Postgres, a single logical "Row" is actually a stack of multiple physical "Tuples" representing history.

---

### The Tuple Structure

Every tuple follows a rigid physical layout defined by the Postgres source code (specifically `htup_details.h`).

### The Schematic

```text
+---------------------------------+ <-- Tuple Start
|  t_xmin       (4 bytes)         |  (Inserting Transaction ID)
+---------------------------------+
|  t_xmax       (4 bytes)         |  (Deleting Transaction ID)
+---------------------------------+
|  t_cid        (4 bytes)         |  (Command Identifier)
+---------------------------------+
|  t_ctid       (6 bytes)         |  (Physical self-pointer/successor)
+---------------------------------+
|  t_infomask2  (2 bytes)         |  (Attribute count & flags)
+---------------------------------+
|  t_infomask   (2 bytes)         |  (Status bitflags)
+---------------------------------+
|  t_hoff       (1 byte)          |  (Offset to user data)
+---------------------------------+
|  NULL BITMAP (Optional)         |  (Null presence map)
+---------------------------------+
|  [ PADDING / ALIGN ]            |  (Memory alignment)
+---------------------------------+
|                                 |
|  USER DATA                      |  (Payload)
|  [Slot 1] [Slot 2] ...          |
|                                 |
+---------------------------------+ <-- Tuple End
```

> [!NOTE]
> Each column (slot) within the tuple is accessed via its offset. The engine uses the **Table Schema** to calculate the jumping distance required to find a specific attribute, ensuring it never misinterprets bytes from one column as another.

---

Every Tuple begins with a fixed-size header known as `HeapTupleHeaderData`. You can visualize this 23-byte structure as the **[[Structures/Tuple|Passport]]** of the record.

This header acts as the identity card that Postgres uses to enforce **Visibility**. By checking the stamps in the header, the engine can determine whether a specific transaction is allowed to see this record or if it should be treated as obsolete. This mechanism allows multiple patrons to read the same table simultaneously without ever needing to lock each other out.
| Field | Size | Technical Description |
| :------------ | :------ | :--------------------------------------------------------------------- |
| `t_xmin` | 4 bytes | The Transaction ID (XID) that inserted (packed) this tuple. |
| `t_xmax` | 4 bytes | The Transaction ID that marked this tuple for disposal (deletion). |
| `t_cid` | 4 bytes | The specific command within the transaction that acted on it. |
| `t_ctid` | 6 bytes | The physical location (Page #, Offset) of this tuple or its successor. |
| `t_infomask2` | 2 bytes | The number of attributes, plus additional flag bits. |
| `t_infomask` | 2 bytes | Bitflags indicating transaction statuses and tuple constraints. |
| `t_hoff` | 1 byte | The offset from the start of the tuple to the user data. |

> [!NOTE]
> The `t_ctid` acts as a physical pointer defined as `(blockNumber, offsetNumber)`. If a row is updated, Postgres doesn't move the tuple; it packs a new one and updates the old `t_ctid` to point to the new physical location. This chain of pointers allows Postgres to follow the MVCC history to the newest version of reality.

#### System Columns

Because of the engine's architectural requirements, every tuple carries hidden **System Columns** that you did not explicitly define in your `CREATE TABLE` statement. These columns are necessary for Postgres to manage visibility and physical location.

If you need to inspect these fields for debugging or performance tuning, you can explicitly request them in your query:

```sql
-- Asking for the hidden passengers
SELECT ctid, xmin, xmax, name
FROM animals
WHERE name = 'Babu';
```

Postgres will return something like this:

| ctid | xmin | xmax | name |
| :---- | :--- | :--- | :--- |
| (0,1) | 501 | 0 | Babu |

* **`ctid (0,1)`**: This is the literal physical address within the table: **Page 0, Offset 1**.
* **`xmin 501`**: This is the ID of the transaction that "packed" this suitcase (inserted the row).
* **`xmax 0`**: Since this is 0, it means nobody has "marked" this suitcase for deletion yet. It is a live, happy row!
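You can even watch `xmax` get stamped before a delete commits. A sketch with two concurrent sessions against the same `animals` row (the XID `512` is illustrative):

```sql
-- Session A: mark the suitcase for disposal, but don't commit yet
BEGIN;
DELETE FROM animals WHERE name = 'Babu';

-- Session B: the row is still visible here, but its xmax now carries
-- the ID of the (still uncommitted) deleting transaction
SELECT ctid, xmin, xmax, name FROM animals WHERE name = 'Babu';
-- ctid  | xmin | xmax | name
-- (0,1) | 501  | 512  | Babu

-- Session A: change of heart
ROLLBACK;
-- The Sharpie mark stays on disk, but it is now ignored: visibility
-- checks discover that transaction 512 aborted, and the row lives on.
```

This is MVCC in miniature: the stamp is written optimistically, and the CLOG decides later whether anyone should respect it.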
#### Speeding Up the Second Read (Hint Bits)

When a transaction commits, it updates the **[[Manuscript/05 - Durability & Transactions/5.4 - Transactions (The Atomic Seal)|CLOG (Commit Log)]]**, a master registry of completed transactions. However, Postgres does not immediately go back and update every physical tuple that transaction touched—that would cause massive I/O overhead.

Instead, the status of the suitcase is updated on demand. The first person to read a suitcase after a transaction finishes has to do a "transaction check" against the CLOG. Once they confirm the suitcase is committed, they "hint" the result by flipping a bit in the **`t_infomask`**.

> [!TIP]
> **Performance Logic**: This is why the very first `SELECT` after a massive data load often feels sluggish. Postgres is busy writing "Hint Bits" to thousands of suitcases. Subsequent reads find these hints and skip the check, which is why your second search is always lightning fast.

#### The Invisible Handshake (Heap Only Tuples - HOT)

In the standard MVCC model, every time you update a row, every single **[[Manuscript/03 - Access Paths & Indexing/3.0 - Indexes (The Mighty Shortcuts)|Index]]** pointing to that row must also be updated. This creates massive "Index Bloat." To prevent this, Postgres uses **HOT (Heap Only Tuples)**.

If the following two conditions are met:

1. The new version of the suitcase fits on the **same 8KB page** as the old one.
2. None of the columns currently indexed were changed.

Then Postgres performs an "Invisible Handshake." Postgres links the old tuple directly to the new one using the **`t_ctid`** pointer *inside the page*. The indexes continue to point to the old tuple, and the executor simply follows the internal chain to the new data. No index update required!

### The Null Bitmap

If a table contains nullable columns, Postgres avoids wasting space on those empty fields. Instead, it maintains a **Null Bitmap** immediately following the tuple header.
If a bit is set to 0, the engine knows the corresponding attribute is NULL and skips over it entirely during the de-serialization process. **NULLs in Postgres are essentially free**—they consume almost no physical space within the tuple payload.

### Memory Alignment (Padding)

Postgres requires data to be aligned on specific memory boundaries (usually 8-byte chunks) for efficient CPU access. If the total size of the Tuple (Header + Payload) does not end on a clean boundary, the engine inserts **Padding Bytes** (zeros) into the tuple to maintain alignment.

> [!WARNING]
> A table with `(int, char(1), int)` takes up more space than `(int, int, char(1))` because Postgres must add padding to align the `int` fields on 4 or 8-byte boundaries.
>
> **The Strategy**: Keep your large items (`bigint`, `timestamp`) together at the start, and your smaller items (`boolean`, `smallint`) together at the end. This reduces "fragmentation" and keeps your suitcases lean.

What happens when an attribute is too large to fit within a standard page (typically > 2KB)? Instead of allowing a single record to bloat a page, Postgres offloads the oversized data to a side-table—the **TOAST table**—and leaves an 18-byte "claim check" (a pointer) in the original tuple. TOAST (**The Oversized-Attribute Storage Technique**) is the mechanism that allows you to store large strings and JSON blobs without breaking the 8KB [[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Page]] limit. It is the art of delegation.
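Before we leave the suitcase, the "NULLs are essentially free" claim is easy to sketch with `pg_column_size` on constructed rows. Exact byte counts vary slightly with the composite-row header, so only the ordering matters here:

```sql
-- An all-NULL row is the smallest: NULL attributes occupy only bits
-- in the null bitmap, never payload bytes
SELECT pg_column_size(ROW(1::bigint, 2::bigint))        AS no_nulls,
       pg_column_size(ROW(1::bigint, NULL::bigint))     AS one_null,
       pg_column_size(ROW(NULL::bigint, NULL::bigint))  AS all_nulls;
```

You should see the size shrink by eight bytes with each `bigint` you replace with a NULL.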
--- | [[Manuscript/02 - Physical Storage & MVCC/2.1 - Data Types (Knicks, knacks, bits, and bobs)|2.1 Data Types (Knicks, knacks, bits, and bobs)]] | [[Manuscript/00 - Introduction/Index|Home]] | [[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|2.3 The Page (The Shipping Container)]] | --- ## 2.3 The Page (The Shipping Container) ![[assets/arch_page_container.png|450]] If Postgres had to request every single tuple from the storage layer individually, the I/O overhead would be catastrophic. To survive the sheer volume of data, Postgres organizes its physical files into fixed-size units called **Pages** (or Blocks). By default, every Page is exactly **8KB**, acting as a rigid, standardized **Shipping Container**. Think of it this way: Postgres can only lift 8KB at a time. Whether the container is full of a thousand tiny tuples or a single large record, the engine always interacts with the disk in these 8KB increments. This standardized "Shipping Container" allows Postgres to optimize I/O performance by aligning with the block size of the underlying operating system and hardware. ### The 8KB "Sweet Spot" You might wonder why Postgres settled on exactly 8KB. Why not as small as a peanut hull or as large as a full-size freight train? If we look at the extremes, the brilliance of the 8KB standard—the "Goldilocks" of storage—becomes clear. ### The Naive Model: The Cost of Extremes - **If Pages were Tiny (e.g., 512 bytes)**: To read a 1MB table, Postgres would have to make 2,048 separate requests to the operating system. The "Chatter" (system call overhead) would be catastrophic, and more space would be spent on Headers than on actual data. - **If Pages were Huge (e.g., 1 megabyte)**: If you wanted to read a single 100-byte animal record, Postgres would be forced to lift 1MB of data into memory. 
This leads to **Wasted I/O** and quickly clogs the **[[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Warming Rack)|Shared Buffers]]**. Even worse, a power failure mid-write (a **Torn Page**) could corrupt 1MB of your world at once. ### The Real Model: The 8KB Compromise Postgres chooses 8KB because it is perfectly balanced for modern hardware: 1. **Hardware Alignment**: Most modern disks and SSDs use 4KB physical sectors. An 8KB page fits exactly into two physical sectors, ensuring the engine never performs "Sub-block" writes—which are slow, complex, and risky. 2. **OS Harmony**: Linux and other OSes typically manage memory in 4KB pages. By matching this rhythm, Postgres ensures that its internal warming racks (Shared Buffers) work in perfect sync with the underlying kernel. 3. **B-Tree Shallowing**: 8KB is just enough space to store hundreds of "pointers" to other pages. In **[[Manuscript/03 - Access Paths & Indexing/3.1 - B-Tree (The Balanced Bookshelf)|Chapter 3.1]]**, we will see that this high "fan-out" keeps our search trees shallow—allowing Postgres to find a single needle in a billion-row haystack in just 3 or 4 "hops." This standardized footprint is what keeps the Cafe running smoothly, ensuring that every trip to the disk is exactly as heavy as it needs to be, and not a byte more. ### The Page Layout The internal structure of a Postgres page is designed to maximize contiguous free space and allow for on-page defragmentation. The page utilizes a **Pincer Growth** strategy. The **Item Identifiers** (pointers) grow from the front of the page towards the back, while the actual **Tuples** are written from the back of the page towards the front. They grow toward each other into the shared **Free Space**. By maintaining a single, contiguous gap in the middle of the page rather than scattered holes, Postgres can easily accommodate new records and optimize for memory alignment. 
### The Schematic ```text +---------------------------------+ <-- Page Start (Offset 0) | PageHeaderData (24 bytes) | (Control metadata) +---------------------------------+ <-- pd_lower | Item Identifiers (4 bytes ea)| (Line pointers) | [1] [2] [3] [4] ... | (Grow DOWN ↓) +---------------------------------+ | | | FREE SPACE | (Contiguous space) | | +---------------------------------+ <-- pd_upper | | | TUPLES (Data) | (Physical records) | ... [4] [3] [2] [1] | (Grow UP ↑) | | +---------------------------------+ | Special Space (Optional) | (Index internal pointers) +---------------------------------+ <-- Page End (8,192 bytes) ``` ### The Parts of the Page 1. **The Page Header (`PageHeaderData`):** At the beginning of every page is a fixed 24-byte header. It records the page version, a checksum, and the memory offsets (`pd_lower` and `pd_upper`) that define the current boundaries of the free space. Crucially, it also contains the **`pd_lsn`**, which stores the **Log Sequence Number (LSN)** of the last Write-Ahead Log (WAL) record that modified this page. This ties the physical page state to the database's recovery diary (covered in detail in **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Pocket Diary)|Chapter 5.1]]**). 2. **The Item Identifiers (`ItemIdData`):** These act as an indirection layer. Each 4-byte identifier acts as a pointer to the physical offset of a tuple within the page. As new tuples are added, new identifiers are created, moving the `pd_lower` pointer deeper into the page. > [!NOTE] > **Why the Indirection?**: This layer of indirection is vital for performance. It allows Postgres to "defragment" the page (moving tuples around to reclaim space) without changing the tuple's logical ID (the Item Identifier index). The outside world (indexes and table scans) continues to reference the Item ID, oblivious to the tuple's physical migration within the page. 3. 
**The Tuples:** The actual data records are persisted starting from the end of the page and moving toward the front. As new tuples arrive, they are placed in the next available space, shifting the `pd_upper` pointer backward.

4. **The Special Space:** At the end of the page is an optional area used primarily by **Index Pages** to store metadata like sibling pointers (to the left and right pages in a B-Tree), enabling fast traversal without returning to the index root.

5. **The Free Space:** This is the contiguous gap between `pd_lower` and `pd_upper`. When these two pointers meet, the page is considered full. No further tuples can be inserted until **The Autovacuum** reclaims space by removing tuples marked as dead.

### The Integrity Check (Checksums)

To ensure the "Shipping Container" hasn't been damaged while sitting in the Filing Cabinet, Postgres optionally uses **Page Checksums**. When a page is written to disk, Postgres calculates a mathematical signature based on its contents and stores it in the **`pd_checksum`** field of the header.

When the page is read back into memory, Postgres re-calculates the signature. If they don't match, Postgres sounds the alarm (a data corruption error) rather than returning garbage to the user.

### The Half-Written Container (Torn Pages)

If the power dies while the OS is in the middle of writing an 8KB page to the disk, you might end up with a **Torn Page**—a container that is half-full of new data and half-full of old data. This page is physically corrupted and unusable.

To prevent this, Postgres uses **Full Page Writes (FPW)**. The first time an 8KB page is modified after a **[[Manuscript/05 - Durability & Transactions/5.2 - Crash Recovery (The Recovery Parade)|Checkpoint]]**, Postgres doesn't just record the change in its recovery diary—it copies the **entire 8KB container** into the WAL file. If a crash occurs, Postgres ignores the "torn" page on the disk and restores the clean, 8KB version from the diary.
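Both defenses can be checked from any SQL session. `data_checksums` is fixed for the life of the cluster (chosen at `initdb --data-checksums` time), while `full_page_writes` is an ordinary setting that defaults to on:

```sql
-- Were page checksums enabled when this cluster was initialized?
SHOW data_checksums;      -- on / off

-- Is torn-page protection (Full Page Writes) active?
SHOW full_page_writes;    -- on (the default)
```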
### The Inspection Tool (`pageinspect`) If you want to inspect these physical structures directly, you can use the **`pageinspect` extension**. This tool allows you to examine the binary layout of a page from within a SQL session. ```sql -- First, enable the extension CREATE EXTENSION IF NOT EXISTS pageinspect; -- Peek at the top-level stats of Page 0 of the 'animals' table SELECT * FROM page_header(get_raw_page('animals', 0)); -- Results: -- lsn | checksum | flags | lower | upper | special | pagesize -- ------------+----------+-------+-------+-------+---------+---------- -- 0/16A5E88 | 0 | 0 | 28 | 8160 | 8192 | 8192 ``` In the output above, you can see the **`lsn`** (the last diary entry), the **`pagesize`** (exactly 8192 bytes, or 8KB), and the **`lower`** and **`upper`** pointers that define the Free Space. It’s the ultimate proof that the shipping container is real! --- ## 2.4 The Table (The Relation) ![[assets/arch_table_depot.png|450]] While the **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Page]]** is Postgres's physical I/O unit, the **Table** (or Relation) provides the logical organization and schema mapping. Think of the Table as the massive architectural structure where all our shipping containers are stacked and organized. ### The Logical Blueprint A table is defined by its **Schema**. This is the rigid, architectural blueprint that specifies which types of data are permitted in each column. Every tuple that enters the table must strictly adhere to this blueprint, or Postgres will reject it at the gate. This rigid structure is what allows Postgres to be efficient. Because the database knows the exact size and type of every column, it can calculate physical offsets with mathematical precision, jumping directly to the data it needs without having to "read" every byte in between. 
```sql -- Defining the schema for the species table CREATE TABLE species ( id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, name TEXT UNIQUE NOT NULL, diet_type diet_category NOT NULL ); ``` Once the schema is defined, Postgres allocates the initial data files required to persist incoming 8KB pages onto the disk. ### Physical Segmentation (Segments) A table is physically represented as a series of 8KB pages stored in a file. However, most modern filesystems get grumpy at extreme file sizes. To manage this, Postgres imposes a strict **1GB threshold** for any single physical file. If a table grows beyond 1GB, Postgres simply starts a new **Segment** file directly next door (e.g., `16385`, `16385.1`, `16385.2`). To the end-user, the table is a single logical entity; at the storage layer, it is a multi-file sequence of 1GB segments. You can inspect the literal file path of a relation on disk: ```sql -- Where is the 'animals' relation located on my disk? SELECT pg_relation_filepath('animals'); -- Output: base/13593/16385 ``` ### Table Forks: The Side Rooms A relation is not merely a single list of data pages. For performance and management, Postgres separates different functional requirements into distinct files called **Forks**, or specialized **Side Rooms** in the table structure. The primary data is stored in the **Main Fork**. However, Postgres maintains supplementary structures to optimize access: - **Free Space Map (`_fsm`)**: A quick sketch showing which pages have empty space, allowing Postgres to quickly locate a target for new tuples. - **Visibility Map (`_vm`)**: A bitmask indicating which pages hold ONLY fully alive, completely visible records. This is critical for **Index-Only Scans**; if Postgres knows the whole page is safe, it doesn't even have to read the table data to check visibility rules! > [!TIP] > **Segments vs. 
Forks**: Segments are continuations of the same file sequence (more space), while Forks are entirely different files serving specific functional requirements (like visibility tracking).

### Access Patterns: The Sequential Scan

In the absence of an index, Postgres must perform a **Sequential Scan** (Full Table Scan). It begins reading at the first page of the relation and proceeds linearly through every segment, evaluating the predicate against each tuple, until it reaches the end of the file. For large relations, this results in significant I/O latency and CPU overhead, as every page must be loaded into memory and processed.

### Storage Reclamation (The Autovacuum)

Because data is abandoned rather than erased, tables naturally accumulate "dead" tuples over time—a state known as **Bloat**. To reclaim this space, the **Autovacuum** background process scans pages to identify dead tuples and mark their space as reusable in the **Free Space Map**. This prevents the physical file size from growing indefinitely.

In the next chapter, we will look at **Indexes**—the specialized data structures that allow Postgres to bypass the sequential scan and access targeted tuples with minimal I/O.

---

## 2.5 MVCC (The Sharpie Ledger)

![[assets/arch_mvcc_sharpie_v3.png|450]]

Postgres adheres to a fundamental architectural principle: data is never mutated in-place. When a record is updated or deleted, Postgres does not overwrite the existing bytes on the data page. Instead, it utilizes an append-only **Sharpie Ledger**—marking the old version as obsolete and writing a newly versioned record elsewhere in the heap.

At the architecture layer, this mechanism is known as **MVCC** (Multi-Version Concurrency Control). By persisting historical versions of data, Postgres avoids the massive performance penalty of "in-place" updates—which would require locking the entire table while rewriting the physical disk.
This strategy ensures that **Readers never block Writers, and Writers never block Readers.** A reader can continue to look at an old version of a suitcase while a writer is busy packing the new one in a different container.

### The Art of Abandonment

When you execute an UPDATE statement, Postgres performs a silent swap:

1. **Expire the current version**: The existing physical Tuple (the 'suitcase') is left in its current location, but its header is modified. Postgres marks the suitcase with a thick red Sharpie (**`xmax`**), signaling that future transactions should ignore it.
2. **Write the new version**: A fresh physical suitcase is packed and placed in the next available space in the table heap. This new tuple carries the current Transaction ID in its **`xmin`** (The Packer) field.
3. **The Metadata Stamp**: The **`ctid`** (physical address) of the old suitcase is updated to point to the address of the new one, creating a trail of clues that leads to the newest version.

The result is architectural peace: transactions that began reading before the swap continue to see the old version of the record until they take a fresh snapshot after the update commits.

### Watching the Sharpie in Action

Let’s watch Postgres handle an order update in our cafe. Imagine **Babu the Elephant** has a pending order:

```sql
-- Check the current state of an order
SELECT ctid, xmin, xmax, status FROM orders WHERE id = 1;

-- Results:
-- ctid  | xmin | xmax | status
-- -------+------+------+---------
-- (0,5) | 600  | 0    | Pending
```

Now, the chef finishes the dish, and we update the status:

```sql
UPDATE orders SET status = 'Served' WHERE id = 1;

-- Now look again...
SELECT ctid, xmin, xmax, status FROM orders WHERE id = 1;

-- Results:
--  ctid  | xmin | xmax | status
-- -------+------+------+--------
--  (0,6) |  601 |    0 | Served
```

**Wait, what happened to (0,5)?** If we could look at the "hidden" deleted rows, we'd see that `(0,5)` now has an `xmax` of `601` (the transaction that replaced it). The old suitcase is still there, scribbled on with the invisible Sharpie. How very civil!

> [!NOTE]
> **XID Logic: The Sequential Diary**: You can think of Transaction IDs (XIDs) as ever-increasing numbers. By comparing a tuple's `xmin` (created) and `xmax` (deleted) against the current transaction horizon, Postgres can instantly know if a record was packed "before" or "after" now.

### The Physical Reality (Before & After)

Let's look at what Postgres actually sees in the database. Imagine we have an order for **Babu the Elephant** in our `orders` table.

#### State 1: The Initial Insert

The moment an order is added (Transaction 100), the table looks like this:

| ctid | xmin | xmax | id | animal_id | status | Visibility |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **(0,1)** | **100** | **0** | 1 | 5 | Pending | **Visible to All (XID 100+)** |

*Notice that **xmax is 0**. This means the order is alive and well!*

#### State 2: After an UPDATE

Now, the chef serves the dish and we update the status (Transaction 101). Postgres **leaves the old row** and **writes a new one**:

```sql
UPDATE orders SET status = 'Served' WHERE id = 1;
```

Now, the physical table heap has **two** rows for the same order:

| ctid | xmin | xmax | id | animal_id | status | Visibility |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **(0,1)** | **100** | **101** | 1 | 5 | Pending | **Invisible to new transactions** |
| **(0,2)** | **101** | **0** | 1 | 5 | Served | **Visible to All (XID 101+)** |

Postgres uses the **`xmax`** of the first row to determine that it is no longer visible to new patrons. To the outside world, there is only one "Order #1." But in the database, the history of the world is written in **Sharpie-marked records**!

### The Ghost of a ROLLBACK

Now, imagine a transaction creates a new row but suddenly aborts (**ROLLBACK**).
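In SQL terms, the doomed transaction might look like this (a sketch, reusing our hypothetical `orders` table):

```sql
BEGIN;

-- Transaction 102 packs a brand new suitcase into the heap...
INSERT INTO orders (id, animal_id, status) VALUES (2, 8, 'Salad');

-- ...then changes its mind. The tuple's bytes are already sitting on a
-- page, so Postgres simply marks transaction 102 as aborted and moves on.
ROLLBACK;
```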
Because Postgres is append-only, it *still wrote the physical Tuple to the disk* before the transaction aborted!

| ctid | xmin | xmax | id | animal_id | status | Visibility |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **(0,3)** | **102** | 0 | 2 | 8 | Salad | **Invisible to All (Aborted)** |

Even though this transaction aborted, the physical tuple **(0,3)** remains in the database heap. This data is effectively **Invisible Ink**—no future transaction will ever include it in a result set, but it still consumes space in its 8KB page until the **[[Manuscript/06 - Resource Management & Processes/6.3 - The Housekeepers (Vacuum & Freezing)|Autovacuum]]** identifies and reclaims it.

### The Cost of Garbage

This "Append-Only" lifestyle makes Postgres incredibly fast and consistent, but it leaves the table heap covered in old, crossed-out suitcases. In technical terms, we call this **Bloat**. Doing the least work possible today (just crossing items out) ensures that Postgres doesn't have to wait around for erasers, even if it requires the Autovacuum background process to help with the cleanup tomorrow.

---

## 2.6 TOAST (The Separate Trailer)

![[assets/toast_dinosaur.png|450]]

Postgres maintains a rigid 8KB page size, but modern workloads often require storing attributes (such as JSONB blobs or large text fields) that significantly exceed this limit. Instead of throwing an error, Postgres quietly delegates the problem using **TOAST** (The Oversized-Attribute Storage Technique).

Think of TOAST as a **Separate Trailer**. Instead of trying to force a 10MB record into an 8KB Page, Postgres intercepts the oversized attribute, compresses it, and moves it to a dedicated side-table. In the original Tuple, Postgres leaves behind a tiny, 18-byte **[[Architecture/TOAST|Claim Check]]** (a pointer) that allows it to reconstruct the full attribute only when requested.
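To see the Claim Check at work, here is a sketch (the `menu_items` table and its contents are hypothetical):

```sql
-- A hypothetical table with a potentially huge attribute
CREATE TABLE menu_items (id int, description text);

-- ~10MB of text cannot possibly fit inside one 8KB page...
INSERT INTO menu_items VALUES (1, repeat('spicy hay ', 1000000));

-- ...yet this works transparently: Postgres follows the Claim Check,
-- reassembles the chunks from the trailer, and hands back one value.
SELECT length(description) FROM menu_items WHERE id = 1;
```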
Postgres handles TOAST automatically so the user never has to worry about the physical constraints of the disk. When a row grows beyond roughly **2KB** (about a quarter of a page), Postgres reaches for its "delegation" strategy, starting with the row's widest attributes:

1. **Compression**: Postgres first attempts to compress the data using a built-in compression algorithm (PGLZ or LZ4). If the compressed data fits comfortably in the physical page, the job is done.
2. **External Storage**: If the compressed data is still too large, it is split into chunks and moved to the specialized TOAST table (the Separate Trailer).
3. **Indirection**: The original tuple is updated with a **Claim Check** (an External Pointer), which includes the OID of the TOAST table and the specific identifier for the chunks.

When you query the data, Postgres uses the Claim Check to fetch the contents of the trailer and present them to you as a single, coherent record. You can verify whether a table has an associated TOAST table by inspecting the `reltoastrelid` column of the `pg_class` system catalog.

### The Three Styles of Packing

| Policy | Metaphor | Technical Behavior |
| :----------- | :---------------------- | :---------------------------------------------------------------------------------------- |
| **PLAIN** | **Strict Suitcase** | No TOAST allowed. If the row exceeds 8KB, Postgres throws an error. |
| **EXTENDED** | **Compress & Trailer** | Attempts compression first; if still too large, moves data to the trailer. (Default) |
| **EXTERNAL** | **Straight to Trailer** | Moves data to the trailer immediately without attempting compression. |

> [!NOTE]
> The **MAIN** strategy attempts to compress the data but keeps it in-line within the original tuple as long as possible, only moving it out-of-line if there is absolutely no other way to fit the row into the page.

### Query Performance & The Dinosaur Penalty

TOAST is a performance optimization for scanning, as it keeps the **Main Fork** of the table small.
Postgres can scan through the table at high speed because it only has to look at the large, TOASTed attributes when you explicitly ask for them. However, this creates a hidden penalty for broad queries:

> [!WARNING]
> **The `SELECT *` Penalty**: Think of `SELECT *` as asking Postgres to fetch every single attribute, including the giant **Dinosaur** in the trailer, even if you only wanted to check the metadata.
>
> Every time you ask for the TOASTed column, Postgres has to walk to the trailer, tape the fragmented chunks together, and carry it back. Keep your queries specific to maintain high performance.
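A minimal illustration of the penalty (assuming a hypothetical `menu_items` table whose `description` column has been toasted):

```sql
-- Cheap: only the main fork is read; the trailer stays shut
SELECT id FROM menu_items;

-- Expensive: every Dinosaur is fetched from the trailer,
-- decompressed, and stitched back together
SELECT * FROM menu_items;
```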