# Chapter 2: Physical Storage & MVCC
## 2.0 - Storage Foundations (The Building Blocks of Storage)
<img src="assets/chap_1_storage_hierarchy.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
In the clean, abstract world of application logic, a query is just information. But to Postgres, data is a physical reality. It has a location on a disk, a specific byte size, and an associated cost to retrieve.
### What You'll Learn
- How Postgres physically organizes data into **Tuples**, **Pages**, and **Relations**
- Why the engine uses an append-only architecture instead of in-place updates
- How **MVCC** enables concurrent reads without locking
- What **TOAST** does when data exceeds the 8KB page limit
Because writing and reading from persistent storage is one of the most computationally expensive operations a database performs, Postgres avoids unnecessary disk I/O at all costs. To achieve this, it relies on a strict physical hierarchy, organizing its data into standardized 8KB blocks known as **Pages**.
### The Logical vs. The Physical
A common misconception is that the database operates directly on "rows" and "columns," as if managing a giant in-memory spreadsheet. In reality, a row is a logical abstraction—a clean, version-less representation of data used by the application.
Deep in the storage layer, Postgres manages **Tuples** (the physical records on disk) packed into **Pages** (8KB blocks of memory and disk).
> [!NOTE] The Click
> **Concept**: A row is logical; a tuple is physical.
> **Payoff**: A row is the version-less abstraction that humans see. A tuple is a physical byte payload on disk wrapped in transaction headers. Because Postgres never edits data in-place, a single logical row is often represented by multiple physical tuples (old, current, and deleted versions) scattered across different disk pages.
### The Append-Only Engine
Postgres uses an **Append-Only** architecture for primary storage. In the physical world, Postgres never updates data in place. When you run an update, the engine marks the old tuple as "dead" (obsolete) and writes a brand new version to a fresh location in the **Heap** (the unordered physical file where table data lives).
This Multi-Version Concurrency Control (MVCC) ensures that readers never block writers, and writers never block readers, avoiding the heavy coordination cost of locking.
### The Storage Hierarchy
To minimize the cost of fetching data, the engine organizes storage into a disciplined hierarchy:
- **[[Manuscript/02 - Physical Storage & MVCC/2.1 - Data Types (Knicks, knacks, bits, and bobs)|Data Types]]** — The bits and alignment of individual facts.
- **[[Manuscript/02 - Physical Storage & MVCC/2.2 - Tuple (The Suitcase)|Tuples]]** — The physical realization of a row, wrapped in headers.
- **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Pages]]** — The 8KB "shipping containers" that move between disk and memory.
- **[[Manuscript/02 - Physical Storage & MVCC/2.4 - Relation (The Table)|Tables]]** — The collection of pages that form a logical entity.
- **[[Manuscript/02 - Physical Storage & MVCC/2.5 - MVCC (The Sharpie Ledger)|MVCC]]** — The visibility rules that govern concurrent access.
- **[[Manuscript/02 - Physical Storage & MVCC/2.6 - TOAST (The Separate Trailer)|TOAST]]** — The overflow valve for data that cannot fit in a page.
### The Result Set: Unpacking the Suitcase
When you submit a query, the engine scans these physical pages and produces a temporary projection called a **[[Structures/Result Set|Result Set]]**. This is where physical tuples are finally unpacked and presented back to you as weightless, logical rows, stripped of their bureaucratic headers. Dead tuples are left behind in the heap to be cleaned up later by the **Autovacuum** process.
---
## 2.1 - Data Types (Knicks, knacks, bits, and bobs)
<img src="assets/bits_types_collection.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Postgres refuses to store abstract concepts. It needs to know exactly what size "box" to put things in before it can start packing.
In Postgres, every field is strictly defined by a **Data Type**. A type is essentially a very practical architectural contract that specifies two things:
1. **Storage Footprint**: How many bytes Postgres must set aside for this value in the physical record.
2. **Valid Operations**: What mathematical or logical actions can be performed on the data (for example, you can't multiply a `date` by `saffron`).
---
### The Architectural Contract
Think of data types as physical boundaries. If you try to force an 8-byte payload into a 2-byte slot, Postgres will reject it. If you put a 1-byte payload into an 8-byte slot, Postgres will shake its head at the waste.
Imagine storing a single `boolean` (1 byte) in a `bigint` column (8 bytes). This isn't just a waste of space—it significantly increases the I/O cost of every query. Because Postgres is "Strongly Typed," once a column is defined as an `integer`, it strictly rejects any input that does not conform to that contract. This structure allows the engine to optimize data layout and ensure high-speed retrieval.
To maintain performance, you must be aware of the storage overhead. If you frequently specify 8-byte types for data that fits in 2 bytes, you are unnecessarily bloating your records and reducing the number of records that fit within a single **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Data Page]]**.
---
### Standard Inventory
| Category | Technical Name | Physical Size |
| :--- | :--- | :--- |
| **Logic** | `boolean` | 1 byte |
| **Numeric** | `smallint` | 2 bytes |
| **Numeric** | `integer` | 4 bytes |
| **Numeric** | `bigint` | 8 bytes |
> [!NOTE]
> **The `text` vs. `varchar` implementation**: In Postgres, `text` and `varchar` use the same underlying storage representation (**`varlena`**). Both use a dynamic structure preceded by a small header (the "tag") that tells Postgres the content length. The only difference is that `varchar(n)` forces Postgres to perform an additional length check before storing; `text` skips this check.
---
### Alignment Padding
Physical records aren't packed perfectly end-to-end. To ensure high-speed memory access, the CPU relies on alignment. Modern processors fetch memory in standardized chunks (typically 32-bit or 64-bit words). If a data type like an 8-byte `bigint` is unaligned and spans across two words, the CPU is forced to execute two separate memory reads and combine the fragments using bit-shifts—a silent CPU latency tax. To avoid this, Postgres forces data types to start at specific byte boundaries corresponding to their size (e.g., a `bigint` must start on a multiple of 8).
If you pack a 1-byte `boolean` followed by an 8-byte `bigint`, Postgres inserts **7 bytes of alignment padding** so the `bigint` can align comfortably on its boundary.
You can observe this overhead directly using `pg_column_size()`:
```sql
-- A single boolean and a bigint: 1 + 7 (padding) + 8 = 16 bytes
SELECT pg_column_size(row(true, 1::bigint));
-- Two bigints and two booleans: 8 + 8 + 1 + 1 = 18 bytes (no padding between columns)
-- Note: the total row size will still be aligned to the MAX alignment of the types.
SELECT pg_column_size(row(1::bigint, 2::bigint, true, false));
```
#### Optimal Column Ordering: Optimizing for Space
Because of padding, the **order** of your columns determines the final physical weight of your data.
| Order Style | Example Sequence | Final Size |
| :--- | :--- | :--- |
| **Bad Packing** | `bool, bigint, bool, bigint` | 32 bytes |
| **Good Packing** | `bigint, bigint, bool, bool` | 24 bytes |
By grouping your heaviest types together at the start of your table, you can shrink your storage requirements by **25%** without losing a single bit of information.
---
### OIDs: Postgres's Secret Labels
Postgres doesn't identify types by their names internally. To the engine, "integer" or "timestamp" are just human-readable aliases. Internally, every object in the database is assigned a unique, 4-byte numeric label called an **OID (Object Identifier)**.
Think of an OID as the **Internal Serial Number** for code objects. Strings are heavy and expensive to compare in C; integers are light and constant. When the query planner needs to pass a type into a function, it doesn't pass the word "integer"—it passes the number **23**.
#### The Unified Catalog
Almost everything in Postgres has an OID:
- **Tables** (Relations) in `pg_class`.
- **Functions** in `pg_proc`.
- **Indexes** in `pg_index`.
- **Types** in `pg_type`.
By labeling every concept with a consistent numeric ID, Postgres can use the same indexing machinery to manage its own internal blueprints as it uses to manage your actual data.
> [!NOTE]
> **The Limit of the ID**: OIDs are "Cluster-Global" 32-bit integers, meaning they wrap around at ~4.2 billion. While this seems infinite, the risk of "Wraparound" is why Postgres retired OIDs for identifying user records, reserving them safely for the internal catalog blueprints.
---
---
## 2.1.1 - System Catalogs (The Engine Queries Itself)
<img src="assets/arch_system_catalogs.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Up until now, we've described the engine as a machine with hidden gears — tuples packed into pages, types enforced by contract, OIDs assigned as secret labels. But Postgres has a strange architectural habit: it stores the blueprints for the machine *inside the machine itself*.
Tables, types, indexes, functions, schemas, and roles are not magical C-struct internals buried in the engine's source code. They are rows in system tables — tables that live in the `pg_catalog` schema and are queryable with ordinary SQL.
This is not a convenience feature. It is a foundational design decision. The engine manages its own infrastructure using the same relational model it uses to manage your data. The same indexing, the same visibility rules, the same query planner. Postgres does not have a separate metadata engine. It *is* its own metadata engine.
---
### The Mystery
Consider what happens when you run:
```sql
CREATE TABLE capybaras (
id INT,
name TEXT
);
```
You just created a thing. But where did Postgres *remember* this? Not philosophically — literally. There is no configuration file that was appended to. No XML registry. No hidden binary blob.
The answer:
```sql
SELECT relname, relkind, reltuples
FROM pg_class
WHERE relname = 'capybaras';
```
```text
relname | relkind | reltuples
------------+---------+-----------
capybaras | r | 0
```
The table you created became a **row**. In another table. That is the click.
---
### `pg_class` — The Ledger of Things
`pg_class` is the central registry of *every named object* in the database. Every table, index, sequence, view, materialized view, and TOAST table is a row in `pg_class`.
```sql
-- What kinds of objects live in pg_class?
SELECT relkind, count(*)
FROM pg_class
GROUP BY relkind
ORDER BY count DESC;
```
The `relkind` column tells you what category of object the row represents:
| `relkind` | Meaning |
| :--- | :--- |
| `r` | Ordinary table (relation) |
| `i` | Index |
| `S` | Sequence |
| `v` | View |
| `t` | TOAST table |
| `m` | Materialized view |
Every object that has a physical presence on disk — a file in `base/` — has a corresponding row here. The column `relfilenode` tells you which file it maps to. The column `relpages` tells you how many 8KB pages it currently occupies. The column `reltuples` is the planner's estimate of how many rows are in the table — the number the **Query Planner** uses to decide between a Sequential Scan and an Index Scan.
> [!NOTE]
> **This is why `ANALYZE` matters.** When you run `ANALYZE`, Postgres samples the table and updates `pg_class.reltuples` and `pg_class.relpages`. If these numbers are stale, the planner's cost model is working with outdated physics, and it will choose bad plans.
---
### `pg_attribute` — The Column Registry
The table exists. Fine. But where are its columns?
```sql
SELECT attname, atttypid::regtype, attnum
FROM pg_attribute
WHERE attrelid = 'capybaras'::regclass
AND attnum > 0 -- exclude system columns
AND NOT attisdropped -- exclude dropped columns
ORDER BY attnum;
```
```text
attname | atttypid | attnum
---------+----------+--------
id | integer | 1
name | text | 2
```
Every column in every table in the database is a row in `pg_attribute`. The `attrelid` foreign key points back to `pg_class.oid`. The `atttypid` foreign key points to `pg_type.oid`.
The physical column ordering — `attnum` — is what determines the byte layout inside each tuple. When the engine needs to extract column 2 from a tuple, it reads the type's `typlen` and `typalign` from `pg_type` to calculate the exact byte offset. This is why column ordering affects alignment padding, as we saw in **[[Manuscript/02 - Physical Storage & MVCC/2.1 - Data Types (Knicks, knacks, bits, and bobs)|2.1 Data Types]]**.
> [!TIP]
> **System columns live here too.** Remember `ctid`, `xmin`, `xmax` from the tuple header? They are rows in `pg_attribute` with negative `attnum` values. Try removing the `attnum > 0` filter and you'll see the engine's own bookkeeping columns appear.
---
### `pg_type` — The Type Registry
Why does `text` mean anything? Because every type in Postgres — built-in and user-defined — is a row in `pg_type`:
```sql
SELECT typname, typlen, typalign, typcategory
FROM pg_type
WHERE typname IN ('int4', 'text', 'bool', 'timestamptz');
```
```text
typname | typlen | typalign | typcategory
--------------+--------+----------+-------------
bool | 1 | c | B
int4 | 4 | i | N
text | -1 | i | S
timestamptz | 8 | d | D
```
The `typlen` column is the physical footprint: `4` means four bytes, `-1` means variable-length (a `varlena`), `-2` means null-terminated C string. The `typalign` column (`c`, `s`, `i`, `d`) tells the engine what byte boundary the type requires — the same alignment rules we covered in the padding discussion.
This is the table the OIDs from **[[Manuscript/02 - Physical Storage & MVCC/2.1 - Data Types (Knicks, knacks, bits, and bobs)|2.1]]** point to. When the engine sees `atttypid = 23` in `pg_attribute`, it looks up OID 23 in `pg_type` and finds `int4` — four bytes, integer alignment. The entire type system is a lookup table.
---
### `pg_namespace` — The District Map
Schemas in Postgres are organizational boundaries — they separate objects with the same name into distinct namespaces. Every schema is a row in `pg_namespace`:
```sql
SELECT nspname, nspowner::regrole
FROM pg_namespace
WHERE nspname NOT LIKE 'pg_toast%'
AND nspname NOT LIKE 'pg_temp%';
```
```text
nspname | nspowner
----------------+------------
pg_catalog | postgres
public | pg_database_owner
information_schema | postgres
```
When you write `SELECT * FROM animals`, Postgres resolves `animals` by searching your `search_path` — a list of schemas to check in order. The default is `"$user", public`, which means Postgres first looks for a schema matching your role name, then falls back to `public`.
This is why `public.animals` and `pg_catalog.pg_class` can coexist without collision. They live in different neighborhoods.
---
### `pg_proc` — The Recipe Cabinet
Functions and procedures are also rows. Every built-in function (`now()`, `count()`, `pg_relation_size()`) and every function you define lives in `pg_proc`:
```sql
SELECT proname, pronargs, prorettype::regtype
FROM pg_proc
WHERE proname = 'pg_relation_size';
```
This surprises people. They assume built-in functions are hard-coded C function calls. They are — but the *dispatch table* is relational. When the parser encounters `pg_relation_size('animals')`, it looks up the function by name in `pg_proc`, reads its argument types, return type, and implementation language (`prolangs`), and dispatches accordingly. User-defined SQL and PL/pgSQL functions use the same catalog row — the only difference is the `prolang` column.
---
### The Graph
Now step back and see the structure that has emerged:
```mermaid
erDiagram
pg_class ||--o{ pg_attribute : "has columns"
pg_attribute }o--|| pg_type : "has type"
pg_class }o--|| pg_namespace : "lives in schema"
pg_class ||--o{ pg_index : "has indexes"
pg_class }o--|| pg_class : "TOAST table"
```
Postgres did not build a separate metadata subsystem. It built a **relational model for itself**. The catalog tables reference each other through OID foreign keys, exactly as your application tables reference each other through integer foreign keys. The engine queries itself using the same planner, the same executor, and the same index infrastructure that it uses to answer your queries.
> [!IMPORTANT]
> **The Recursive Insight**: When Postgres needs to plan a query against `animals`, it first queries `pg_class` to find the table, `pg_attribute` to find the columns, and `pg_statistic` to read the distribution statistics. Those catalog queries are themselves planned and executed by the same engine. It is turtles — or rather, tuples — all the way down.
---
### The Punchline
When you type `\d animals` in `psql`, you might assume the client is calling a proprietary internal API. It is not. It is running SQL.
A simplified version of what `\d` actually executes:
```sql
-- 1. Find the table
SELECT c.relname, c.relkind, n.nspname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'animals';
-- 2. Get the columns
SELECT a.attname,
pg_catalog.format_type(a.atttypid, a.atttypmod) AS data_type,
a.attnotnull
FROM pg_attribute a
WHERE a.attrelid = 'animals'::regclass
AND a.attnum > 0
AND NOT a.attisdropped
ORDER BY a.attnum;
-- 3. Get the indexes
SELECT c2.relname AS index_name,
pg_get_indexdef(i.indexrelid) AS definition
FROM pg_index i
JOIN pg_class c2 ON c2.oid = i.indexrelid
WHERE i.indrelid = 'animals'::regclass;
```
There is no magic. Every `\d`, every `\dt`, every `\df` in `psql` is an ordinary SQL query against `pg_catalog`. The engine does not have a secret API for introspection — it has tables.
> [!TIP]
> **See the queries yourself.** Run `psql` with the `-E` flag (`psql -E`) and every backslash command will print the SQL it executes before showing the result. This is the single fastest way to learn the catalog schema.
---
### The Remaining Catalogs
Two catalogs that we will encounter in later chapters deserve a brief mention:
- **`pg_constraint`**: every `PRIMARY KEY`, `FOREIGN KEY`, `UNIQUE`, and `CHECK` constraint is a row. We will use this in **[[Manuscript/03 - Access Paths & Indexing/3.5 - Constraints & Triggers (The Integrity Layer and the Chain Reaction)|Chapter 3.5]]**.
- **`pg_authid`**: every role (user, group) in the cluster is a row. We will use this extensively in **[[Manuscript/09 - Identity & Access Control/9.1 - Roles & Privileges (The Name Tags)|Chapter 9]]**.
The pattern is always the same: if Postgres knows about something, there is a catalog table where that knowledge lives as a row.
---
## 2.2 - Tuple (The Suitcase)
<img src="assets/arch_tuple_suitcase.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
A **Tuple** is the smallest physical unit of storage in Postgres. When you insert a logical "row" via SQL, Postgres physically writes a tuple to the disk.
Most heap tuples are stored as fixed, immutable hunks of bytes. They contain your user data, plus a metadata header.
Think of it as a **Suitcase**. The engine packs your data into it, slaps a routing label on the outside, and stores it in the system.
### The Immutable Update
> [!IMPORTANT] Prediction Checkpoint
> When you update a row in Postgres, it is tempting to assume the engine modifies the existing bytes on disk to reflect the change. But if another transaction is concurrently reading that exact row, can you safely edit those bytes in place without forcing the reader to wait?
Postgres does not modify the data in place. It utilizes an **Append-Only** architecture. When you run an `UPDATE`:
1. The engine marks the original tuple as obsolete.
2. It writes a completely new tuple with the new data.
3. It appends the new tuple to a fresh location in the **Heap** (the main table storage).
Because the old tuple is still physically there, other active queries can continue reading it. They do not have to wait for your update to finish. This design is the foundation of **Multi-Version Concurrency Control (MVCC)**.
> [!NOTE]
> **In PostgreSQL Terms**
> * **Tuple**: The physical realization of a row.
> * **Heap**: The unordered physical storage area where tuples are appended.
> * **MVCC**: The visibility model that allows concurrent reading of old versions.
### Tuple Anatomy
Standard heap tuples follow a fixed physical layout. They are divided into two sections: the **Header** (metadata) and the **User Data** (your columns).
#### 1. The Header (The Metadata)
Standard heap tuples begin with a fixed-size header (**23 bytes**). This header contains the status bits and transaction IDs that allow Postgres to manage concurrency without locking.
#### 2. The User Data (The Payload)
This is where your actual data lives. It follows the header and is laid out in the order defined in your `CREATE TABLE` statement.
To understand how Postgres makes these decisions at microsecond scale, we must look at the physical anatomy of the bytes on the wire:
```text
+---------------------------------+ <-- Tuple Start
| HEADER (23 bytes) | (xmin, xmax, ctid, flags)
+---------------------------------+
| NULL BITMAP (Optional) | (Null Presence map)
+---------------------------------+
| [ ALIGNMENT PADDING ] | (Memory Alignment)
+---------------------------------+
| |
| USER DATA | (Payload)
| [Slot 1] [Slot 2] ... |
| |
+---------------------------------+ <-- Tuple End
```
### Lab: Inspecting the Physical Reality
To see the physical reality of a tuple, we can peek at the "system columns" that Postgres attaches to every row. Let's welcome **Cashew the Capybara** to the cafe:
```sql
INSERT INTO animals (name, species_id)
VALUES ('Cashew', (SELECT id FROM species WHERE name = 'Capybara'));
SELECT ctid, xmin, xmax, name
FROM animals
WHERE name = 'Cashew';
```
**STOP.**
Let's inspect a real tuple.
```text
ctid | xmin | xmax | name
---------+------+------+--------
(73,75) | 850 | 0 | Cashew
```
Look:
- **`xmin`** = creator (Transaction 850 created the row)
- **`xmax`** = nobody (0 means no transaction has deleted or updated it yet)
- **`ctid`** = address (Page 73, Slot 75 is where Cashew lives)
That's it. You can now read tuple headers.
> [!NOTE] The Click
> **Concept**: Rows aren't what Postgres stores. Tuples are.
> **Payoff**: Rows are what humans see; tuples are what the engine manipulates. That's why MVCC, vacuum, hint bits, and visibility all live on tuples—not rows.
By inspecting these headers, the engine determines precisely where the tuple is located and whether it is "visible" to your current transaction.
#### Bypassing the Ledger: Hint Bits & The Commit Log (CLOG)
To determine visibility, the engine must know if Transaction 850 actually committed. It does this by checking the **Commit Log (CLOG)**—a global transaction status map (stored in `pg_xact`). However, consulting a global table for every single tuple on a table scan would create a severe bottleneck.
To bypass this lookup, Postgres uses **Hint Bits**. The first time a query reads a page and checks a tuple's status in the CLOG, it stamps the status directly into the tuple's header flag (`t_infomask` bits: `COMMITTED` or `ABORTED`). Subsequent queries read these flags directly from the tuple, ignoring the CLOG entirely.
> [!NOTE]
> **The Select Write Tax**: Because setting hint bits requires writing to the page, even a read-only `SELECT` query can occasionally mark pages as "dirty," forcing the engine to write them back to disk later.
This allows Postgres to manage history without erasers, though it eventually requires a "Vacuum" to reclaim space from obsolete versions.
---
---
### The Tuple Lifecycle
Because a tuple is the physical realization of a row, its lifecycle is governed by two major architectural disciplines:
1. **[[Manuscript/02 - Physical Storage & MVCC/2.2.1 - Visibility & System Columns|Visibility & System Columns]]**: How the engine uses hidden headers (`xmin`, `xmax`, `ctid`) to decide which transactions can see which version of a row.
2. **[[Manuscript/02 - Physical Storage & MVCC/2.2.2 - Storage Optimizations|Storage Optimizations]]**: how the engine packs these tuples efficiently using Null Bitmaps, Alignment, and specialized techniques like **HOT** and **TOAST**.
---
## 2.2.1 - Visibility & System Columns
<img src="assets/arch_visibility_passport.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Every tuple in the **Heap** carries a set of hidden markers that determine if a specific query is allowed to see it. These markers are the engine's primary tool for managing concurrency without locking.
In the Elephant Cafe, we call these the **tuple attributes**. They tell the engine where the tuple came from, who owns it, and whether it has been logically retired.
### The System Columns
There are three critical "System Columns" embedded in every tuple's header. You cannot see them in a `SELECT *`, but they are always there, guiding the engine's visibility logic.
| Field | Size | Role |
| :--- | :--- | :--- |
| **`xmin`** | 4 bytes | The Transaction ID (XID) that **inserted** this tuple. |
| **`xmax`** | 4 bytes | The Transaction ID that **deleted** or updated this tuple. |
| **`ctid`** | 6 bytes | The physical location **(Page, Offset)** of this tuple. |
To truly understand visibility, you must see it in action. Let's look at the physical identity of Babu the Elephant. Even though he looks like a weightless row in your application, he has a very specific physical coordinate and history.
```sql
SELECT ctid, xmin, xmax, name
FROM animals
WHERE name = 'Babu';
```
| ctid | xmin | xmax | name |
| :---- | :--- | :--- | :--- |
| (0,1) | 501 | 0 | Babu |
#### Decoding the Attributes
- **`ctid (0,1)`**: This tuple lives on **Page 0**, at **Slot 1**. The `ctid` is the physical address the engine uses to jump directly to the data.
- **`xmin 501`**: This version of Babu was created by **Transaction 501**.
- **`xmax 0`**: The `xmax` is empty. This means no transaction has deleted or updated this tuple yet. It is currently "Alive."
### The Multi-Version Reality
Because Postgres uses **Append-Only** storage, an `UPDATE` is actually a two-step physical dance:
1. The engine sets the `xmax` of the old tuple to your current Transaction ID (marking it as dead).
2. The engine writes a brand-new tuple with a new `ctid` and sets its `xmin` to your current Transaction ID.
> [!IMPORTANT]
> **The Visibility Checkpoint**: Every query you run carries a **Snapshot**—a list of which transactions are finished and which are still in progress. Postgres compares your snapshot to the `xmin` and `xmax` of every tuple. If a tuple's `xmin` is from a finished transaction and its `xmax` is empty (or from a transaction that hasn't finished), you see it. Otherwise, it is invisible to you.
This simple integer comparison is what allows one patron to see the "old" price of Saffron while another patron is in the middle of updating it. No locks are required; the engine simply looks at the tuple attributes and decides what's real for you.
> [!NOTE] The Click
> **Concept**: A snapshot is not a copy of database state. It is just three numbers.
> **Payoff**: Because a snapshot is just boundaries (the oldest active XID, the next XID, and active XIDs in between), snapshot creation takes microseconds and occupies almost zero memory, regardless of how large the database is.
---
## 2.2.2 - Storage Optimizations
<img src="assets/arch_storage_opts.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Once you understand the anatomy of a tuple and its visibility attributes, the question becomes one of efficiency. How does Postgres pack these suitcases as tightly as possible, and what happens when the data is simply too large to fit?
### Efficiency: Nulls and Alignment
Postgres is designed to minimize storage overhead within each tuple. To do this, it uses two primary techniques: the **Null Bitmap** and **Memory Alignment**.
#### The Null Bitmap (Free Space)
If a column is NULL, Postgres doesn't waste any space in the User Data section of the tuple. Instead, it flips a single bit in a specialized map in the header called the **Null Bitmap**.
- **The Payoff**: NULLs are virtually free. A table with 100 columns, 90 of which are NULL, takes up almost the same space as a table with only 10 columns.
#### Memory Alignment (The 8-Byte Rhythm)
The CPU is most efficient when reading data that starts on a "Natural Boundary" (usually 8 bytes). If a column ends on an odd byte, Postgres adds **Padding Bytes** to ensure the next column starts at the correct interval.
- **The Optimization**: By grouping fixed-width columns of the same size together (e.g., placing all `bigint` and `timestamp` columns at the start of the table), you can reduce the amount of padding required, potentially saving megabytes of space on a large table.
---
### Advanced Optimizations
For the extreme edges of performance, Postgres employs two specialized techniques to handle updates and large data.
#### 1. Heap-Only Tuples (HOT)
High-frequency updates can lead to "Index Bloat" as the engine creates new physical versions of rows. To mitigate this, Postgres uses **HOT Updates** to link new versions together on the same page without updating every index. We will explore the mechanics and performance implications of this in **[[Manuscript/06 - Resource Management & Processes/6.5 - Tuple Bloat (Garbage Collection)|Chapter 6.5 - Tuple Bloat (Garbage Collection)]]**.
#### 2. TOAST (The Separate Trailer)
Postgres has a hard limit: a tuple cannot exceed the size of a single **8KB Page**. But what if you need to store a 10MB JSON document or a high-resolution olfactory map?
The engine uses **TOAST** (The Oversized-Attribute Storage Technique). If a value is too large (>2KB), Postgres "TOASTs" it:
1. The large value is compressed and moved to a separate, hidden "TOAST table."
2. The original tuple is left with an **18-byte "Claim Check"** (a pointer) in its place.
3. When you query the row, the engine automatically fetches the data from the trailer and presents it to you as if it were right there in the suitcase.
> [!TIP]
> This is why Postgres can handle massive blobs of data without slowing down sequential scans. The engine only "unpacks" the large TOAST data if you explicitly ask for that specific column.
---
## 2.3 - The Page (The Shipping Container)
<img src="assets/arch_page_container.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
If Postgres requested tuples individually from disk, the I/O overhead would be catastrophic. Physical disks excel at moving large contiguous blocks, but they struggle with tiny, random requests.
To solve this, the engine organizes storage into fixed-size **Pages** (or Blocks). By default, every page is **8KB**.
Think of it as a **Shipping Container**. The engine moves containers, not individual items. Postgres reads and writes these 8KB chunks regardless of whether they hold one record or a thousand.
The engine does not fetch individual rows from disk; it fundamentally fetches the 8KB page that contains the row.
> [!NOTE]
> **In PostgreSQL Terms**
> * **Tuple**: The physical row record.
> * **Page**: The 8KB physical block that holds multiple tuples.
> * **I/O Unit**: Postgres fundamentally reads and writes in discrete 8KB blocks.
> [!NOTE] The Click
> **Concept**: Postgres never reads a row from disk. It reads the page.
> **Payoff**: If you query a single row from a table with a cold cache, Postgres fetches the entire 8KB page containing that row. Any subsequent queries for other rows on that same page will be instant cache hits in memory.
> **The Takeaway**: Pages—not rows—are PostgreSQL's fundamental unit of disk and memory transfer.
### Why 8KB? The Goldilocks Compromise
Postgres chooses 8KB as the optimal balance for modern hardware:
1. **Hardware Alignment**: Most SSDs use 4KB physical sectors. An 8KB page fits cleanly into two sectors, avoiding slow and risky partial-sector writes.
2. **OS Harmony**: Linux manages memory in 4KB pages; an 8KB rhythm keeps the buffer pool and storage stack in lockstep.
3. **B-Tree Efficiency**: 8KB is large enough for high "fan-out," keeping search trees shallow—often just three or four hops to find a row in a billion-row table.
In short: most trips to the disk are sized to align with the hardware's optimal I/O unit.
The internal structure of a Postgres page is designed to maximize contiguous free space. This allows for efficient on-page defragmentation.
> [!IMPORTANT] Prediction Checkpoint
> If you delete a few rows from the middle of an 8KB page, what does Postgres do to the physical space? Does it immediately slide the remaining rows together to keep the free space contiguous, or does it leave them fragmented? Consider the CPU cost of moving memory around on every deletion vs. the risk of leaving unusable gaps.
The page uses a dual-directional growth strategy. **Item Identifiers** grow from the front of the page toward the back, while **Tuples** are written from the back toward the front. They grow toward each other into a shared gap of **Free Space**.
When rows are deleted, Postgres does not immediately shift the remaining rows to compact the space—that would waste CPU cycles on every write. Instead, it lazily defragments the page only when a new tuple needs a contiguous slot that is larger than the current free space. By maintaining a single, contiguous gap in the middle, Postgres can easily accommodate new records while optimizing for memory alignment.
### The Schematic
```text
+---------------------------------+ <-- Page Start (Offset 0)
| PageHeaderData (24 bytes) | (Control metadata)
+---------------------------------+ <-- pd_lower
| Item Identifiers (4 bytes ea)| (Line pointers)
| [1] [2] [3] [4] ... | (Grow DOWN ↓)
+---------------------------------+
| |
| FREE SPACE | (Contiguous space)
| |
+---------------------------------+ <-- pd_upper
| |
| TUPLES (Data) | (Physical records)
| ... [4] [3] [2] [1] | (Grow UP ↑)
| |
+---------------------------------+
| Special Space (Optional) | (Index internal pointers)
+---------------------------------+ <-- Page End (8,192 bytes)
```
### The Parts of the Page
1. **The Page Header (`PageHeaderData`):** At the beginning of a standard page is a fixed 24-byte header. It records the page version, a checksum, and the memory offsets (`pd_lower` and `pd_upper`) that define the boundaries of the free space.
The header also contains a marker tying the physical page to the database's recovery history. This helps the engine determine precisely which WAL records have already been applied to this specific page.
2. **The Item Identifiers (`ItemIdData`):** These act as an indirection layer. Each 4-byte identifier points to the physical offset of a tuple within the page. As new tuples are added, identifiers grow forward, moving the `pd_lower` pointer deeper into the page.
> [!NOTE]
> **Why the Indirection?**: This layer of indirection is vital for performance. It allows Postgres to "defragment" the page (moving tuples around to reclaim space) without changing the tuple's logical ID (the Item Identifier index). The outside world (indexes and table scans) continues to reference the Item ID, oblivious to the tuple's physical migration within the page.
3. **The Tuples:** The actual data records are persisted starting from the end of the page and moving toward the front. As new tuples arrive, they are placed in the next available space, shifting the `pd_upper` pointer backward.
4. **The Special Space:** At the end of the page is an optional area used primarily by **Index Pages** to store metadata like sibling pointers (to the left and right pages in a B-Tree), enabling fast transversal without returning to the index root.
5. **The Free Space:** This is the contiguous gap between `pd_lower` and `pd_upper`. When these two pointers meet, the page is considered full. No further tuples can be inserted until the **Autovacuum** worker reclaims space by removing tuples marked as dead.
**The rule of thumb:** The Item Identifiers (pointers) and the Tuples (data) grow toward each other from opposite ends of the page. The page is "full" when they finally meet in the middle.
### Page Checksums
To detect silent data corruption — bit-rot in the storage layer, incomplete writes, cosmic rays — Postgres can maintain a checksum on its pages. When a page is written to disk, the engine computes a CRC over its contents and stores it in `pd_checksum`. When the page is read back, the engine recomputes the checksum and compares. A mismatch raises a data corruption error rather than silently returning garbage to the user.
If the power dies while the OS is in the middle of writing an 8KB page, the result is a **Torn Page**. This occurs because the underlying hardware often writes in smaller sectors (e.g., 512 bytes or 4KB). A crash leaves the page in a physically corrupted state, half-new and half-old.
Postgres prevents this with **Full Page Writes (FPW)**. The first time a page is modified after a **[[Manuscript/05 - Durability & Transactions/5.2 - Crash Recovery (The Recovery Parade)|Checkpoint]]**, Postgres copies the *entire 8KB block* into the WAL.
> [!TIP]
> While FPW prevents corruption, it causes significant **Write Amplification**. A 1-byte change to a tuple can trigger an 8KB write to the WAL. Tuning the `checkpoint_timeout` is the primary way to manage this overhead.
### Inspecting a Page (`pageinspect`)
You can peek at the actual physical layout of a page using the `pageinspect` extension:
```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM page_header(get_raw_page('animals', 0));
-- lsn | checksum | flags | lower | upper | special | pagesize
-- ------------+----------+-------+-------+-------+---------+----------
-- 0/16A5E88 | 0 | 0 | 28 | 8160 | 8192 | 8192
```
The output exposes the `lsn` (the last WAL record that touched this page), the `pagesize` (typically 8192 bytes), and the `lower`/`upper` pointers that bound the free space. This is the architecture of the page made visible from a SQL session.
---
## 2.4 - Relation (The Table)
<img src="assets/arch_table_depot.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
While the **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Page]]** is Postgres's physical I/O unit, the **Table** (or Relation) provides the logical organization and schema mapping.
### The Heap: A Study in Disorder
In the Elephant Cafe, a table is not an ordered list. It is a **Heap**—a physical file where data is stored with no inherent logical order. When you insert a new animal, Postgres does not try to find "the right spot" alphabetically; it simply looks for the first available gap in its shipping containers and tosses the tuple inside.
This "toss it in" approach is why writing to a table is so fast, but it’s also why searching it is so slow. Because tuples are scattered based on arrival time and available space (managed by the **Free Space Map**), the engine can never "guess" where a specific record lives.
### The Logical Blueprint
A table is defined by its **Schema**. This is the architectural blueprint that specifies which types of data are permitted in each column. Every tuple that enters the table must strictly adhere to this blueprint, or Postgres will reject it at the gate.
This structure is what allows Postgres to be efficient. Because the database knows the exact size and type of every column, it can calculate physical offsets with mathematical precision, jumping directly to the data it needs without having to "read" every byte in between.
```sql
-- Defining the schema for the species table
CREATE TABLE species (
id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT UNIQUE NOT NULL,
diet_type diet_category NOT NULL
);
```
Once the schema is defined, Postgres allocates the initial data files required to persist incoming 8KB pages onto the disk.
### Physical Segmentation (Segments)
A table is physically represented as a series of 8KB pages stored in a file. However, most modern filesystems get grumpy at extreme file sizes. To manage this, Postgres imposes a strict **1GB threshold** for any single physical file.
If a table grows beyond 1GB, Postgres simply starts a new **Segment** file directly next door (e.g., `16385`, `16385.1`, `16385.2`). To the end-user, the table is a single logical entity; at the storage layer, it is a multi-file sequence of 1GB segments.
You can inspect the literal file path of a relation on disk:
```sql
-- Where is the 'animals' relation located on my disk?
SELECT pg_relation_filepath('animals');
-- Output: base/13593/16385
```
### Table Forks: The Side Rooms
A relation is not merely a single list of data pages. For performance and management, Postgres separates different functional requirements into distinct files called **Forks**, or specialized **Side Rooms** in the table structure.
The primary data is stored in the **Main Fork**. However, Postgres maintains supplementary structures to optimize access:
- **Free Space Map (`_fsm`)**: A quick sketch showing which pages have empty space, allowing Postgres to quickly locate a target for new tuples.
- **Visibility Map (`_vm`)**: A bitmask indicating which pages hold ONLY fully alive, completely visible records. This is critical for **Index-Only Scans**; if Postgres knows the whole page is safe, it doesn't even have to read the table data to check visibility rules!
> [!TIP]
> **Segments vs. Forks**: Segments are continuations of the same file sequence (more space), while Forks are entirely different files serving specific functional requirements (like visibility tracking).
### Access Patterns: The Sequential Scan
In the absence of an index, Postgres must perform a **Sequential Scan** (Full Table Scan). It begins reading at the first page of the relation and proceeds linearly through every segment until the predicate is satisfied or the end of the file is reached.
For large relations, this results in significant I/O latency and CPU overhead as every page must be loaded into memory and processed.
### Storage Reclamation (The Autovacuum)
Because data is abandoned rather than erased, tables naturally accumulate "dead" tuples over time—a state known as **Bloat**.
To reclaim this space, the **Autovacuum** background process scans pages to identify dead tuples and mark their space as reusable in the **Free Space Map**. This prevents the physical file size from growing indefinitely.
---
## 2.5 - MVCC (The Sharpie Ledger)
<img src="assets/arch_mvcc_sharpie_v3.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Postgres utilizes **MVCC** (Multi-Version Concurrency Control). When a record is updated, the engine does not overwrite existing data. Instead, it marks the original version as obsolete and appends a new version to the heap.
Think of it as a **Sharpie Ledger**: changes are never erased, only crossed out. Postgres enforces this using two transaction-ID markers in the tuple header: `xmin` (the creator) and `xmax` (the eraser).
By appending new versions instead of updating in-place, the engine sidesteps the need for heavy table-level locks. This ensures that readers never block writers, and writers never block readers.
> [!TIP]
> **The Key Benefit**: In Postgres, **readers never block writers, and writers never block readers.** A reader holds onto its consistent snapshot of the old version while a writer is busy committing a new one.
> [!NOTE]
> **In PostgreSQL Terms**
> * **Tuple**: The physical row version.
> * **Heap**: The unordered physical table storage.
> * **MVCC**: Multi-Version Concurrency Control (the visibility model).
---
### The Mechanics of Versioning
When you execute an `UPDATE`, Postgres performs a three-step logical swap:
1. **Expire the current version**: The physical **Tuple** remains in place, but its header is modified. Postgres stamps `xmax` with the current Transaction ID. This signals that future transactions should treat the row as deleted.
2. **Write the new version**: A fresh tuple is appended to the next available space in the heap. Its `xmin` is set to the current Transaction ID.
3. **Physical Linkage**: The `ctid` (physical address) of the old tuple is updated to point to the new one. This creates a version chain that the engine can walk to find the latest record.
```mermaid
graph LR
Tuple1["Tuple 1 (Old version)<br>ctid: (73,73)<br>xmin: 846<br>xmax: 847"] -->|t_ctid pointer| Tuple2["Tuple 2 (New version)<br>ctid: (73,74)<br>xmin: 847<br>xmax: 0"]
```
```mermaid
sequenceDiagram
participant W as Transaction 847<br/>(Writer)
participant DB as Postgres Heap
participant R as Transaction 848<br/>(Reader)
Note over DB: "[Tuple 1] xmin: 800, xmax: 0"
W->>DB: UPDATE animals SET name = 'Gilly'
Note over DB: "[Tuple 1] xmax: 847 (Marked)"
Note over DB: "[Tuple 2] xmin: 847 (New)"
R->>DB: SELECT * FROM animals
Note right of R: Snapshot: [Active: 847]
DB-->>R: Returns Tuple 1
Note right of R: Visibility: xmax (847) is in Active Set<br/>So Tuple 1 is still visible.
W->>DB: COMMIT
Note over R: Snapshot persists.<br/>Further reads still see Tuple 1.
```
Concurrent transactions see whichever version of the row matches their own visibility snapshot. They skip anything with an `xmin` greater than their own ID or an `xmax` that has already committed.
---
### 🧪 Lab Challenge: The Ghost in the Machine (Update Mechanics)
**The Request**: "Prove that Postgres doesn't update rows in place. Show me the 'Ghost' of the old version."
#### The Investigation
First, find a specific animal and note its physical address (`ctid`) and creator (`xmin`).
```sql
SELECT ctid, xmin, name FROM animals WHERE name = 'Glowy';
```
**Result**:
```text
ctid | xmin | name
---------+------+-------
(73,73) | 846 | Glowy
```
Now, perform an `UPDATE` and check the same columns:
```sql
UPDATE animals SET name = 'Glowing Gilly' WHERE name = 'Glowy';
SELECT ctid, xmin, name FROM animals WHERE name = 'Glowing Gilly';
```
**Result**:
```text
ctid | xmin | name
---------+------+---------------
(73,74) | 847 | Glowing Gilly
```
#### The Diagnosis
The `ctid` moved from `(73,73)` to `(73,74)`. The engine didn't change the data at the old address; it wrote a **brand new tuple** at a new address.
#### The Reward
To see the "Ghost" left behind, we can peek at the hidden `xmax` of the old record (requires a specific query to bypass the normal visibility filters):
```sql
-- Peeking at the expired version
SELECT ctid, xmin, xmax, name
FROM animals
WHERE ctid = '(73,73)';
```
**Result**:
```text
ctid | xmin | xmax | name
---------+------+------+-------
(73,73) | 846 | 847 | Glowy
```
The old version still exists on disk! Its `xmax` is now set to **847** (the ID of your update transaction). To your transaction, this row is a "Ghost"—it occupies physical space but is logically dead.
> [!NOTE]
> **Recap**: `UPDATE` = `INSERT` + `DELETE`. Every update creates a new physical record. The old one remains as "Bloat" until the **Autovacuum** reclaims the space.
---
### The Ghost of a ROLLBACK
Because Postgres is append-only, it writes the physical tuple to disk *before* the transaction officially commits. If a transaction creates a new row and then encounters a `ROLLBACK`, that physical data still lives in the heap.
| ctid | xmin | xmax | id | animal_id | status | Visibility |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **(0,3)** | **102** | 0 | 2 | 8 | Salad | **Invisible (XID 102 aborted)** |
No future transaction will ever see this record—the engine's visibility map knows that XID 102 aborted—but the bytes still occupy physical space in the 8KB page. These dead records are known as **Bloat**.
This design prioritizes write speed today by deferring the cost of cleanup. However, it leaves a performance debt that must be settled later. We cover the process of reclaiming this space in detail in **[[Manuscript/06 - Resource Management & Processes/6.4 - Vacuum & Freezing (The Housekeepers)|Chapter 6]]**.
---
---
## 2.6 - TOAST (The Separate Trailer)
<img src="assets/toast_dinosaur.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Postgres is stubborn about its 8KB page limit. However, it must also handle large data types like 10MB JSON blobs. Instead of breaking the laws of storage physics, the engine delegates the problem.
That delegation is called **TOAST** (*The Oversized-Attribute Storage Technique*). Think of it as a **Separate Trailer** parked behind the table. When an attribute is too large for the 8KB main page, Postgres compresses it and moves it out of the main heap. It leaves behind an 18-byte pointer—a "claim check"—in the original tuple.
This design keeps the main heap compact and fast to scan. The trade-off is that retrieving the oversized column requires a second lookup.
When an attribute exceeds roughly **2KB**, the engine works through a delegation strategy:
1. **Compression**: Postgres first tries to compress the value using a built-in algorithm (PGLZ or LZ4). If the compressed value fits in the page, it stays inline.
2. **External Storage**: if compression isn't enough, the value is split into chunks and stored in a dedicated *TOAST table* associated with the parent table.
3. **Indirection**: the original tuple is updated with an external pointer that references the OID of the TOAST table and the chunk identifiers.
When you query the column, Postgres reassembles the chunks transparently. You see one logical record, while the engine fetches the chunks from external storage and reassembles them in memory.
### The Four Storage Strategies
Each column declares a TOAST strategy that controls how the engine handles oversized values:
| Strategy | Behavior |
| :----------- | :------------------------------------------------------------------------------------------------ |
| **PLAIN** | TOAST disabled entirely. If the row doesn't fit in 8KB, Postgres throws an error. |
| **EXTENDED** | Attempts compression first; falls back to external storage if still too large. **Default** for variable-length types. |
| **EXTERNAL** | Skips compression entirely and moves the value to external storage. Useful when the data is already compressed (e.g. JPEG, gzipped JSON). |
| **MAIN** | Attempts compression but keeps the value inline as long as it possibly fits. |
You can inspect (or override) the strategy per column with `ALTER TABLE ... SET STORAGE`.
### 🧪 Lab Challenge: TOASTing a Large Record
**The Request**: "Store a massive amount of text in a single column. Prove that Postgres moves it to external storage when it gets too large."
#### The Investigation
First, let's create a scratch table for our test:
```sql
CREATE TABLE toast_test (
biography TEXT
);
```
We can use `pg_column_size()` to see the physical size of a value as it is stored in the tuple.
```sql
-- 1. A small name fits comfortably inline
SELECT pg_column_size('Cashew'::text) AS size;
-- Result: 7 bytes
```
Now, let's create a massive biography. By default, Postgres uses the **EXTENDED** strategy, which attempts to compress the data first.
```sql
-- 2. A 5,000 character biography is highly compressible
SELECT pg_column_size(repeat('A', 5000)) AS size;
-- Result: 69 bytes
```
#### The Diagnosis
Wait—why is the 5,000-character string only 69 bytes? Postgres used its default compression (PGLZ). Because the compressed result (69 bytes) fits easily in the 8KB page, it stays **Inline**.
To force the data into **external storage**, we must either exceed the compression limit or disable compression entirely.
#### The Lazy Fix
Disable compression for the column to force external storage:
```sql
ALTER TABLE toast_test ALTER COLUMN biography SET STORAGE EXTERNAL;
INSERT INTO toast_test (biography) VALUES (repeat('B', 5000));
-- Check the size now
SELECT pg_column_size(biography) AS size FROM toast_test;
```
**Result**:
```text
size
-------
5000
```
#### The Reward
At 5,000 uncompressed bytes, this attribute has outgrown the ~2KB threshold and is moved to a dedicated TOAST table in the `pg_toast` schema, leaving only an 18-byte pointer in the main table.
> [!WARNING]
> **The `SELECT *` Penalty**: Out-of-line TOAST values must be reassembled. A `SELECT *` on a table with large text or JSONB fields forces additional reads and allocations even if you never use the values.
---
## 2.7 - Summary (Physical Storage & MVCC)
> The physical engine forbids erasers. Immutability guarantees that readers never block writers. To change a record is to abandon it and write a new version, leaving the dead behind for the vacuum to quietly sweep away.
<div style="page-break-after: always;"></div>