# Chapter 4: Query Planning & Execution
## 4.0 - Query Planning & Operations (The Strategy of Execution)
<img src="assets/arch_chap_3_lunch_rush_chef.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
When a query arrives, Postgres transforms from a storage engine into a processing pipeline. The engine translates parsed SQL into a physical execution plan. This plan is a sequence of low-level operators—scans, joins, and aggregations—that retrieve the requested data.
The transition from declarative intent to imperative execution is the most critical phase of query performance. A well-chosen plan minimizes I/O and CPU usage.
### What You'll Learn
- Why the **Plan Search Space** explodes combinatorially with query complexity
- How the **Query Optimizer** evaluates join orderings, scan types, and index choices
- The difference between a declarative SQL request and its imperative execution plan
- Why sub-optimal plan selection is the single most common source of production-query slowness
### The Chapter Map
Because this is the most complex machine in the engine, we will approach it in three phases:
1. **Phase 1: The Decision (4.1–4.2)** — How the Planner calculates cost and builds an algebraic execution tree ("How Postgres decides").
2. **Phase 2: The Operators (4.3–4.6)** — The physical mechanics of Scans, Joins, Aggregations, and Memory operations ("What Postgres does").
3. **Phase 3: Special Cases (4.7–4.10)** — Mutation paths, Parallelism, CTEs, and the art of Sargability ("How Postgres handles complexity").
In **[[Manuscript/03 - Access Paths & Indexing/3.0 - Indexes (The Mighty Indexes)|Chapter 3]]**, we built a library of access paths. Now, the engine must choose between multiple execution strategies. Having a hundred shortcuts is useless if you don't know which one to take.
### The Declarative Riddle
SQL is a **Declarative** language. Unlike imperative programming, the client provides an **Intent** rather than a recipe.
When a user submits a complex query—like finding the names of all animals who are waiting for a 'Pending' order—they are stating a **"What"**:
```sql
SELECT DISTINCT a.name
FROM animals a
JOIN orders o ON a.id = o.animal_id
JOIN animal_favorites af ON a.id = af.animal_id
JOIN dish_ingredients di ON af.ingredient_id = di.ingredient_id
JOIN dishes d ON di.dish_id = d.id
WHERE d.name = 'Capybara''s Delight'
AND o.status = 'Pending';
```
The user does not say, "Consult the index for 'Capybara's Delight' or 'Join these tables first'." That would be **Imperative** instruction. Because the user only specifies the result, the engine must navigate a complex search space of execution plans to fulfill the request.
### The Plan Search Space
As queries grow more complex, the number of ways to fulfill the order explodes. For the five-table join above, the engine must solve several crucial decisions:
1. **The Entry Point (Selectivity)**: Do we start with the singular "Capybara's Delight" dish (high selectivity) or the 'Pending' orders (potentially millions)?
2. **The Join Order**: In a 5-table join, there are **120 possible permutations** ($5!$) of join orders. Some might take 1 millisecond; others might take 1 hour.
3. **The Join Type**: Should we use a **Hash Join** (building an in-memory hash table), a **Merge Join** (sorting both lists first), or a **Nested Loop** (walking one list for every item in the other)?
In a high-concurrency environment, a sub-optimal plan can lead to a system-wide stall. A bad join order can turn near-linear work into an $O(N \times M)$ blowup—the difference between a 1-millisecond response and a 1-hour outage.
This is why Postgres employs the **Query Optimizer (The Planner)**. It analyzes the declarative "What" of the request and calculates the most efficient imperative "How." Its goal is to find the lowest-cost execution path through the **Plan Search Space**.
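To feel the combinatorics, here is a quick sketch in Python. It counts join-order permutations only; the true search space is larger still, since every ordering also multiplies by scan-type and join-type choices.

```python
import math

# Join-order permutations alone grow factorially with the number of tables.
orderings = {n: math.factorial(n) for n in (3, 5, 8, 12)}
print(orderings)  # {3: 6, 5: 120, 8: 40320, 12: 479001600}
```

At twelve tables the planner cannot enumerate everything, which is why Postgres switches to heuristic search (the Genetic Query Optimizer) beyond a configurable join threshold.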
> [!TIP]
> The **`EXPLAIN`** command is your primary window into the planner's mind. It reveals the chosen plan nodes, estimated costs, and row counts before the query is actually executed.
---
## 4.1 - Query Planner (The Blueprint of Execution)
<img src="assets/arch_planning_ledger.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Before execution begins, the **Query Planner** selects the most efficient physical path for the query. Whether it chooses a **[[Operations/Table/SeqScan|Sequential Scan]]** or an **[[Manuscript/03 - Access Paths & Indexing/3.1 - B-Tree (The Balanced Bookshelf)|Index Scan]]**, the planner relies on an internal **Cost Model** grounded in the physical reality of system resources.
The planner evaluates potential plans based on their total estimated cost—a dimensionless value representing the resource consumption required to retrieve the data.
### The Hierarchy of Slowness
The planner optimizes for execution time by minimizing I/O and CPU cycles. To understand its decisions, you must understand the performance characteristics of modern hardware.
If we scale a single **L1 Cache Reference (1.5 nanoseconds)** to **1.5 seconds**, the physical hierarchy looks like this:
| Action | Latency (Physics) | Human Scale (1ns = 1s) | Planner Cost |
| :--- | :--- | :--- | :--- |
| **L1 Cache Reference** | 1.5 ns | 1.5 Seconds | `0.01` (cpu_tuple_cost) |
| **Branch Mispredict** | 5 ns | 5 Seconds | — |
| **L2 Cache Reference** | 7 ns | 7 Seconds | — |
| **Main Memory Reference** | 100 ns | **1.7 Minutes** | `1.0` (seq_page_cost)* |
| **SSD Random Read** | 150,000 ns | **1.7 Days** | `2.1` (Tuned SSD) |
| **Disk Seek (HDD)** | 10,000,000 ns | **116 Days** | `4.0` (Default) |
| **WAN (CA to Netherlands)** | 150,000,000 ns | **4.8 Years** | — |
*\*Note: In Postgres, `seq_page_cost` includes the memory trip plus the CPU overhead of requesting the page from the OS.*
### The Planner's Cost Model
In the physical world, a disk seek is significantly slower than a CPU cycle. The Planner’s **Cost Model** uses several tunable constants to estimate query overhead:
- **`cpu_tuple_cost` (0.01)**: The cost of processing one row in memory.
- **`seq_page_cost` (1.0)**: The cost of reading an adjacent page from disk.
- **`random_page_cost` (4.0)**: The cost of a non-sequential seek.
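These constants make the trade-off easy to sketch by hand. The Python toy below mirrors the spirit of the cost model; it is deliberately simplified (real costing also folds in `cpu_operator_cost`, `cpu_index_tuple_cost`, correlation, and caching effects), and the table sizes are invented.

```python
# Simplified cost arithmetic in the spirit of the Postgres cost model.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01

def seq_scan_cost(pages, rows):
    # Read every page sequentially, process every row.
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(matching_rows, pages_touched):
    # Pessimistic toy: each matching row may trigger a random page fetch.
    return pages_touched * RANDOM_PAGE_COST + matching_rows * CPU_TUPLE_COST

# A hypothetical 1,000-page table holding 100,000 rows:
full = seq_scan_cost(1_000, 100_000)      # ~2000: the baseline walk
few  = index_scan_cost(50, 50)            # ~200: index wins for 50 rows
many = index_scan_cost(30_000, 30_000)    # ~120300: seq scan wins for 30k rows
print(round(full), round(few, 1), round(many))
```

The crossover is the whole game: below some fraction of the table, random I/O pays for itself; above it, the linear walk is cheaper.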
> [!TIP]
> **Tuning for the Modern Era**: The default `4.0` for random reads was designed in the era of spinning rust. On modern NVMe SSDs, there is almost no seek penalty. Setting `random_page_cost` to **2.1** (or even **2.0**) tells the engine that it no longer needs to live in fear of the disk, often unlocking much more aggressive indexing strategies.
The planner relies on table statistics to generate accurate cost estimates. If these statistics are outdated, the engine might select a sub-optimal access method—such as scanning a large table to retrieve a single row.
### Table Statistics (`ANALYZE`)
Postgres periodically gathers statistics about the data in each table. You can trigger this process manually to ensure the planner has up-to-date information:
```sql
-- Updating the planner's statistics
ANALYZE animals;
-- Inspecting Selectivity and Correlation
SELECT n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'animals' AND attname = 'species_id';
```
### The Math of Selectivity ($s$)
The planner uses these statistics to calculate **Selectivity ($s$)**—the fraction of rows expected to pass a filter ($0 \le s \le 1$).
- **Low Selectivity ($s \approx 0.9$)**: If Postgres knows 90% of your animals are 'Capybaras', the plan for `WHERE species_id = 5` will likely be a **Sequential Scan**. Why? Because reading nearly every page sequentially is cheaper than jumping back and forth with an index (Random I/O).
- **High Selectivity ($s \approx 0.001$)**: A search for a rare species results in a tiny **Cardinality**. For a small number of rows, the Index Scan's random I/O penalty is worth the shortcut.
> [!NOTE]
> **A note on terminology**: some texts flip "high" and "low" selectivity. We use the convention that *high selectivity* means the predicate selects a small fraction of rows (a highly selective filter is a strict one); *low selectivity* means it lets most rows through.
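Here is a sketch of how those statistics become a row estimate. The numbers below are hypothetical stats shaped like `pg_stats` for `animals.species_id`; the fallback branch assumes roughly 50 remaining distinct values, an invention for illustration only.

```python
# How the planner might turn statistics into a row estimate (toy version).
reltuples = 100_000                      # estimated table size
most_common_vals  = [5, 2, 9]            # hypothetical MCV list
most_common_freqs = [0.90, 0.04, 0.03]   # fraction of rows per MCV

def selectivity_eq(value):
    """Selectivity of `column = value`: use the MCV list if possible,
    otherwise spread the leftover frequency over the remaining values."""
    if value in most_common_vals:
        return most_common_freqs[most_common_vals.index(value)]
    return (1 - sum(most_common_freqs)) / 50  # assumed 50 other distinct values

dominant = round(reltuples * selectivity_eq(5))   # the 'Capybara' case
rare     = round(reltuples * selectivity_eq(42))  # an uncommon species
print(dominant, rare)  # 90000 60
```

A 90,000-row estimate pushes the planner toward a Sequential Scan; a 60-row estimate makes the index's random I/O a bargain.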
### Physical Correlation
The most subtle variable in the planner's ledger is **Correlation**. This represents the relationship between the physical storage order on disk and the logical value of the column. Postgres records it in `pg_stats.correlation` as a value between $-1.0$ and $+1.0$.
> [!IMPORTANT]
> **The Index Decision**: If data is physically ordered by the index key (correlation $\approx +1.0$), an Index Scan is fast. Consecutive index entries point to physically adjacent heap pages. If the correlation is near $0.0$, an Index Scan becomes a high-latency random I/O operation. In this case, the planner may prefer a Sequential Scan even for a small number of rows.
For the deep mathematical specs on these cost constants, see **[[Manuscript/06 - Resource Management & Processes/6.0 - Memory & Disk (The Hierarchy of Inertia)|Chapter 6 - The Hunger of Resources]]**.
---
## 4.2 - Query Algebra (The Execution Tree)
<img src="assets/arch_plan_algebra_summary.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
The Query Optimizer has finalized the plan—it has performed its economic calculus and selected the cheapest path. But a cost estimate is not a result set. To turn a blueprint into reality, Postgres must translate those mathematical expectations into physical execution.
This translation is handled by the **Query Algebra**: a tree of highly specialized **Execution Nodes** (Plan Nodes). This is the execution tree where raw heap tuples are fetched, joined, and aggregated in a strictly coordinated sequence.
### The Demand-Driven Iterator Model (The Pull Model)
To understand the Execution Engine, you must understand the **Demand-Driven Iterator Model**. Every node in the tree implements a standard interface with a single critical method: `ExecProcNode()` (commonly thought of as `GetNext()`).
The parent node (the root) calls this method on its child. That child, in turn, calls it on its own children. This demand-driven "Pull" mechanism ensures that no record is processed until it is requested by the node above it.
> [!IMPORTANT]
> **The Execution Checkpoint**: If you remember one thing about the engine, let it be this: **Nothing happens unless it is asked for.** In a "Push" model, the bottom nodes would furiously fetch every row and shove them upward. In Postgres, the top of the tree must "request" a tuple before a single byte moves at the bottom.
This architectural laziness is most visible when you request only a few results via a **[[Operations/ResultSet/Limit|LIMIT]]**. The top-level node simply stops requesting tuples. That "stop" signal propagates down the tree instantly; the nodes below release their resources and terminate their scans, regardless of how much data remains in the table.
```mermaid
graph TD
subgraph "Execution Pipeline (The Pull Model)"
A[Limit] -->|1. ExecProcNode| B(Aggregate)
B -->|2. ExecProcNode| C{Hash Join}
C -->|3. ExecProcNode| D[Seq Scan: animals]
D -.->|4. Tuple| C
C -.->|5. Tuple| B
B -.->|6. Tuple| A
A -.->|7. STOP| B
end
classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px;
classDef scan fill:#e1f5fe,stroke:#0288d1;
class D scan;
linkStyle 0,1,2 stroke:#2962ff,stroke-width:2px;
linkStyle 3,4,5 stroke:#4caf50,stroke-width:2px;
linkStyle 6 stroke:#f44336,stroke-width:2px,stroke-dasharray: 5 5;
```
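The diagram above can be mimicked with Python generators, which are themselves demand-driven iterators. This is a toy sketch, not the actual executor, but the laziness is identical: once the `LIMIT` node stops pulling, the scan never touches the rest of the table.

```python
# A toy pull-model pipeline: each node hands over one tuple only when
# the node above asks for it.
def seq_scan(table, fetched):
    for row in table:
        fetched.append(row)            # record how far the scan actually got
        yield row

def filter_node(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def limit_node(child, n):
    for _ in range(n):                 # stop pulling once n tuples are served
        try:
            yield next(child)
        except StopIteration:
            return

table = list(range(1_000_000))
fetched = []
plan = limit_node(filter_node(seq_scan(table, fetched), lambda r: r % 2 == 0), 3)
result = list(plan)
print(result, len(fetched))  # [0, 2, 4] 5 -- the scan touched 5 of 1,000,000 rows
```

Nothing happened unless it was asked for: the scan produced exactly the five rows the upper nodes demanded.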
The execution tree is composed of three primary node categories:
1. **[[Manuscript/04 - Query Planning & Execution/4.3 - Scans (The Full Table Walk)|Scans]]**: The source operators that fetch tuples from the tables.
2. **[[Manuscript/04 - Query Planning & Execution/4.4 - Joins (The Pairing Dance)|Joins]]**: The operators that cross-reference data between different tables.
3. **[[Manuscript/04 - Query Planning & Execution/4.5 - Aggregations (The Running Receipt)|Aggregations]]**: The operators that sort, group, and summarize the results.
### Performance Economics
Every execution node requires resources. In **[[Manuscript/06 - Resource Management & Processes/6.0 - Memory & Disk (The Hierarchy of Inertia)|Chapter 6]]**, we will explore how nodes are allocated a small **Working Memory** (`work_mem`). If this memory limit is exceeded, a "Spill to Disk" occurs—turning a fast in-memory service into a slow I/O ordeal.
> [!TIP]
> **The Tuple Table Slot**: Moving data between millions of node calls is expensive. To avoid unnecessary memory copying, Postgres uses a **Tuple Table Slot**: a unified memory structure that carries a single tuple between nodes, allowing different operations to examine the same data without relocating it in memory.
Furthermore, even the best-organized plan runs into concurrency bottlenecks. In **[[Manuscript/07 - Wait Events & Concurrency/7.0 - Why Slow Queries Lie (The Waiting Game)|Chapter 7]]**, we will observe nodes competing for the same resources (Locks) or waiting for data to return from the storage layer (I/O Wait).
By using the **[[Operations/_Operations|EXPLAIN ANALYZE]]** command, the database produces a **Service Receipt** detailing exactly how many tuples were processed and how much "sweat" (CPU/IO) was consumed by every node in the plan.
---
## 4.3 - Scans (The Full Table Walk)
<img src="assets/ex_scan_raccoon_flashlight.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Before processing can occur, Postgres must fetch raw data from physical storage. This is the responsibility of the **Scan Nodes**.
With the planner's decision framework established, we can now examine the individual operators it chooses from. We start with the most fundamental: how does Postgres physically retrieve rows from a table?
**Scan Nodes** are the only operators that interact directly with the disk or **[[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Page Cache)|Shared Buffers]]**. They locate tuples and stream them upward into the execution plan. The choice of scan node determines the I/O pattern and overall retrieval efficiency.
### The Universal Scan Interface (`ExecScan`)
Despite the difference between walking a table heap and traversing a B-Tree, every Scan Node in Postgres shares a common internal blueprint: **`ExecScan`**.
This standardized interface allows the Query Algebra to operate consistently regardless of the underlying access method. This abstraction is extended by the **Table Access Method (Table AM)** API, which allows the storage layer to be pluggable.
### Sequential Scan: Linear Page Access
The **[[Operations/Table/SeqScan|Sequential Scan]]** is the base case of data retrieval. Postgres performs a linear walk of the entire table heap, page by page with read-ahead, guaranteeing that every qualifying tuple is visited.
> [!TIP]
> **Synchronized Sequential Scans**: If multiple queries perform a sequential scan on the same large table, Postgres avoids redundant I/O. A new query can share the scan state of an existing one, looping back to the beginning once the end of the table is reached.
### Index Scan: Non-Linear Point Access
The **[[Operations/Index/IndexScan|Index Scan]]** utilizes the maps created in Chapter 3 to bypass the linear scan. Postgres traverses the B-Tree to find the exact coordinates of the required tuples and performs a direct, non-linear fetch from the table heap.
### 🧪 Lab Challenge: The Missing Foreign Key Index
**The Request**: "Count every order that contains Dish #5."
#### The Naive Assumption
Foreign keys ensure integrity, but they do **not** automatically create indexes. The `order_items` table has a composite primary key `(order_id, dish_id)`. This is great for finding "everything in order N" but useless for finding "every order that contains dish D"—because `dish_id` is the *trailing* column, not the leading one.
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM order_items WHERE dish_id = 5;
```
#### The Fallout
With no usable index for `dish_id`, the planner falls back to a **Parallel Seq Scan**, sweeping all 300,000 rows across multiple workers:
```text
QUERY PLAN
---------------------------------------------------------------------------------------
Finalize Aggregate (cost=4871.83..4871.84 rows=1 width=8)
Buffers: shared hit=1622
-> Gather (cost=4871.62..4871.83 rows=2 width=8)
Workers Planned: 1
-> Partial Aggregate (cost=3871.62..3871.63 rows=1 width=8)
-> Parallel Seq Scan on order_items (cost=0.00..3827.88 rows=17494 width=0)
Filter: (dish_id = 5)
```
#### The Lazy Fix
Add the missing index on the foreign-key column. The query collapses to an **Index Only Scan**, reading only 29 buffers instead of 1,622.
```sql
CREATE INDEX idx_order_items_dish_id ON order_items(dish_id);
```
### Bitmap Scan: The Reservation Map
Suppose the engine needs to fetch 50 rows scattered across different pages of a massive table.
An **Index Scan** would require 50 individual random I/O requests. A **Sequential Scan** would require reading every page in the table just to find those 50.
The **Bitmap Scan** is a middle ground, executed in up to three phases:
1. **[[Operations/Index/BitmapIndexScan|Bitmap Index Scan]]**: Postgres first scans the index and builds a **Bitmap**—an in-memory bitmask of all pages that contain matching records.
2. **[[Operations/Index/BitmapAndBitmapOr|Bitmap And / Or]]**: If the query has multiple filters, Postgres can combine these bitmaps using bitwise logic.
3. **[[Operations/Page/BitmapHeapScan|Bitmap Heap Scan]]**: The engine then visits the matching pages in physical order. By retrieving pages sequentially, it avoids the overhead of random I/O.
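In miniature, the strategy looks like this. This is a toy, page-level sketch: the hypothetical `(page, offset)` pairs stand in for TIDs, and a real exact bitmap tracks individual rows until memory pressure forces page-level lossiness.

```python
# Toy bitmap strategy: collect matching (page, offset) pointers from two
# index conditions, OR the page sets together, then visit pages in order.
matches_a = [(17, 3), (2, 1), (17, 8), (5, 4)]   # hypothetical TIDs, species_id = 1
matches_b = [(9, 2), (2, 6)]                     # hypothetical TIDs, species_id = 5

bitmap_a = {page for page, _ in matches_a}       # Bitmap Index Scan #1
bitmap_b = {page for page, _ in matches_b}       # Bitmap Index Scan #2
combined = bitmap_a | bitmap_b                   # BitmapOr

visit_order = sorted(combined)                   # Bitmap Heap Scan: physical order
print(visit_order)  # [2, 5, 9, 17]
```

Visiting pages in ascending physical order is what converts scattered index hits into a mostly sequential I/O pattern.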
### EXPLAIN: The Smart Lap
When you see a Bitmap Scan in your execution plan, you are seeing Postgres transition from "Point Searching" to "Bulk Retrieval."
> [!NOTE]
> **The Small Table Paradox**: For tiny tables, the planner almost always prefers a **Sequential Scan**. The overhead of opening the Index map is often greater than just glancing at the few pages of the whole table.
```sql
-- Searching for all animals in two specific species
-- (Forced Index Scan for demonstration)
SET enable_seqscan = off;
EXPLAIN SELECT * FROM animals WHERE species_id = 1 OR species_id = 5;
-- Result:
-- Bitmap Heap Scan on animals (cost=9.39..19.08 rows=11 width=48)
-- Recheck Cond: ((species_id = 1) OR (species_id = 5))
-- -> BitmapOr (cost=9.39..9.39 rows=11 width=0)
-- -> Bitmap Index Scan on idx_animals_species_id (cost=1.00..5.19 rows=5 width=0)
-- Index Cond: (species_id = 1)
-- -> Bitmap Index Scan on idx_animals_species_id (cost=1.00..5.19 rows=5 width=0)
-- Index Cond: (species_id = 5)
```
> [!NOTE]
> **The Recheck Condition**: Bitmaps are stored in **`work_mem`**. If the target set is too large to fit in memory, the bitmap becomes "Lossy" and marks whole pages instead of individual rows. Postgres must then "Recheck" the condition for every row on those pages to confirm they match the query criteria.
### 🧪 Lab Challenge: The Archival Panic (Bitmap Scans)
**The Request**: "Count all orders that are either Pending or Cancelled, placed since 2021."
#### The Naive Solution
When you use an `OR` condition across multiple values, the engine cannot rely on a single, clean B-tree traversal. It reverts to its "in-memory scratchpad" strategy.
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders
WHERE status IN ('Pending', 'Cancelled')
AND order_time >= '2021-01-01';
```
#### The Fallout
The plan reveals the two-phase bitmap strategy:
```text
QUERY PLAN
---------------------------------------------------------------------------------------
Aggregate (cost=2729.34..2729.35 rows=1 width=8)
-> Bitmap Heap Scan on orders o (cost=532.13..2018.24 rows=50002 width=4)
Recheck Cond: (status = ANY ('{Pending,Cancelled}'::order_status[]))
Filter: (order_time >= '2021-01-01'::timestamp)
-> Bitmap Index Scan on idx_orders_status (cost=1.00..519.63 rows=50007 width=0)
Index Cond: (status = ANY ('{Pending,Cancelled}'::order_status[]))
```
1. **`Bitmap Index Scan`**: Postgres builds an in-memory checklist (1s and 0s) of matching rows.
2. **`Bitmap Heap Scan`**: The engine visits the table heap in physical disk order, following the checklist. This is faster than random I/O but more flexible than a sequential scan.
### Efficiency Planning
To understand the Planner's decisions, we must look at the **Estimated Cost**. The planner assigns a cost value to every operation.
**The Sequential Fetch (The Default Walk):**
```sql
EXPLAIN SELECT * FROM ingredients WHERE category = 'Herb';
-- Result:
-- Seq Scan on ingredients (cost=1.00..2.40 rows=8 width=54)
-- Filter: (category = 'Herb'::ingredient_category)
```
**The Index Fetch (The Point Search):**
```sql
-- Forced Index Scan
SET enable_seqscan = off;
EXPLAIN SELECT * FROM ingredients WHERE id = 12;
-- Result:
-- Index Scan using ingredients_pkey on ingredients (cost=1.15..9.17 rows=1 width=54)
-- Index Cond: (id = 12)
```
**The Index-Only Fetch (Heap Avoidance):**
If the index map already contains all the data requested, Postgres avoids visiting the table heap entirely.
```sql
-- Forced Index Only Scan
SET enable_seqscan = off;
SET enable_bitmapscan = off;
EXPLAIN SELECT id FROM ingredients WHERE id < 100;
-- Result:
-- Index Only Scan using ingredients_pkey on ingredients (cost=1.12..9.14 rows=1 width=4)
-- Index Cond: (id < 100)
```
> [!IMPORTANT]
> **The Visibility Map (VM)**: An Index-Only Scan is not a guarantee. Because indexes do not store **MVCC (Visibility)** information, the engine must consult the **Visibility Map**. If the VM confirms the page is "clean," the fetch is successful. Otherwise, it must visit the heap.
Notice that for our tiny `ingredients` table, the **Seq Scan (cost=2.40)** is actually "cheaper" than the **Index Scan (cost=9.17)**! Postgres is smart—it knows that for a small table, it doesn't need a map to find data.
### 🧪 Lab Challenge: The Covering Index (Heap Avoidance)
**The Request**: "For the first 50 animals, show their name and how many orders each has placed."
#### The Naive Solution
The default index contains only the `animal_id`. The engine uses it to find the orders, but it must then visit the heap to retrieve the `id` for the count—triggering thousands of random heap fetches.
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.name, count(o.id)
FROM animals a
JOIN orders o ON a.id = o.animal_id
WHERE a.id < 50
GROUP BY a.name;
```
#### The Fallout
The inner `Index Scan` visits the heap for every matching order, consuming 588 buffers:
```text
-> Index Scan using idx_orders_animal_id on orders o (cost=0.29..33.69 rows=10 width=12)
Index Cond: (animal_id = a.id)
Buffers: shared hit=588
```
#### The Lazy Fix
A **Covering Index** uses the `INCLUDE` clause to attach extra columns to the index leaf pages. This allows an **Index Only Scan** that never touches the heap.
```sql
CREATE INDEX idx_orders_animal_covering
ON orders(animal_id) INCLUDE (id);
```
The buffer count drops from 588 to 148, and the scan becomes `Index Only Scan` with `Heap Fetches: 0`. The question was answered without ever opening the heap.
### Operation Ledger
![[Operations/Table/SeqScan]]
![[Operations/Other/CTEScan]]
![[Operations/Other/FunctionScan]]
![[Operations/Other/SubqueryScan]]
![[Operations/Other/CustomScanForeignScan]]
![[Operations/Table/TableFuncScan]]
![[Operations/Index/IndexScan]]
![[Operations/Index/IndexOnlyScan]]
![[Operations/Index/BitmapIndexScan]]
![[Operations/Page/BitmapHeapScan]]
![[Operations/Tuple/TidScan]]
![[Operations/Tuple/SampleScan]]
![[Operations/Other/ValuesScan]]
---
## 4.4 - Joins (The Pairing Dance)
<img src="assets/arch_sous_chefs_joins.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
After fetching tuples, the engine must **Join** them. Join Nodes cross-reference two distinct streams of data to produce a single result set.
Postgres utilizes three primary Join algorithms: **Nested Loop**, **Hash Join**, and **Merge Join**. The planner selects the algorithm based on row counts, index availability, and the sorted state of the data.
> [!IMPORTANT]
> **The Join Pull**: Remember the **[[Manuscript/04 - Query Planning & Execution/4.2 - Query Algebra (The Execution Tree)|Pull Model]]**! The join node doesn't just start mashing rows. It waits for the station above it to request a partner. Only then does it reach into its child nodes.
The **[[Operations/ResultSet/NestedLoop|Nested Loop Join]]** is the base algorithm for joining two relations. For every record in the **Outer** relation, Postgres scans the **Inner** relation for matches.
This algorithm has a worst-case complexity of $O(N \times M)$. The planner typically selects it when the Outer relation is small or the Inner relation has an index that allows for fast point lookups.
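The mechanics are easy to sketch. The toy Python below uses invented rows shaped like our `animals` and `orders` tables, and shows how an index on the inner relation turns the inner walk into a point lookup:

```python
# Toy nested loop join: for every outer row, walk (or probe) the inner.
animals = [(1, "Cleo"), (2, "Bandit")]           # (id, name) -- invented data
orders  = [(10, 1), (11, 1), (12, 2)]            # (order_id, animal_id)

# Naive form: O(N * M) comparisons.
naive = [(name, oid) for (aid, name) in animals
                     for (oid, o_aid) in orders if o_aid == aid]

# With an "index" on the inner relation, the loop body is a point lookup.
inner_index = {}
for oid, o_aid in orders:
    inner_index.setdefault(o_aid, []).append(oid)
indexed = [(name, oid) for (aid, name) in animals
                       for oid in inner_index.get(aid, [])]

print(naive == indexed, naive)  # True [('Cleo', 10), ('Cleo', 11), ('Bandit', 12)]
```

The second form is why the planner loves a Nested Loop when the outer side is tiny and the inner side is indexed: the quadratic term collapses into N cheap lookups.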
### 🧪 Lab Challenge: The Lateral Latency
**The Request**: "Show me the top 3 items for every animal that has placed more than 15 orders."
#### The Naive Solution
Standard window functions or subqueries can be a nightmare for the planner. Postgres often feels forced to calculate the rank for *every single order item* in the entire database before it can filter.
#### The Lazy Fix
Use a **LATERAL Join**. This allows Postgres to iterate over the animals and, for each one, perform a surgical, indexed-backed lookup of exactly three items.
```sql
SELECT a.name, top_dishes.dish_id
FROM animals a
CROSS JOIN LATERAL (
SELECT oi.dish_id
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE o.animal_id = a.id
    -- an ORDER BY here would define "top"; without one the three rows are arbitrary
    LIMIT 3
) top_dishes
WHERE (SELECT count(*) FROM orders WHERE animal_id = a.id) > 15;
```
**The Reward**: The plan shifts to a **Nested Loop**. Instead of a massive sort, Postgres fetches exactly what it needs for each animal and stops.
For large, unsorted datasets, Postgres utilizes a **[[Operations/ResultSet/HashJoin|Hash Join]]**. This is a two-phased operation:
1. **The Build**: Postgres takes the smaller relation (the "Inner") and constructs a **Hash Table** in memory.
2. **The Probe**: The engine then streams the larger relation (the "Outer") past the table, performing a near-instant lookup for every incoming tuple.
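A minimal sketch of the two phases, using invented rows shaped like `ingredients` and `supply_deliveries`:

```python
# Toy hash join: build a hash table on the smaller (inner) relation,
# then stream the larger (outer) relation past it.
ingredients = [(1, "Basil"), (2, "Mint")]                  # inner (small): build side
deliveries  = [(101, 2), (102, 1), (103, 2), (104, 7)]     # outer (large): probe side

hash_table = {ing_id: name for ing_id, name in ingredients}    # 1. The Build

result = []
for delivery_id, ing_id in deliveries:                         # 2. The Probe
    name = hash_table.get(ing_id)                              # O(1) lookup
    if name is not None:
        result.append((delivery_id, name))

print(result)  # [(101, 'Mint'), (102, 'Basil'), (103, 'Mint')]
```

Each relation is read exactly once; the price is the memory to hold the build side, which is exactly what `work_mem` constrains.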
### 🧪 Lab Challenge: The Multi-Table Join Cascade
**The Request**: "Count all deliveries for ingredients used in 'Dish 5'."
#### The Fallout
A non-sargable predicate (like `lower(name)`) at the top of a join chain can poison the planner's choice of join order all the way down.
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM dishes d
JOIN dish_ingredients di ON d.id = di.dish_id
JOIN ingredients i ON di.ingredient_id = i.id
JOIN supply_deliveries sd ON i.id = sd.ingredient_id
WHERE lower(d.name) = 'dish 5';
```
Because `lower()` hides the physical value, the planner cannot enter the join graph through an index. It defaults to a **Hash Join** across the entire chain, building hash tables for every intermediate step.
#### The Lazy Fix
By using a sargable predicate (`WHERE d.name = 'Dish 5'`), the planner is willing to push that filter into an index scan. The join order flips: Postgres starts with the single dish and walks the index "conveyor belt" to find the ingredients and deliveries, avoiding the heavy hash tables entirely.
> [!IMPORTANT]
> **Multi-batch Hash Joins**: If the Hash Table exceeds the available **[[Manuscript/06 - Resource Management & Processes/6.3 - work_mem (The Private Desk)|work_mem]]**, Postgres partitions the data into **Batches** on disk. It processes one batch at a time, spilling the rest to temporary files. This ensures query completion at the cost of high Disk I/O.
If both data streams are pre-sorted by the join key, Postgres deploys the **[[Operations/ResultSet/MergeJoin|Merge Join]]**.
Two pointers move down the sorted lists in parallel. Because the data is ordered, the engine only ever looks forward. In a many-to-many join, the engine must temporarily **Mark** its position to iterate through matches and then **Restore** the pointer. This algorithm is $O(N + M)$ and uses significantly less memory than a Hash Join.
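The pointer dance, including Mark and Restore, can be sketched over bare sorted keys (a toy, not the executor's actual state machine):

```python
# Toy merge join over two streams pre-sorted by the join key, with the
# "mark and restore" trick for duplicate keys on the inner side.
outer = [1, 2, 2, 4]          # sorted join keys (invented)
inner = [2, 2, 3, 4, 4]

pairs, j = [], 0
for key in outer:
    while j < len(inner) and inner[j] < key:
        j += 1                                # advance past smaller keys
    mark = j                                  # Mark: start of this key's run
    while j < len(inner) and inner[j] == key:
        pairs.append((key, inner[j]))
        j += 1
    j = mark                                  # Restore for the next outer row

print(pairs)  # [(2, 2), (2, 2), (2, 2), (2, 2), (4, 4), (4, 4)]
```

Both pointers only ever move forward (except for the bounded Restore), which is why the algorithm stays $O(N + M)$ with near-zero memory.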
### 🧪 Lab Challenge: The Grand Reconciliation (Merge Join)
**The Request**: "Count how many days we received a delivery on the exact same day an order was placed."
#### The Fallout
Joining 100,000 orders to deliveries is massive. If you use `to_char()` to join by date string, you destroy any inherent ordering in the data. Postgres must materialize and sort millions of derived strings into temporary files on disk.
#### The Lazy Fix
Using native datetime arithmetic (`date_trunc`) lets the engine compare raw binary values. It can then sort each stream once and deploy a **Merge Join**.
```text
-> Merge Join (cost=10108.65..20118.65 rows=500000 width=0)
Merge Cond: (date_trunc('day', sd.delivery_time) = date_trunc('day', o.order_time))
-> Sort (cost=67.83..70.33)
Sort Key: (date_trunc('day', sd.delivery_time))
-> Sort (cost=10040.82..10290.82)
Sort Key: (date_trunc('day', o.order_time))
```
Postgres marches down the two sorted lists, counting matches as it goes, without ever building a monolithic hash table.
### 🧪 Lab Challenge: The Shadow of the Unknown (Anti-Joins)
**The Request**: "Find the count of animals who have never placed an order."
#### The Naive Solution
The `NOT IN` trap is a classic. If the subquery returns even a single `NULL`, the entire query returns zero results. Because of these semantics, Postgres often falls back to a slow **Sequential Scan** with a subplan filter.
#### The Lazy Fix
Use `NOT EXISTS`. It has cleaner semantics and allows Postgres to use a **Hash Anti Join** or an **Index Anti Join**.
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM animals a
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.animal_id = a.id);
```
**The Reward**: Postgres uses a **Hash Right Anti Join**, scanning the orders table once to build a hash table, and then checking animals against it in a single pass.
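The shape of that anti join is easy to sketch with invented rows: a single pass over `orders` builds the hash, then `animals` is checked against it.

```python
# Toy hash anti join, mirroring what NOT EXISTS enables: keep the animals
# whose id never appears on the build side.
animals = [(1, "Cleo"), (2, "Bandit"), (3, "Pip")]   # (id, name) -- invented
orders  = [(10, 1), (11, 1), (12, 3)]                # (order_id, animal_id)

has_orders = {animal_id for _, animal_id in orders}  # build: one pass over orders
never_ordered = [name for aid, name in animals if aid not in has_orders]
print(never_ordered)  # ['Bandit']
```

Note how `None` (SQL `NULL`) in the probe values would not poison this set-membership test, which is the semantic cleanliness `NOT EXISTS` buys you over `NOT IN`.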
### The Decision: Selecting a Join Type
Let's observe how the planner chooses a Join algorithm based on table scale and statistics.
##### 1. Small Result Set Join (Nested Loop)
When joining a single record to its related entities, the planner selects a Nested Loop. The low number of iterations makes the setup cost of other algorithms unnecessary.
```sql
EXPLAIN SELECT dishes.name, ingredients.name
FROM dishes
JOIN dish_ingredients ON dishes.id = dish_ingredients.dish_id
JOIN ingredients ON dish_ingredients.ingredient_id = ingredients.id
WHERE dishes.id = 5;
-- Result (The Indexed Join):
-- Nested Loop Join (cost=1.42..12.31 rows=1 width=64)
-- -> Index Scan using dishes_pkey on dishes d (cost=1.14..9.15 rows=1 width=36)
-- -> Index Scan using dish_ingredients_pkey on dish_ingredients (cost=1.14..3.10 rows=1)
```
> [!TIP]
> **Minimal Iteration Overhead**: The cost is low because the Outer relation (Dishes) is filtered to a single row. The Join is effectively reduced to a few pointer dereferences.
##### 2. The Strategy Shift (Hash Join vs. Nested Loop Join)
When joining a large table like `supply_deliveries`, the planner's choice depends on the availability of indexes and the estimated output volume.
**Plan 1: Unindexed Join (Hash Join)**
Without a specific index on the join key of the large table, Postgres constructs an in-memory Hash Table rather than re-scanning the inner relation for every outer row.
```sql
EXPLAIN SELECT ingredients.name, supply_deliveries.delivery_time
FROM ingredients
JOIN supply_deliveries ON ingredients.id = supply_deliveries.ingredient_id
WHERE ingredients.id < 10;
-- Results (The Hash Table Build):
-- Hash Join (cost=5.00..96.70 rows=10 width=15)
-- Hash Cond: (supply_deliveries.ingredient_id = ingredients.id)
-- -> Seq Scan on supply_deliveries (cost=1.00..88.80 rows=1480)
-- -> Hash (cost=4.00..4.00 rows=10)
-- -> Seq Scan on ingredients (cost=1.00..4.00 rows=10)
```
**Plan 2: Indexed Join (Nested Loop)**
Once an index is added (`CREATE INDEX idx_supply_ingredient ON supply_deliveries(ingredient_id);`), the plan shifts. Postgres can now iterate through `ingredients` and perform a fast index lookup for each match in `supply_deliveries`.
```sql
-- Results:
-- Nested Loop Join (cost=1.43..85.00 rows=100 width=64)
-- -> Seq Scan on ingredients (cost=1.00..12.45 rows=10)
-- -> Index Scan using idx_supply_ingredient on supply_deliveries (cost=1.43..7.50 rows=10)
```
##### 3. The Sorted Join (Merge Join)
If both input streams are pre-sorted (typically by an Index Scan on a B-Tree), the planner selects a Merge Join for $O(N+M)$ efficiency.
```sql
SET enable_hashjoin = off;  -- nudge the planner toward a Merge Join for demonstration
EXPLAIN SELECT ingredients.id, supply_deliveries.id
FROM ingredients
JOIN supply_deliveries ON ingredients.id = supply_deliveries.ingredient_id
ORDER BY ingredients.id;
-- Result:
-- Merge Join (cost=1.30..150.00 rows=1000 width=8)
-- Merge Cond: (ingredients.id = supply_deliveries.ingredient_id)
-- -> Index Scan using ingredients_pkey on ingredients (...)
-- -> Index Scan using idx_supply_ingredient on supply_deliveries (...)
```
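The $O(N+M)$ bound comes from the fact that each stream's pointer only ever moves forward. Here is a minimal Python sketch (toy sorted lists, not Postgres internals; it assumes unique keys on the left stream, while the real operator also handles duplicate outer keys via mark/restore):

```python
# Two streams pre-sorted on the join key
left = [(1, "a"), (2, "b"), (4, "c")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]

def merge_join(l, r):
    i = j = 0
    while i < len(l) and j < len(r):
        lk, rk = l[i][0], r[j][0]
        if lk < rk:
            i += 1                  # left key too small: advance left
        elif lk > rk:
            j += 1                  # right key too small: advance right
        else:
            # emit every right row sharing this key, then advance left
            k = j
            while k < len(r) and r[k][0] == lk:
                yield (lk, l[i][1], r[k][1])
                k += 1
            i += 1

result = list(merge_join(left, right))
```

Neither input is ever revisited, so the join finishes in one coordinated pass over both streams.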
### Operation Ledger
![[Operations/ResultSet/NestedLoop]]
![[Operations/ResultSet/HashJoin]]
![[Operations/ResultSet/MergeJoin]]
---
## 4.5 - Aggregations (The Running Receipt)
<img src="assets/arch_prep_station.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Once raw tuples have been fetched and joined, the final stage of the Query Algebra is to summarize and sort the resulting stream. In Postgres, this is the responsibility of the **Aggregate Nodes**.
These nodes are the "summarizers" of the database—responsible for the `SUM`, `COUNT`, and `GROUP BY` logic. They work by iteratively applying a **Transition Function ($S_{func}$)** to a **Transition State ($S_{trans}$)** in memory until every tuple in the group has been processed.
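In other words, an aggregate is just a fold. Here is a minimal Python sketch (not Postgres internals) of the state machine: an initial state, plus a transition function applied once per tuple:

```python
def aggregate(tuples, s_init, s_func):
    state = s_init
    for t in tuples:
        state = s_func(state, t)   # S_trans <- S_func(S_trans, tuple)
    return state

# sum(quantity_kg) expressed as (initial state 0.0, transition "add")
total = aggregate([2.5, 4.0, 1.5], 0.0, lambda s, x: s + x)
```

Swap the initial state and transition function and the same loop computes `COUNT`, `MAX`, or any other aggregate.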
Let's see how the aggregation node summarizes our data.
#### 1. HashAggregate: The Memory-Intensive Lookup
If the incoming data is unsorted, Postgres utilizes a **HashAggregate** node. It builds a **Hash Table** in memory where every unique grouping key becomes a bucket. Each bucket stores the current **Transition State** for that group. As tuples stream through, $S_{func}$ is executed to update the state in the corresponding bucket.
```sql
-- Squashing 10 million individual deliveries into totals per ingredient
EXPLAIN SELECT ingredient_id, sum(quantity_kg)
FROM supply_deliveries
GROUP BY ingredient_id;
-- Result (The Bucket Toss):
-- HashAggregate (cost=105.30..106.55 rows=100 width=36)
-- Group Key: ingredient_id
```
> [!TIP]
> **The Memory Limit**: the success of a HashAggregate depends entirely on **[[Manuscript/06 - Resource Management & Processes/6.3 - work_mem (The Private Desk)|work_mem]]**. If the number of unique groups is so large that the Hash Table exceeds the allocated memory, the engine must "spill to disk"—writing intermediate buckets to temporary files, which dramatically increases execution time.
#### 2. GroupAggregate: The Streaming Summary
If the data arrives already sorted (either via a previous Sort node or an Index Scan), Postgres utilizes a **GroupAggregate**. This node is far more memory-friendly than a HashAggregate because it only tracks the *current* group. When the value of the grouping column changes, it knows the previous group is finished and emits the result immediately.
```sql
-- The ingredients arrive already prepped (sorted)
CREATE INDEX idx_supply_ingredient ON supply_deliveries(ingredient_id);
EXPLAIN SELECT ingredient_id, sum(quantity_kg)
FROM supply_deliveries
GROUP BY ingredient_id;
-- Result (The Clicker Pass):
-- GroupAggregate (cost=1.28..110.19 rows=100 width=36)
-- Group Key: ingredient_id
```
> [!TIP]
> **Streaming Efficiency**: A **GroupAggregate** doesn't need to build a massive table in memory; it simply watches the sorted stream pass by, updating a single transition state. It is a "streaming" operator that consumes almost no memory.
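A minimal Python sketch (toy data, not Postgres internals) shows why so little memory is needed. Only one group's state is ever live; a key change emits the finished group immediately:

```python
# (ingredient_id, quantity_kg) rows, pre-sorted on the group key
rows = [(1, 2.0), (1, 3.0), (2, 5.0), (2, 1.0)]

def group_aggregate(sorted_rows):
    current_key, state = None, 0.0
    for key, qty in sorted_rows:
        if key != current_key and current_key is not None:
            yield (current_key, state)   # group finished: emit immediately
            state = 0.0
        current_key = key
        state += qty
    if current_key is not None:
        yield (current_key, state)       # flush the final group

result = list(group_aggregate(rows))
```

Contrast this with the HashAggregate above, which must hold every group's state at once until the input is exhausted.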
#### 3. WindowAgg: The Peer Review
A **Window Function** allows you to perform calculations across a set of rows that are related to the current row, without collapsing them into a single group. The **WindowAgg** node is responsible for this logic.
### 🧪 Lab Challenge: The Analytics Abyss
**The Request**: "Give me the top 3 largest deliveries per quarter."
#### The Naive Solution
Using `row_number() OVER (PARTITION BY ...)` is standard, but the partition key determines how the engine groups data.
```sql
SELECT * FROM (
    SELECT id, quantity_kg,
           row_number() OVER (PARTITION BY extract(quarter FROM delivery_time)
                              ORDER BY quantity_kg DESC) AS rank
    FROM supply_deliveries
) ranked
WHERE rank <= 3;  -- window functions can't appear in WHERE, hence the subquery
```
#### The Fallout
The **WindowAgg** node needs its input sorted by the partition key, so the engine must compute `EXTRACT(quarter ...)` for every row and then **Sort** the entire table on that derived value. If your `work_mem` is small, this sort will spill to disk.
```text
-> WindowAgg (cost=70.33..92.83 rows=1000 width=54)
-> Sort (cost=70.33..72.83 rows=1000 width=46)
Sort Key: (EXTRACT(quarter FROM delivery_time)), quantity_kg DESC
```
#### The Lazy Fix
Partition on a predictable truncation like `date_trunc('quarter', delivery_time)` and back it with an expression index on that same expression. The rows then reach the **WindowAgg** pre-sorted, and the explicit Sort (along with its spill risk) disappears.
#### 4. The Execution Stop (Limit)
This is Postgres's most effective method for resource conservation. If the query only requires a subset of the data, the **Limit** node halts the upward flow of tuples once the threshold is met.
```sql
EXPLAIN SELECT * FROM supply_deliveries LIMIT 1;
-- Result:
-- Limit (cost=1.00..1.02 rows=1 width=44)
-- -> Seq Scan on supply_deliveries (cost=1.00..1845.00 rows=10000000)
```
Notice the **Estimated Cost**. Even though the child node (the Sequential Scan) has a total cost of 1,845, the **Limit** node knows it only needs the first tuple. Because of the **Volcano Model**, Postgres pulls exactly one record from the scan and then stops asking for more, so the query finishes at a negligible fraction of the child's total cost.
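The pull-based model is easy to demonstrate. In this minimal Python sketch (not Postgres internals), each node is a generator, and a counter proves the scan never ran to completion:

```python
import itertools

scanned = 0  # how many tuples the scan actually produced

def seq_scan(table):
    global scanned
    for row in table:
        scanned += 1
        yield row

def limit(child, n):
    # Pull exactly n tuples from the child, then stop asking.
    yield from itertools.islice(child, n)

rows = list(limit(seq_scan(range(10_000_000)), 1))
```

The "ten-million-row table" surrenders after producing a single tuple, because nothing above it ever asked for a second one.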
### Operation Ledger
![[Operations/ResultSet/Aggregate]]
![[Operations/ResultSet/WindowAgg]]
---
## 4.6 - Memory Operations (Sort, Hash, and Spill)
<img src="assets/ex_memory_ops_frog_sort.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Once the raw ingredients have been gathered (Scans), paired (Joins), and perhaps summarized (Aggregations), the engine still has to prepare the final presentation. This is the memory-bound stage of the query algebra—the stage where the engine organizes, truncates, and combines intermediate results into the final set returned to the client.
Operations in this layer deal with ordering, truncation, and set math. Unlike Scans, they do not retrieve data from the physical tables; they operate entirely on the data streams passed up from the lower nodes in the plan tree.
> [!WARNING]
> **The Memory Limit (Spill to Disk)**: Logistical nodes like `Sort` and `Hash` are the most common consumers of **`work_mem`**. If the intermediate result set exceeds this allocated memory, the engine must "spill" the data to temporary files on the disk. This transforms a fast in-memory operation into a slow, I/O-bound process.
### Ordering and Truncation
The most common logistical operations are `ORDER BY` and `LIMIT`. These tell the engine to organize the results into a specific sequence and to stop processing once a certain count is reached.
![[Operations/ResultSet/Sort]]
![[Operations/ResultSet/Limit]]
> [!TIP]
> **The Top-N Sort Optimization**: If you ask for `ORDER BY price DESC LIMIT 10`, Postgres doesn't sort the entire multi-million row table. It uses a specialized memory structure called a "Top-N Heapsort" to keep track of only the top 10 items as it scans, saving massive amounts of `work_mem` and time.
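The idea behind Top-N is a bounded heap rather than a full sort. A minimal Python sketch (toy data; `heapq.nlargest` keeps only the n best candidates while streaming):

```python
import heapq

# Hypothetical price column streaming out of a scan
prices = [5, 99, 42, 7, 63, 18, 250, 1, 77, 30, 12]

# Equivalent in spirit to ORDER BY price DESC LIMIT 3:
# only 3 candidates are ever held, no matter how long the stream is.
top_3 = heapq.nlargest(3, prices)
```

Memory stays proportional to N (here, 3 items), not to the size of the input.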
### Set Operations and Unions
Sometimes a query must combine entirely separate logic trees—for instance, a `UNION` of two different menu searches. This layer provides nodes that concatenate or mathematically merge those result sets.
![[Operations/ResultSet/Append]]
![[Operations/ResultSet/MergeAppend]]
![[Operations/ResultSet/SetOp]]
### Materialization and Uniqueness
When a complex subquery result needs to be referenced multiple times, the engine will "plate" it in a temporary memory space so it doesn't have to be recalculated. Similarly, if the query requires `DISTINCT` results, the engine must filter out identical rows.
![[Operations/ResultSet/Materialize]]
![[Operations/ResultSet/Unique]]
![[Operations/ResultSet/ProjectSet]]
![[Operations/ResultSet/Hash]]
These finishing touches are usually fast, but if the result set is massive, operations like `Sort` or `Hash` will spill to temporary files on disk, generating I/O wait events and slowing down the query.
---
## 4.7 - Mutation Path (The Write Pipeline)
<img src="assets/ex_mutation_axolotl_stamp.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Everything we have discussed so far in the query algebra—Scans, Joins, Aggregations, and Memory operations—has been purely observational: the engine reads indexes, scans heap pages, and organizes results without altering the source data. The write path is a different machine with different constraints.
But databases are not just read-only archives. Eventually, the application must **Modify the State** of the system.
In Postgres, mutation is handled as just another node at the very top of the execution tree. The lower nodes figure out *which* tuples need to be updated or deleted and pass those physical addresses up to the **ModifyTable** node, which applies the change under the **MVCC** (Multi-Version Concurrency Control) visibility rules.
### The ModifyTable Node
This is the workhorse of all `INSERT`, `UPDATE`, and `DELETE` operations. It takes the result set from the children and applies the changes to the physical pages.
![[Operations/Tuple/ModifyTable]]
> [!IMPORTANT]
> **The Hidden Reads**: An `UPDATE` or `DELETE` requires finding the data first! Every mutation query contains an implicit `SELECT` statement underneath it. If your `UPDATE` is slow, it is almost certainly because the implicit `SeqScan` or `IndexScan` feeding the `ModifyTable` node is inefficient, not the mutation itself.
>
> **Index Your Writes**: Just like a `SELECT`, an `UPDATE` that searches by a function will trigger a full Sequential Scan. You can use **Functional Indexes** to create high-speed shortcuts for your mutation queries.
### 🧪 Lab Challenge: The Blind Update (Functional Indexes)
**The Request**: "The nightly cleanup job is taking too long and locking up the system."
#### The Naive Solution
Every `UPDATE` is a hidden `SELECT`. If the application wraps the filter in a function, the engine loses its shortcut and falls back to a full Seq Scan, holding row locks for far too long.
```sql
UPDATE orders
SET status = 'Cancelled'
WHERE extract(year from order_time AT TIME ZONE 'UTC') = 2025;
```
#### The Fallout
Because the column is wrapped in `extract()`, the index is useless. The planner reads the entire 100K-row `orders` table just to find the few rows it needs to update.
#### The Lazy Fix
Build a map shaped exactly like the application's blind spot: a **Functional Index**.
```sql
CREATE INDEX idx_orders_year
ON orders( (EXTRACT(year FROM order_time AT TIME ZONE 'UTC')) );
```
### Concurrency and Locking
What happens if two backend processes try to update the exact same ingredient tuple at the exact same time? The engine must orchestrate a queue so they don't overwrite each other. Before a mutation can occur, the engine may need to acquire a lock on the specific row.
### 🧪 Lab Challenge: The Row Lock Contention (LockRows)
**The Request**: "Apply a 10% price markup to all dishes that contain 'Spice'."
#### The Fallout
Because `lower()` forces a full pass to find matching dishes, the **ModifyTable** node is fed a fat, slow pipeline of rows. This operation will lock rows unpredictably and block other concurrent writes.
```sql
UPDATE dishes SET price = price * 1.10
WHERE id IN (
SELECT d.id FROM dishes d
JOIN dish_ingredients di ON d.id = di.dish_id
JOIN ingredients i ON di.ingredient_id = i.id
WHERE lower(i.category::text) = 'spice'
);
```
#### The Lazy Fix
Remove the `lower()` cast. The planner can then drive the entire process via targeted index lookups: Postgres finds the specific dishes and updates them surgically, locking only the rows it needs, and only briefly.
![[Operations/Tuple/LockRows]]
### The Result Node (Simple Inserts)
Sometimes, there is no need to scan anything. A simple `INSERT INTO ingredients VALUES ('Saffron')` does not have a child scan node because there's nothing to search for. Instead, the planner generates a trivial `Result` node that hands the hardcoded values directly to `ModifyTable`.
![[Operations/Tuple/Result]]
When mutations occur, they don't just change the tuples; they also record the change in the **Write-Ahead Log (WAL)** and update the **Bookshelves (Indexes)** to reflect the new tuple addresses. Mutation is the heaviest operation in the engine.
While a single backend process can meticulously update every record in a shipment, some tasks are simply too large for one worker. When the scale of the scan or the weight of the mutation becomes a mountain, Postgres calls for reinforcements.
---
## 4.8 - Parallel & Distributed (The Worker Pool)
<img src="assets/ex_parallel_pigeon_team.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
By default, Postgres assigns exactly one backend process to every query. If a query requires scanning millions of rows — say, aggregating a decade of supply deliveries — that single process will read, sort, and aggregate alone.
To speed this up, Postgres can split the work. The planner decides the scan is large enough to justify coordination overhead, so the primary backend (the **leader**) spawns several **Parallel Workers** to divide the labor.
### Gathering the Results
When a worker finishes its chunk, it doesn't return results to the client directly. It passes them back to the leader via a shared memory queue. This is the **Gather** operation.
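The partition-and-gather pattern itself is simple. Here is a minimal Python sketch (sequential for clarity; real workers are separate processes exchanging tuples through a shared-memory queue): split the scan into chunks, let each worker compute a partial aggregate, and have the leader combine the partials:

```python
# Hypothetical quantity column: 1..100
deliveries = list(range(1, 101))

def worker(chunk):
    # Each worker computes a partial aggregate over its slice of the table.
    return sum(chunk)

chunks = [deliveries[i::3] for i in range(3)]   # 3 workers, strided chunks
partials = [worker(c) for c in chunks]          # runs concurrently in Postgres
total = sum(partials)                           # the leader's Gather/Finalize step
```

Note that this only works because `sum` of partial sums equals the total sum; aggregates like `avg` need separate partial and finalize functions.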
![[Operations/ResultSet/GatherMerge]]
> [!TIP]
> **Parallel Overheads**: Spawning workers takes time and memory. The planner uses `min_parallel_table_scan_size` (default 8MB) to decide if a table is large enough to justify parallelism. Don't force parallel queries on small tables — the coordination overhead will exceed the cost of a single-process scan.
### Distributed Architectures
In multi-node setups (like Citus or other distributed Postgres architectures), data is spread across multiple servers.
A query might need to read from several nodes simultaneously. Instead of passing data through shared memory, nodes exchange tuples over the network. The planner introduces specific distributed operations to route the data efficiently.
![[Operations/Distributed/Broadcast]]
![[Operations/Distributed/Redistribute]]
![[Operations/Distributed/Gather]]
Whether parallelizing across local CPU cores or distributing across the cloud, the principle is identical: divide the scan, process independently, and funnel results back to a single leader for final assembly.
---
## 4.9 - Common Table Expressions (The Temporary Station)
<img src="assets/arch_cte_station.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
A **Common Table Expression (CTE)** is a temporary result set that you can reference within another SQL statement. In the Query Algebra, these are handled by **CTE Scan** nodes.
Postgres treats a CTE as a discrete logical unit. The engine prepares a specific set of data, assigns it a name, and allows other operations in the plan tree to reference it as if it were a permanent table.
### The Materialization Fence
Historically, Postgres always "materialized" CTEs: it executed the CTE query in its entirety, wrote the result to internal temporary storage, and then scanned that storage. Since PostgreSQL 12, a CTE referenced only once is inlined into the main query; materialization still happens when the CTE is referenced multiple times, is recursive, or is forced with `AS MATERIALIZED`.
When a CTE is materialized, it becomes an **Optimization Fence**: the planner cannot push filters from the main query down into the CTE, or vice versa. This is powerful for controlling join order but can be a performance trap if the CTE is massive.
### 🧪 Lab Challenge: The Supply Chain Labyrinth (Recursive CTEs)
**The Request**: "Trace a dish's ingredient lineage all the way down the supply chain."
#### The Naive Solution
Using `WITH RECURSIVE` enables powerful graph traversal. But adding a non-sargable string cast outside the CTE acts as an Optimization Fence, forcing Postgres to build the entire chain in memory before filtering.
```sql
WITH RECURSIVE supply_chain AS (
SELECT id, ingredient_id FROM supply_deliveries
UNION ALL
SELECT sd.id, sd.ingredient_id FROM supply_deliveries sd
JOIN supply_chain sc ON sd.id = sc.id + 1
)
SELECT * FROM supply_chain sc
WHERE sc.id::text = '1';
```
#### The Fallout
Postgres recursively builds the *entire* supply chain in memory (100,000+ rows) before it throws 99% of it away. You will see a **`WorkTable Scan`**—the engine reading its own temporary scratchpad to feed the recursion.
#### The Lazy Fix
Push the exact filter directly into the base case (the "anchor" query) of the CTE.
```sql
WITH RECURSIVE supply_chain AS (
SELECT id, ingredient_id FROM supply_deliveries WHERE id = 1
UNION ALL
-- ...
```
By filtering early, the engine only ever prepares the specific slice of data it needs, keeping the CTE scan clean and fast.
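The difference in work is easy to quantify. In this minimal Python sketch (toy data, not Postgres internals), recursion is modeled as a worktable loop, and `links` is a hypothetical id-to-next-id supply chain:

```python
# Hypothetical chain: each delivery points at the next one
links = {i: i + 1 for i in range(1, 100)}

def recursive_cte(anchor_rows, follow):
    result, worktable = [], list(anchor_rows)
    while worktable:
        result.extend(worktable)                 # rows materialized so far
        # recursive term: join the current worktable against the base relation
        worktable = [follow[r] for r in worktable if r in follow]
    return result

# Late filter: build everything, then throw almost all of it away.
everything = recursive_cte(range(1, 100), links)
survivors = [r for r in everything if r == 1]

# Early filter: the anchor already contains only the slice we need.
chain_from_1 = recursive_cte([1], links)
```

The late-filtered version materializes thousands of worktable rows to keep a single survivor; the anchored version only ever produces the chain it was asked for.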
### Operation Ledger
![[Operations/Other/CTEScan]]
---
## 4.10 - Sargability (The Art of Not Opening Every Box)
<img src="assets/arch_sargability_safe_map.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
**Sargability** (Search ARGument-ABLE) defines whether a query predicate can be resolved using an index. A sargable predicate allows the engine to perform a targeted index lookup. A non-sargable predicate forces the engine to perform a full sequential scan of the table heap.
For example, `WHERE id = 12` allows a fast B-Tree traversal. However, wrapping the column in a function—like `WHERE lower(name) = 'saffron'`—prevents index usage. Indexes are built around **Operator Classes** (`opclass`) that compare raw column values, not the results of functions applied to them.
---
### The Technical Gap: Why Functions Break Indexes
Indexes are built using specific **Operator Classes** (`opclass`). For example, a B-Tree on a `TEXT` column uses the `text_ops` class, which defines how to evaluate `<`, `>`, and `=`.
When you wrap a column in a function like `lower()`, you are no longer comparing the raw column; you are comparing a derived value that was never indexed. The planner cannot match your expression against the index's `opclass`, so it defaults to a Sequential Scan.
---
### The Decision Chain: Selectivity and Join Order
Sargability is critical for **Selectivity Estimation**. The **[[Manuscript/04 - Query Planning & Execution/4.1 - Query Planner (The Blueprint of Execution)|Planner]]** needs to estimate the number of rows a filter will return to build an optimal plan tree.
- **Sargable**: The Planner consults table statistics (MCVs and Histograms). It sees exactly how many rows are expected to pass and chooses an **Index Scan** if the volume is low. This high-fidelity estimation allows the planner to select the correct **Join Order** (e.g., joining a small, filtered set into a larger table using a Nested Loop).
- **Non-Sargable**: The Planner cannot apply statistics to a custom function result. It defaults to a generic guess (often 1% or 0.5%). This "Selectivity Blindness" is catastrophic during a JOIN—the planner may over-estimate the rows and incorrectly choose a **Hash Join**, wasting memory and CPU cycles on an unnecessary strategy.
Without accurate selectivity estimates, the planner defaults to the most conservative operator: the **Sequential Scan**.
---
When the planner can see the filter clearly, the resulting **Index Scan** is vastly more efficient than walking the whole room!
### 🧪 Lab Challenge: The Arithmetic Trap
**The Request**: "Show all deliveries that were technically due before New Year's 2024, after accounting for a one-day processing buffer."
#### The Fallout
Math on the column side prevents the index from being used. Adding an interval to the column forces Postgres to compute a new value for every row before it can compare anything.
```sql
SELECT count(*) FROM supply_deliveries
WHERE delivery_time + interval '1 day' < '2024-01-01';
```
The planner reverts to a **Seq Scan** because the engine cannot push arithmetic into the B-Tree map.
#### The Lazy Fix
Isolate the column. Move the arithmetic to the constant on the right-hand side. The planner now has a sargable predicate it can hand straight to the index.
```sql
SELECT count(*) FROM supply_deliveries
WHERE delivery_time < '2024-01-01'::timestamptz - interval '1 day';
```
---
### Optimization: Predicate Pushdown
Sargability is the fuel for **Predicate Pushdown**. When the Planner identifies a sargable filter, it can "push" that logic as deep as possible into the execution tree. This minimizes the volume of data that must be joined and sorted by higher-level nodes.
**Scenario**: Find all ingredients used in "Dish 5".
#### State 1: The Express Assembly (Sargable Pushdown)
If we use a sargable filter (`WHERE d.id = 5`), the Query Optimizer pushes the ID filter down to the lowest level.
```sql
-- ✅ The Express Assembly
EXPLAIN SELECT * FROM ingredients i
JOIN dish_ingredients di ON i.id = di.ingredient_id
JOIN dishes d ON di.dish_id = d.id
WHERE d.id = 5;
-- Results:
-- Nested Loop (cost=1.42..10.51 rows=1 width=76)
-- -> Nested Loop (cost=1.28..10.21 rows=1 width=36)
-- -> Index Scan on dishes d (cost=1.14..9.15 rows=1 width=12)
-- Index Cond: (id = 5)
-- -> Index Scan on dish_ingredients di (cost=1.14..3.06 rows=1 width=12)
-- Index Cond: (dish_id = d.id)
-- -> Index Scan on ingredients i (cost=1.14..1.30 rows=1 width=40)
-- Index Cond: (id = di.ingredient_id)
```
#### State 2: No Pushdown (Non-Sargable)
If we use a non-sargable filter (`WHERE lower(d.name) = 'dish 5'`), the planner can't use the index. It falls back to a **Hash Join** across the entire chain:
```sql
-- ❌ Non-sargable filter
EXPLAIN SELECT * FROM ingredients i ... WHERE lower(d.name) = 'dish 5';
-- Results:
-- Hash Join (cost=5.85..8.22 rows=1 width=76)
-- Hash Cond: (di.dish_id = d.id)
-- -> Seq Scan on dish_ingredients di (cost=1.00..2.91 rows=91 width=14)
-- -> Hash (cost=5.84..5.84 rows=1 width=19)
-- -> Seq Scan on dishes d (cost=1.00..5.84 rows=1 width=19)
-- Filter: (lower(name) = 'dish 5'::text)
```
Instead of targeted index lookups, the engine must build hash tables and sequentially scan every table. **Predicate Pushdown** failed because the predicate was hidden behind a function.
> [!TIP]
> If you must query using a function, create an **Expression Index**. For example, `CREATE INDEX idx_dishes_lower_name ON dishes (lower(name))` allows Postgres to index the *result* of the function, restoring sargability.
### 🧪 Lab Challenge: The Multicolumn Index
**The Request**: "Show all deliveries from Supplier #3 that arrived in March 2024."
#### The Naive Solution
With only a single-column index on `delivery_time`, the planner walks the time range and then filters by supplier as a row-level check. Every row inside the date range is fetched from the heap before the supplier is checked.
#### The Lazy Fix
A **Multicolumn Index** gives the engine a single B-Tree that orders rows first by supplier, then by time.
```sql
CREATE INDEX idx_supply_supplier_time
ON supply_deliveries(supplier_id, delivery_time);
```
> [!IMPORTANT]
> **Column Order Matters**: The leftmost column is the primary sort key. `(supplier_id, delivery_time)` is perfect for this query, but **useless** for "all deliveries in March regardless of supplier." The rule: equality column first, range column second.
---
### Common Non-Sargable Anti-Patterns
| Anti-Pattern | Technical Limitation | Sargable Alternative |
| :--- | :--- | :--- |
| `WHERE lower(name) = 'saffron'` | Function result is missing from index. | Use an **Expression Index**. |
| `WHERE date + '1 day' > now()` | Predicate math breaks boundary checks. | `WHERE date > now() - '1 day'`. |
| `WHERE coalesce(status, 'P') = 'P'` | Null-handling masks physical values. | Use a **Partial Index**. |
| `WHERE name LIKE '%spice%'` | B-Tree only supports prefix matching. | Use a **GIN Index** with `pg_trgm`. |
---
### Recap: The Sargability Rules
> [!IMPORTANT]
> - **Never wrap the column in a function** if you want to use an index.
> - **Keep the math on the "value" side** (the right side) of the operator.
> - **Non-sargable filters blind the Planner**, leading to terrible join decisions.
> - **Sargability enables Index-Only Scans**, allowing the engine to skip the heap entirely.
---
## 4.11 - Summary (Query Planning & Execution)
> The planner is a cost-based accountant, completely indifferent to how your SQL is formatted. It translates your intent into a physical pipeline of scans and joins, relentlessly seeking the path of least resistance.
<div style="page-break-after: always;"></div>