## 7.9 Performance Audit (Wait Event Exercises)

![[assets/arch_performance_audit.png|450]]

Welcome to the **Performance Audit**. In Chapter 4, the subject of investigation was always the Query Plan. Here, the subject is the *entire running database*. The wait event is the primary symptom. **`pg_wait_tracer`** provides the high-precision timeline. `EXPLAIN (ANALYZE, BUFFERS)` provides the architectural audit.

For each challenge, a query has gone wrong. Run it against your local Elephant Cafe database, observe both the plan output *and* what **`pg_wait_tracer`** reveals, and then apply the fix.

> [!NOTE]
> **Setup**: These exercises work best with a large dataset. Run `psql -f scripts/seed_large.sql` to populate the database with 100,000+ orders and 20,000 animals. Then run `ANALYZE;` to refresh the planner's statistics.

---

### Challenge 1: The Silent Sweat (Classifying a NULL Wait Event)

**The Complaint**: "Our nightly report is slow. I checked `pg_stat_activity` and the process doesn't seem to be waiting for anything!"

**The Clue**: A NULL `wait_event` during high CPU usage.

#### The Naive Query

```sql
-- ❌ The Slow Nightly Report
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    s.name AS species,
    s.diet_type,
    count(o.id) AS total_orders,
    sum(oi.quantity) AS total_items,
    avg(i.base_cost_per_kg) AS avg_ingredient_cost
FROM animals a
JOIN species s ON a.species_id = s.id
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
JOIN dish_ingredients di ON di.dish_id = d.id
JOIN ingredients i ON i.id = di.ingredient_id
WHERE extract(year FROM o.order_time) = 2024
GROUP BY s.name, s.diet_type
ORDER BY total_orders DESC;
```

#### The Audit Log

To see this in action, we can simulate multiple waiters hitting this exact report simultaneously using `pgbench`:

```bash
# Save the bad query to a file
echo "
SELECT s.name, sum(oi.quantity), avg(i.base_cost_per_kg)
FROM animals a
JOIN species s ON a.species_id = s.id
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
JOIN dish_ingredients di ON di.dish_id = d.id
JOIN ingredients i ON i.id = di.ingredient_id
WHERE extract(year FROM o.order_time) = 2024
GROUP BY s.name;
" > slow_report.sql

# Simulate 4 concurrent analysts running it repeatedly for 10 seconds
pgbench -c 4 -T 10 -f slow_report.sql -U postgres elephant_cafe
```

While that is running, trace it using `pg_wait_tracer` in another terminal:

```bash
# Capture and dump the precise wait summary
sudo ./pg_wait_tracer --count 1 --interval 5 --view time_model --pgdata /var/lib/postgresql/data
```

**What you will see**:

```text
--- 2024-05-20 14:30:05 ---
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Time Model          Backends: 12          Interval: 5s
════════════════════════════════════════════════════════════════════════════════
Wait Class      AAS      % DB Time    CPU%     Wait%
──────────      ──────   ────────     ──────   ──────
CPU             4.96     99.1%        100%     0%
Client          1.04     1.9%         0%       100%
IO              1.00     0%           0%       0%
```

The tracer returns an Average Active Sessions (AAS) value very close to `5.0`: roughly your 4 pgbench connections plus the parallel workers they launch. But more importantly, the `Wait%` for the cluster is near `0%`, and the `CPU%` is `100%`. There is no `DataFileRead`. The process is completely CPU-bound.

The EXPLAIN output reveals why:

```text
-- Key Plan Signals:
Gather (actual time=2.234..8923.456 rows=5 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Partial HashAggregate (...)
    -> Parallel Hash Join (...)
      -> Parallel Seq Scan on orders o (...)
           Filter: (EXTRACT(year FROM order_time) = 2024)
           Rows Removed by Filter: 150000   -- ⚠️ Too much work
```

**Diagnosis**: This is a **Complexity Failure** (The Hamster Wheel). The `extract(year FROM o.order_time)` predicate is non-sargable — it forces a full Sequential Scan of the orders table and then discards 150,000 rows. The 7-table join is then computed over the full unfiltered dataset.
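Before applying the fix, it is worth confirming the classification even without the tracer. The minimal sketch below uses nothing but `pg_stat_activity`: during the pgbench run you should see several `active` rows whose `wait_event` is NULL and whose runtime keeps growing — the textbook signature of the Silent Sweat.

```sql
-- Cross-check without the tracer: active backends that are NOT waiting.
-- An active session with wait_event IS NULL is, by definition, on CPU.
SELECT pid,
       now() - query_start AS query_age,
       state,
       wait_event_type,
       wait_event,
       left(query, 60)     AS query_snippet
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
ORDER BY query_age DESC;
```

Because this is a point-in-time sample, run it a few times (or with `\watch 1` in psql); a backend that stays at the top with no wait event is sweating on CPU, not queuing for a resource.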
#### The Lazy Fix

Unwrap the date function. Give the planner a range it can use with the `idx_orders_order_time` index:

```sql
-- ✅ The Fix: Sargable date range
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    s.name AS species,
    s.diet_type,
    count(o.id) AS total_orders,
    sum(oi.quantity) AS total_items,
    avg(i.base_cost_per_kg) AS avg_ingredient_cost
FROM animals a
JOIN species s ON a.species_id = s.id
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
JOIN dish_ingredients di ON di.dish_id = d.id
JOIN ingredients i ON i.id = di.ingredient_id
WHERE o.order_time >= '2024-01-01'
  AND o.order_time < '2025-01-01'
GROUP BY s.name, s.diet_type
ORDER BY total_orders DESC;
```

**The Reward**: The plan's entry point shifts from a Parallel Seq Scan to a targeted **Index Scan** on `order_time`. The 7-table join is now computed against only the 2024 rows, not all 200,000.

> [!IMPORTANT]
> **Classification**: `wait_event IS NULL` + high CPU + long execution = **Complexity Failure**. The fix is always architectural: fix the plan, not the hardware.

---

### Challenge 2: Reading the Elevator (I/O Wait Triage)

**The Complaint**: "A simple aggregation over supply deliveries takes 12 seconds in the morning. CPU looks idle. Something else is going on."

**The Clue**: `wait_event_type = 'IO'`, `wait_event = 'DataFileRead'`.

#### The Naive Query

```sql
-- ❌ The Cold-Cache Scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    supplier_id,
    count(*) AS delivery_count,
    sum(quantity_kg) AS total_kg
FROM supply_deliveries
GROUP BY supplier_id
ORDER BY total_kg DESC;
```

#### The Audit Log

```bash
# Simulate 10 connections hitting the un-cached aggregate
echo "SELECT supplier_id, count(*) FROM supply_deliveries GROUP BY supplier_id;" > scan.sql
pgbench -c 10 -T 10 -f scan.sql -U postgres elephant_cafe
```

While pgbench fires, record the trace:

```bash
sudo ./pg_wait_tracer --count 1 --interval 5 --view system_event --pgdata /var/lib/postgresql/data
```

**What you will see**:

```text
--- 2024-05-20 14:40:05 ---
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — System Events          Backends: 18          Interval: 5s
════════════════════════════════════════════════════════════════════════════════
Wait Class    Wait Event         AAS       % DB Time    Wait (ms)
──────────    ───────────────    ──────    ────────     ────────
IO            DataFileRead       10.12     91.2%        45600
CPU           —                  1.85      9.5%         4250
Client        ClientRead         1.03      1.3%         150
```

The tracer reports an AAS near `10.0`. But unlike the CPU-bound query in Challenge 1, this time is almost entirely wait time: rerun the `time_model` view and `Wait%` sits above `90%`. The `IO:DataFileRead` event dominates the stack. The CPU is idle, waiting for the Elevator.

```text
-- Key Plan Signals:
Seq Scan on supply_deliveries (actual time=1.023..1823.234 rows=100000 loops=1)
  Buffers: shared hit=3 read=694   -- ⚠️ 694 physical page reads!
```

**Diagnosis**: This is a **Throughput Failure** (The Elevator Queue). The `supply_deliveries` table's 694 pages are not sitting in `shared_buffers` when the morning report runs. Each page must be fetched from physical disk — the "elevator" bringing suitcases up from the Filing Cabinet.
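One more signal worth collecting before reaching for a fix: if you can enable `track_io_timing` for your session (it may require superuser privileges), `EXPLAIN (ANALYZE, BUFFERS)` reports not just how many pages were read but how long the backend sat inside those read calls. A minimal sketch:

```sql
-- Measure time spent in physical reads, not just page counts.
-- track_io_timing adds an "I/O Timings" line to the EXPLAIN output
-- (the exact label varies slightly between Postgres versions).
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS)
SELECT supplier_id, count(*) AS delivery_count, sum(quantity_kg) AS total_kg
FROM supply_deliveries
GROUP BY supplier_id
ORDER BY total_kg DESC;
```

If most of the Seq Scan node's total time shows up as read time, the plan and the tracer agree: the backend is queuing for the Elevator, not doing work.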
#### The Fix Strategy

This challenge has no single SQL fix — it is a **configuration and hardware decision**:

```sql
-- Diagnostic 1: How large is the table in pages?
SELECT relname, relpages, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname = 'supply_deliveries';

-- Diagnostic 2: What fraction of reads are buffer hits?
SELECT relname, heap_blks_read, heap_blks_hit,
       round(heap_blks_hit::numeric / (heap_blks_hit + heap_blks_read + 1) * 100, 2) AS hit_pct
FROM pg_statio_user_tables
WHERE relname = 'supply_deliveries';
```

**Fix Ladder** (ordered by impact):

1. **Increase `shared_buffers`**: The Warming Rack. If the table's working set fits in RAM, the reads disappear on the second execution.
2. **Add a targeted index**: If the `GROUP BY supplier_id` query is frequent, an index on `supplier_id` can allow an **Index Scan** with far fewer page reads.
3. **Upgrade storage to SSD**: If the data is genuinely too large to cache, faster I/O is the only hardware lever.

> [!TIP]
> **The Buffer Hit Ratio**: A healthy OLTP database should have a buffer hit rate above **99%**. If you see 60–80%, your `shared_buffers` is too small for your working set. Add RAM before you add indexes.

---

### Challenge 3: The Stuck Frog (Diagnosing Lock Contention)

**The Complaint**: "Our order processing API has started queuing up. Simple `UPDATE` calls are taking 30+ seconds."

**The Clue**: `wait_event_type = 'Lock'`, `wait_event = 'transactionid'`.

#### The Audit Log

In your first session, simulate a long-running transaction that holds a row lock:

```sql
-- Session 1: The Blocker (do not COMMIT yet)
BEGIN;
UPDATE orders SET status = 'Cooking' WHERE id = 1;
-- ↑ This holds a RowExclusive lock and an exclusive lock on its own XID.
```

In a second session, simulate a competing update:

```sql
-- Session 2: The Blocked Process
UPDATE orders SET status = 'Cancelled' WHERE id = 1;
-- ↑ This will hang, waiting for Session 1's transaction to release.
```

While Session 2 is waiting, inspect the active sessions using the tracer's **Active view** to identify the bottleneck:

```bash
# Observe the active sessions in real-time
sudo ./pg_wait_tracer --view active --count 1
```

**What you will see**:

```text
PID     State      Wait Event            Wait (ms)    DB Time (ms)    Backend Type
──────  ─────────  ───────────────────   ──────────   ────────────    ────────────
8551    waiting    Lock:transactionid    14200.0      15800.0         client
8505    idle       —                     —            240.5           client
```

Session 2 (PID 8551) is `waiting` on a `transactionid` lock. Because the database is otherwise quiet (PID 8505 is `idle`), we can infer that 8505 is the "Blocker"—it holds a row lock inside a still-open transaction but is doing nothing, likely because a user opened the transaction and forgot to finish it.

**Diagnosis**: This is a **Coordination Failure** (The Stuck Frog / Traffic Jam). Improving hardware is entirely useless. The fix is in **transaction management**.

#### The Fix Strategies

```sql
-- Fix 1: Always commit or rollback promptly in application code.
-- The blocker is Session 1 holding an idle open transaction.
-- Commit it:
COMMIT;

-- Fix 2: Use lock_timeout to prevent indefinite waits
SET lock_timeout = '5s';
UPDATE orders SET status = 'Cancelled' WHERE id = 1;
-- This will raise an error after 5s rather than hanging indefinitely.

-- Fix 3: Use advisory locks for application-level sequencing
-- (For cases where table-level row contention is unavoidable)

-- Fix 4: Find and kill a truly stuck blocker (use carefully in production)
SELECT pg_terminate_backend(<blocking_pid>);
```
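Fix 4 assumes you already know which PID is the blocker. In a busier system than this two-session demo, you can ask the lock manager directly: `pg_blocking_pids()` returns, for any backend, the PIDs it is waiting on. A minimal, catalog-only sketch:

```sql
-- For every waiting backend, show who is blocking it and how long the
-- blocker's transaction has been open. Feed blocking_pid into Fix 4 if needed.
SELECT blocked.pid                    AS blocked_pid,
       blocked.wait_event             AS blocked_on,
       blocker.pid                    AS blocking_pid,
       blocker.state                  AS blocking_state,
       now() - blocker.xact_start     AS blocking_xact_age,
       left(blocker.query, 60)        AS blockers_last_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocker
  ON blocker.pid = ANY (pg_blocking_pids(blocked.pid));
```

In this exercise it should return a single row pointing at Session 1, whose state reads `idle in transaction`: the signature of a forgotten `COMMIT`.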
> [!WARNING]
> **The Lock Cascade**: Lock contention doesn't stay isolated. If Process A is blocked by Process B, and 20 application threads all issue the same `UPDATE`, they will all queue up behind Process A. One forgotten `BEGIN` with no `COMMIT` can stall an entire API tier.

---

### Challenge 4: The Overflowing Suitcase (work_mem and Sort Spill)

**The Complaint**: "Our monthly invoice sort is using a lot of disk. The `temp_files` counter in `pg_stat_database` is growing."

**The Clue**: `wait_event = 'BufFileWrite'`, `Sort Method: external merge Disk:` in EXPLAIN.

#### The Naive Query

```sql
-- ❌ The Memory-Starved Sort
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    a.name,
    o.order_time,
    sum(oi.quantity * d.price) AS invoice_total
FROM animals a
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
GROUP BY a.name, o.order_time, o.id
ORDER BY a.name, o.order_time;
```

#### The Fallout

On a large dataset, the plan will contain:

```text
-- Key Plan Signals:
Sort (actual time=5234.123..5891.456 rows=200000 loops=1)
  Sort Key: a.name, o.order_time
  Sort Method: external merge  Disk: 8864kB   -- ⚠️ Spilled to disk!
  Buffers: shared hit=1234, temp read=553 written=553
```

**Diagnosis**: This is a **Configuration Failure**. The default `work_mem` (4MB) is insufficient to hold the sort buffer for 200,000 rows in memory. Postgres falls back to an **External Merge Sort** — writing intermediate sort runs to disk temp files.

The `wait_event` you will observe is **`IO:BufFileWrite`** — the engine explicitly waiting for temporary file I/O to complete as it spills the huge sort to disk.

```bash
# Trace the sort spill
sudo ./pg_wait_tracer --view system_event --count 1
```

**What you will see**:

```text
Stat Name                               Time (ms)    % DB Time
────────────────────────────────────    ──────────   ──────────
DB Time                                 10250.0      100.0%
CPU*                                    3650.0       35.6%
IO                                      6450.0       62.9%
  BufFileWrite                          4250.0       41.5%
  BufFileRead                           2200.0       21.4%
```

#### The Lazy Fix

Increase `work_mem` for this session rather than globally (see the caution below):

```sql
-- ✅ The Fix: Increase the Private Desk budget for this session
SET work_mem = '64MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT
    a.name,
    o.order_time,
    sum(oi.quantity * d.price) AS invoice_total
FROM animals a
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
GROUP BY a.name, o.order_time, o.id
ORDER BY a.name, o.order_time;
```

**The Reward**: The plan shows `Sort Method: quicksort  Memory: 32MB` — entirely in-memory. The `temp read/written` buffers drop to zero. The wait event disappears.

> [!CAUTION]
> **The `work_mem` Trap**: `work_mem` is allocated **per sort/hash operation, per process**. If 100 concurrent sessions each have 2 sort operations, setting `work_mem = 64MB` can allocate `100 × 2 × 64MB = 12.8GB` of RAM without warning. Always prefer **session-level** `SET work_mem` for expensive queries over global configuration changes.
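The complaint mentioned the `temp_files` counter, and it is worth reading it before and after the fix. A quick check against `pg_stat_database` (the counters are cumulative since the last statistics reset) might look like this:

```sql
-- How many temporary files has this database written, and how much data?
-- Take a reading before and after the work_mem change to see the difference.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_volume
FROM pg_stat_database
WHERE datname = current_database();
```

If the counter keeps climbing after the `work_mem` change, something else is still spilling; hash joins obey the same per-operation budget as sorts. Setting `log_temp_files = 0` logs every temporary file together with the query that created it, which makes the remaining culprit easy to find.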
---

### Challenge 5: The Full Matrix (A Multi-Signal Regression)

**The Complaint**: "Everything is slow. I don't know where to start."

This is the most realistic challenge of all. The simulated load combines three potential bottlenecks at once: CPU pressure, lock contention, and disk I/O. Your job is to identify which signal is the *primary* constraint.

#### The Investigation Trace

Run the `pg_wait_tracer` dump command while your simulated load is running:

```bash
# Capture 10 seconds of exact cluster activity
sudo ./pg_wait_tracer --count 1 --interval 10 --dump /tmp/traces --pgdata /var/lib/postgresql/data
```

**Interpreting the Tracer AAS Breakdown**:

```text
--- 2024-05-20 15:00:10 ---
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Time Model          Backends: 25          Interval: 10s
════════════════════════════════════════════════════════════════════════════════
Wait Class      AAS      % DB Time    CPU%     Wait%
──────────      ──────   ────────     ──────   ──────
CPU             14.20    56.8%        100%     0%
Lock            9.15     32.6%        0%       100%
IO              3.65     10.6%        0%       100%
```

| What you see | Primary Signal | Chapter Reference |
| :--- | :--- | :--- |
| `wait_event IS NULL`, `query_age` growing | CPU-Bound (Complexity Failure) | [[Manuscript/07 - Wait Events & Concurrency/7.7 - CPU & Algorithms|7.7 CPU & Algorithms]] |
| `wait_event_type = 'IO'`, `wait_event = 'DataFileRead'` | I/O-Bound (Throughput Failure) | [[Manuscript/07 - Wait Events & Concurrency/7.3 - Storage Mechanics (The Shipping Container)|7.3 Storage Mechanics]] |
| `wait_event_type = 'IO'`, `wait_event = 'WALWrite'` | WAL I/O Saturated | [[Manuscript/07 - Wait Events & Concurrency/7.2 - WAL Mechanics (The Sequential Diary)|7.2 WAL Mechanics]] |
| `wait_event_type = 'Lock'`, `wait_event = 'relation'` | DDL Blocking Readers | [[Manuscript/07 - Wait Events & Concurrency/7.5 - Heavyweight Locking (The Narrow Gate)|7.5 Heavyweight Locking]] |
| `wait_event_type = 'Lock'`, `wait_event = 'transactionid'` | Row-Level Contention | [[Manuscript/07 - Wait Events & Concurrency/7.5 - Heavyweight Locking (The Narrow Gate)|7.5 Heavyweight Locking]] |
| `wait_event_type = 'LWLock'`, `wait_event = 'WALWrite'` | Parallel WAL Hotspot | [[Manuscript/07 - Wait Events & Concurrency/7.2 - WAL Mechanics (The Sequential Diary)|7.2 WAL Mechanics]] |
| `wait_event_type = 'IPC'`, `wait_event = 'ExecuteGather'` | Parallel Worker Coordination | [[Manuscript/07 - Wait Events & Concurrency/7.6 - Parallelism & Coordination|7.6 Parallelism & Coordination]] |
| `wait_event_type = 'Client'`, `wait_event = 'ClientRead'` | App holding idle connection | [[Manuscript/07 - Wait Events & Concurrency/7.1.1 - Connectivity & Network (The Starvation)|7.1.1 Connectivity & Network]] |

> [!IMPORTANT]
> **The Diagnostic Axis** (one final time): `EXPLAIN (ANALYZE, BUFFERS)` tells you **what** Postgres is doing. `pg_stat_activity` tells you **why** it has stopped. You always need both.

If all you have is a `psql` prompt, a catalog-only query that walks this table for you is sketched below.

%%
---

| ← Previous | ↑ Table of Contents | Next → |
| :--- | :---: | ---: |
| [[Manuscript/07 - Wait Events & Concurrency/7.8 - Concurrency & Performance Reasoning|7.8 Performance Reasoning]] | [[Manuscript/00 - Introduction/Index|Home]] | [[Manuscript/07 - Wait Events & Concurrency/7.10 - Advanced Concurrency (The Stress Test)|7.10 Stress Test]] |
%%
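The decision table above assumes you already know which wait event dominates. If all you have is a slow cluster and a `psql` prompt, a single sample of `pg_stat_activity` grouped by wait class is usually enough to pick a row in that table. A minimal sketch (it is a point-in-time sample, so repeat it a few times or run it with `\watch 1`):

```sql
-- One snapshot of where every active backend is spending its time.
-- A NULL wait_event is shown as 'CPU', matching the classification above.
-- Repeated samples approximate the tracer's AAS breakdown.
SELECT coalesce(wait_event_type, 'CPU')  AS wait_class,
       coalesce(wait_event, '-')          AS wait_event,
       count(*)                           AS backends,
       max(now() - query_start)           AS longest_query_age
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_type = 'client backend'
  AND pid <> pg_backend_pid()
GROUP BY 1, 2
ORDER BY backends DESC;
```

Match the top row against the table above: a pile of `Lock:transactionid` sends you to 7.5, `IO:DataFileRead` to 7.3, and a large `CPU` bucket sends you back to the query plan.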