## 7.9 Performance Audit (Wait Event Exercises)
![[assets/arch_performance_audit.png|450]]
Welcome to the **Performance Audit**.
In Chapter 4, the subject of investigation was always the Query Plan. Here, the subject is the *entire running database*. The wait event is the primary symptom. **`pg_wait_tracer`** provides the high-precision timeline. `EXPLAIN (ANALYZE, BUFFERS)` provides the architectural audit.
For each challenge, a query has gone wrong. Run it against your local Elephant Cafe database, observe both the plan output *and* what the **`pg_wait_tracer`** reveals, and then apply the fix.
> [!NOTE]
> **Setup**: These exercises work best with a large dataset. Run `psql -f scripts/seed_large.sql` to populate the database with 100,000+ orders and 20,000 animals. Then run `ANALYZE;` to refresh the planner's statistics.
---
### Challenge 1: The Silent Sweat (Classifying a NULL Wait Event)
**The Complaint**: "Our nightly report is slow. I checked `pg_stat_activity` and the process doesn't seem to be waiting for anything!"
**The Clue**: A NULL `wait_event` during high CPU usage.
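Before reaching for the tracer, you can reproduce the complaint by hand. A minimal sketch that polls `pg_stat_activity` from a second `psql` session while the report runs:
```sql
-- What is each non-idle backend doing, and is it waiting on anything?
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS query_age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND pid <> pg_backend_pid()
ORDER BY query_age DESC;
```
For the nightly report you will see `wait_event_type` and `wait_event` both NULL while `query_age` keeps climbing: the backend is burning CPU, not waiting.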
#### The Naive Query
```sql
-- ❌ The Slow Nightly Report
EXPLAIN (ANALYZE, BUFFERS)
SELECT
s.name AS species,
s.diet_type,
count(o.id) AS total_orders,
sum(oi.quantity) AS total_items,
avg(i.base_cost_per_kg) AS avg_ingredient_cost
FROM animals a
JOIN species s ON a.species_id = s.id
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
JOIN dish_ingredients di ON di.dish_id = d.id
JOIN ingredients i ON i.id = di.ingredient_id
WHERE extract(year FROM o.order_time) = 2024
GROUP BY s.name, s.diet_type
ORDER BY total_orders DESC;
```
#### The Audit Log
To see this in action, we can use `pgbench` to simulate several analysts running this exact report at the same time:
```bash
# Save the bad query to a file
echo "
SELECT s.name, sum(oi.quantity), avg(i.base_cost_per_kg)
FROM animals a
JOIN species s ON a.species_id = s.id
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
JOIN dish_ingredients di ON di.dish_id = d.id
JOIN ingredients i ON i.id = di.ingredient_id
WHERE extract(year FROM o.order_time) = 2024
GROUP BY s.name;
" > slow_report.sql
# Simulate 4 concurrent analysts running it repeatedly for 10 seconds
# (-n skips vacuuming the standard pgbench tables, which don't exist here)
pgbench -n -c 4 -T 10 -f slow_report.sql -U postgres elephant_cafe
```
While that is running, trace it using `pg_wait_tracer` in another terminal:
```bash
# Capture and dump the precise wait summary
sudo ./pg_wait_tracer --count 1 --interval 5 --view time_model --pgdata /var/lib/postgresql/data
```
**What you will see**:
```text
--- 2024-05-20 14:30:05 ---
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Time Model Backends: 12 Interval: 5s
════════════════════════════════════════════════════════════════════════════════
Wait Class AAS % DB Time CPU% Wait%
────────── ────── ──────── ────── ──────
CPU 4.96 99.1% 100% 0%
Client 1.04 1.9% 0% 100%
IO 1.00 0% 0% 0%
```
The tracer reports an Average Active Sessions (AAS) value close to `5.0`: your 4 pgbench connections plus a contribution from the parallel workers they launch. More importantly, the cluster-wide `Wait%` is near `0%` while `CPU%` sits at `100%`. There is no `DataFileRead` anywhere. The workload is entirely CPU-bound.
The EXPLAIN output reveals why:
```text
-- Key Plan Signals:
Gather (actual time=2.234..8923.456 rows=5 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (...)
-> Parallel Hash Join (...)
-> Parallel Seq Scan on orders o (...)
Filter: (EXTRACT(year FROM order_time) = 2024)
Rows Removed by Filter: 150000 -- ⚠️ Too much work
```
**Diagnosis**: This is a **Complexity Failure** (The Hamster Wheel). The `extract(year FROM o.order_time)` predicate is non-sargable — it forces a full Sequential Scan of the orders table and then discards 150,000 rows. The 7-table join is then computed over the full unfiltered dataset.
#### The Lazy Fix
Unwrap the date function. Give the planner a range it can use with the `idx_orders_order_time` index:
```sql
-- ✅ The Fix: Sargable date range
EXPLAIN (ANALYZE, BUFFERS)
SELECT
s.name AS species,
s.diet_type,
count(o.id) AS total_orders,
sum(oi.quantity) AS total_items,
avg(i.base_cost_per_kg) AS avg_ingredient_cost
FROM animals a
JOIN species s ON a.species_id = s.id
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
JOIN dish_ingredients di ON di.dish_id = d.id
JOIN ingredients i ON i.id = di.ingredient_id
WHERE o.order_time >= '2024-01-01' AND o.order_time < '2025-01-01'
GROUP BY s.name, s.diet_type
ORDER BY total_orders DESC;
```
**The Reward**: The plan's entry point shifts from a Parallel Seq Scan to a targeted **Index Scan** on `order_time`. The 7-table join is now computed against only the 2024 rows, not all 200,000.
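If the report itself cannot be rewritten (a BI tool may generate the SQL), an expression index is an alternative way to make the original predicate indexable. A minimal sketch; the index name is illustrative, and it assumes `order_time` is `timestamp without time zone` (with `timestamptz` the expression is not immutable and cannot be indexed as-is):
```sql
-- Index the exact expression the report filters on
CREATE INDEX idx_orders_order_year
    ON orders ((extract(year FROM order_time)));
ANALYZE orders;
```
The sargable range above remains the better default: it reuses the existing `idx_orders_order_time` index and works for arbitrary date ranges, not just whole calendar years.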
> [!IMPORTANT]
> **Classification**: `wait_event IS NULL` + high CPU + long execution = **Complexity Failure**. The fix is always architectural: fix the plan, not the hardware.
---
### Challenge 2: Reading the Elevator (I/O Wait Triage)
**The Complaint**: "A simple aggregation over supply deliveries takes 12 seconds in the morning. CPU looks idle. Something else is going on."
**The Clue**: `wait_event_type = 'IO'`, `wait_event = 'DataFileRead'`.
#### The Naive Query
```sql
-- ❌ The Cold-Cache Scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT
supplier_id,
count(*) AS delivery_count,
sum(quantity_kg) AS total_kg
FROM supply_deliveries
GROUP BY supplier_id
ORDER BY total_kg DESC;
```
#### The Audit Log
```bash
# Simulate 10 connections hitting the un-cached aggregate
echo "SELECT supplier_id, count(*) FROM supply_deliveries GROUP BY supplier_id;" > scan.sql
pgbench -n -c 10 -T 10 -f scan.sql -U postgres elephant_cafe
```
While pgbench fires, record the trace:
```bash
sudo ./pg_wait_tracer --count 1 --interval 5 --view system_event --pgdata /var/lib/postgresql/data
```
**What you will see**:
```text
--- 2024-05-20 14:40:05 ---
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — System Events Backends: 18 Interval: 5s
════════════════════════════════════════════════════════════════════════════════
Wait Class Wait Event AAS % DB Time Wait (ms)
────────── ─────────────── ────── ──────── ────────
IO DataFileRead 10.12 91.2% 45600
CPU — 1.85 9.5% 4250
Client ClientRead 1.03 1.3% 150
```
The tracer reports an AAS near `10.0`. But unlike the CPU-bound query in Challenge 1, waiting now accounts for over 90% of DB time, and the `IO:DataFileRead` event dominates the stack. The CPU is idle, waiting for the Elevator.
```text
-- Key Plan Signals:
Seq Scan on supply_deliveries (actual time=1.023..1823.234 rows=100000 loops=1)
Buffers: shared hit=3 read=694 -- ⚠️ 694 physical page reads!
```
**Diagnosis**: This is a **Throughput Failure** (The Elevator Queue). The `supply_deliveries` table's 694 pages are not resident in `shared_buffers` when the morning report runs. Each page must be fetched from physical disk — the "elevator" bringing suitcases up from the Filing Cabinet.
#### The Fix Strategy
This challenge has no single SQL fix — it is a **configuration and hardware decision**:
```sql
-- Diagnostic 1: How large is the table in pages?
SELECT relname, relpages, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname = 'supply_deliveries';

-- Diagnostic 2: What fraction of reads are buffer hits?
-- (nullif avoids a division-by-zero error on a never-read table)
SELECT
    relname,
    heap_blks_read,
    heap_blks_hit,
    round(heap_blks_hit::numeric / nullif(heap_blks_hit + heap_blks_read, 0) * 100, 2) AS hit_pct
FROM pg_statio_user_tables
WHERE relname = 'supply_deliveries';
```
**Fix Ladder** (ordered by impact):
1. **Increase `shared_buffers`**: The Warming Rack. If the table's working set fits in RAM, the reads disappear on the second execution.
2. **Add a targeted index**: If the `GROUP BY supplier_id` query is frequent, an index on `supplier_id` can allow an **Index Scan** with far fewer page reads (see the sketch after this list).
3. **Upgrade storage to SSD**: If the data is genuinely too large to cache, faster I/O is the only hardware lever.
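As a sketch of rung 2: the index name is illustrative, `INCLUDE` needs PostgreSQL 11 or newer, and an Index-Only Scan additionally relies on a reasonably fresh visibility map, hence the `VACUUM`:
```sql
-- Covering index: supplier_id drives the GROUP BY, quantity_kg rides
-- along so the aggregate never has to visit the heap.
CREATE INDEX idx_supply_deliveries_supplier
    ON supply_deliveries (supplier_id) INCLUDE (quantity_kg);
VACUUM ANALYZE supply_deliveries;
```
Re-run the EXPLAIN afterwards: if the planner switches to an Index-Only Scan, the `Buffers` line reports reads against the (smaller) index rather than all 694 heap pages.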
> [!TIP]
> **The Buffer Hit Ratio**: A healthy OLTP database should have a buffer hit rate above **99%**. If you see 60–80%, your `shared_buffers` is too small for your working set. Add RAM before you add indexes.
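The same ratio can be checked for the whole database rather than one table; a quick sketch against `pg_stat_database` (counters are cumulative since the last stats reset):
```sql
-- Database-wide buffer cache hit ratio
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0) * 100, 2) AS hit_pct
FROM pg_stat_database
WHERE datname = current_database();
```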
---
### Challenge 3: The Stuck Frog (Diagnosing Lock Contention)
**The Complaint**: "Our order processing API has started queuing up. Simple `UPDATE` calls are taking 30+ seconds."
**The Clue**: `wait_event_type = 'Lock'`, `wait_event = 'transactionid'`.
#### The Audit Log
In your first session, simulate a long-running transaction that holds a row lock:
```sql
-- Session 1: The Blocker (do not COMMIT yet)
BEGIN;
UPDATE orders SET status = 'Cooking' WHERE id = 1;
-- ↑ This holds a RowExclusive lock and an exclusive lock on its own XID.
```
In a second session, simulate a competing update:
```sql
-- Session 2: The Blocked Process
UPDATE orders SET status = 'Cancelled' WHERE id = 1;
-- ↑ This will hang, waiting for Session 1's transaction to release.
```
While Session 2 is waiting, inspect the active sessions using the tracer's **Active view** to identify the bottleneck:
```bash
# Observe the active sessions in real-time
sudo ./pg_wait_tracer --view active --count 1
```
**What you will see**:
```text
PID State Wait Event Wait (ms) DB Time (ms) Backend Type
────── ───────── ─────────────────── ────────── ──────────── ────────────
8551 waiting Lock:transactionid 14200.0 15800.0 client
8505 idle — — 240.5 client
```
Session 2 (PID 8551) is `waiting` on a `transactionid` lock. The only other client backend, PID 8505, looks harmless, but it is the Blocker: it is sitting inside an open transaction (in `pg_stat_activity` its state is `idle in transaction`), holding the row lock while doing nothing, most likely because someone opened a transaction and forgot to finish it.
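The tracer shows the victim; to confirm the blocker rather than infer it, ask Postgres directly. A minimal sketch using `pg_blocking_pids()` (available since PostgreSQL 9.6), run from a third session:
```sql
-- For every blocked backend, list who is blocking it and what the
-- blocker last executed.
SELECT blocked.pid                 AS blocked_pid,
       blocked.wait_event          AS blocked_on,
       blocking.pid                AS blocking_pid,
       blocking.state              AS blocking_state,
       now() - blocking.xact_start AS blocking_txn_age,
       blocking.query              AS blocker_last_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));
```
In this scenario it reports PID 8551 blocked by PID 8505, whose last statement is Session 1's `UPDATE` and whose transaction age keeps growing.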
**Diagnosis**: This is a **Coordination Failure** (The Stuck Frog / Traffic Jam). Improving hardware is entirely useless. The fix is in **transaction management**.
#### The Fix Strategies
```sql
-- Fix 1: Always commit or rollback promptly in application code.
-- The blocker is Session 1 holding an idle open transaction.
-- Commit it: COMMIT;
-- Fix 2: Use lock_timeout to prevent indefinite waits
SET lock_timeout = '5s';
UPDATE orders SET status = 'Cancelled' WHERE id = 1;
-- This will raise an error after 5s rather than hanging indefinitely.
-- Fix 3: Use advisory locks for application-level sequencing
-- (For cases where table-level row contention is unavoidable)
-- Fix 4: Find and kill a truly stuck blocker (use carefully in production)
SELECT pg_terminate_backend(<blocking_pid>);
```
> [!WARNING]
> **The Lock Cascade**: Lock contention doesn't stay isolated. If Process A is blocked by Process B, and 20 application threads all issue the same `UPDATE`, they will all queue up behind Process A. One forgotten `BEGIN` with no `COMMIT` can stall an entire API tier.
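A complementary server-side guard against exactly this failure mode (assuming PostgreSQL 9.6 or later) is to cap how long a session may sit idle inside an open transaction:
```sql
-- Terminate any session that idles inside a transaction for over 60s.
-- Takes effect after the reload for sessions that haven't overridden it.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
SELECT pg_reload_conf();
```
Pick a value comfortably longer than your longest legitimate transaction; a terminated session loses its uncommitted work.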
---
### Challenge 4: The Overflowing Suitcase (work_mem and Sort Spill)
**The Complaint**: "Our monthly invoice sort is using a lot of disk. The `temp_files` counter in `pg_stat_database` is growing."
**The Clue**: `wait_event = 'BufFileWrite'`, `Sort Method: external merge Disk:` in EXPLAIN.
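Before profiling anything, the counter the complaint mentions can be read directly; a quick sketch (values accumulate until the statistics are reset):
```sql
-- How often, and how much, has this database spilled to temp files?
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_spilled
FROM pg_stat_database
WHERE datname = current_database();
```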
#### The Naive Query
```sql
-- ❌ The Memory-Starved Sort
EXPLAIN (ANALYZE, BUFFERS)
SELECT
a.name,
o.order_time,
sum(oi.quantity * d.price) AS invoice_total
FROM animals a
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
GROUP BY a.name, o.order_time, o.id
ORDER BY a.name, o.order_time;
```
#### The Fallout
On a large dataset, the plan will contain:
```text
-- Key Plan Signals:
Sort (actual time=5234.123..5891.456 rows=200000 loops=1)
Sort Key: a.name, o.order_time
Sort Method: external merge Disk: 8864kB -- ⚠️ Spilled to disk!
Buffers: shared hit=1234, temp read=553 written=553
```
**Diagnosis**: This is a **Configuration Failure**. The default `work_mem` (4MB) is insufficient to hold the sort buffer for 200,000 rows in memory. Postgres falls back to an **External Merge Sort** — writing intermediate sort runs to disk temp files.
The `wait_event` you will observe is **`IO:BufFileWrite`** — the engine explicitly waiting for temporary file I/O to complete as it spills the huge sort to disk.
```bash
# Trace the sort spill
sudo ./pg_wait_tracer --view system_event --count 1
```
**What you will see**:
```text
Stat Name Time (ms) % DB Time
──────────────────────────────────── ────────── ──────────
DB Time 10250.0 100.0%
CPU* 3650.0 35.6%
IO 6450.0 62.9%
BufFileWrite 4250.0 41.5%
BufFileRead 2200.0 21.4%
```
#### The Lazy Fix
Increase `work_mem` for this session only (see the caution below before considering a global change):
```sql
-- ✅ The Fix: Increase the Private Desk budget for this session
SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT
a.name,
o.order_time,
sum(oi.quantity * d.price) AS invoice_total
FROM animals a
JOIN orders o ON o.animal_id = a.id
JOIN order_items oi ON oi.order_id = o.id
JOIN dishes d ON d.id = oi.dish_id
GROUP BY a.name, o.order_time, o.id
ORDER BY a.name, o.order_time;
```
**The Reward**: The plan now reports `Sort Method: quicksort Memory: 32768kB`, an entirely in-memory sort. The `temp read/written` buffers drop to zero, and the `BufFileWrite` wait event disappears.
> [!CAUTION]
> **The `work_mem` Trap**: `work_mem` is allocated **per sort/hash operation, per process**. If 100 concurrent sessions each have 2 sort operations, setting `work_mem = 64MB` can allocate `100 × 2 × 64MB = 12.8GB` of RAM without warning. Always prefer **session-level** `SET work_mem` for expensive queries over global configuration changes.
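One way to keep the bump contained, especially when queries arrive through a connection pool, is to scope it to a single transaction with `SET LOCAL`; a minimal sketch:
```sql
-- The setting reverts automatically at COMMIT or ROLLBACK, so it cannot
-- leak into other work running on the same pooled connection.
BEGIN;
SET LOCAL work_mem = '64MB';
-- ... run the invoice sort here ...
COMMIT;
```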
---
### Challenge 5: The Full Matrix (A Multi-Signal Regression)
**The Complaint**: "Everything is slow. I don't know where to start."
This is the most realistic challenge of all. The trace below comes from a workload where three potential bottlenecks are present at the same time. Your job is to identify which signal is the *primary* constraint.
#### The Investigation Trace
Run the combined `pg_wait_tracer` dump commands while your simulated load is running:
```bash
# Capture 10 seconds of cluster-wide activity and dump the raw samples
sudo ./pg_wait_tracer --count 1 --interval 10 --dump /tmp/traces --pgdata /var/lib/postgresql/data
```
**Interpreting the Tracer AAS Breakdown**:
```text
--- 2024-05-20 15:00:10 ---
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Time Model Backends: 25 Interval: 10s
════════════════════════════════════════════════════════════════════════════════
Wait Class AAS % DB Time CPU% Wait%
────────── ────── ──────── ────── ──────
CPU 14.20 56.8% 100% 0%
Lock 9.15 32.6% 0% 100%
IO 3.65 10.6% 0% 100%
```
CPU is the largest consumer in this trace, but nearly a third of DB time is queued behind locks, and a session waiting on someone else's transaction gets no benefit from CPU tuning (Challenge 3). Whichever signal dominates your own trace, the matrix below maps it to the chapter that covers its fix.

| What you see | Primary Signal | Chapter Reference |
| :--- | :--- | :--- |
| `wait_event IS NULL`, `query_age` growing | CPU-Bound (Complexity Failure) | [[Manuscript/07 - Wait Events & Concurrency/7.7 - CPU & Algorithms|7.7 CPU & Algorithms]] |
| `wait_event_type = 'IO'`, `wait_event = 'DataFileRead'` | I/O-Bound (Throughput Failure) | [[Manuscript/07 - Wait Events & Concurrency/7.3 - Storage Mechanics (The Shipping Container)|7.3 Storage Mechanics]] |
| `wait_event_type = 'IO'`, `wait_event = 'WALWrite'` | WAL I/O Saturated | [[Manuscript/07 - Wait Events & Concurrency/7.2 - WAL Mechanics (The Sequential Diary)|7.2 WAL Mechanics]] |
| `wait_event_type = 'Lock'`, `wait_event = 'relation'` | DDL Blocking Readers | [[Manuscript/07 - Wait Events & Concurrency/7.5 - Heavyweight Locking (The Narrow Gate)|7.5 Heavyweight Locking]] |
| `wait_event_type = 'Lock'`, `wait_event = 'transactionid'` | Row-Level Contention | [[Manuscript/07 - Wait Events & Concurrency/7.5 - Heavyweight Locking (The Narrow Gate)|7.5 Heavyweight Locking]] |
| `wait_event_type = 'LWLock'`, `wait_event = 'WALWrite'` | Parallel WAL Hotspot | [[Manuscript/07 - Wait Events & Concurrency/7.2 - WAL Mechanics (The Sequential Diary)|7.2 WAL Mechanics]] |
| `wait_event_type = 'IPC'`, `wait_event = 'ExecuteGather'` | Parallel Worker Coordination | [[Manuscript/07 - Wait Events & Concurrency/7.6 - Parallelism & Coordination|7.6 Parallelism & Coordination]] |
| `wait_event_type = 'Client'`, `wait_event = 'ClientRead'` | App holding idle connection | [[Manuscript/07 - Wait Events & Concurrency/7.1.1 - Connectivity & Network (The Starvation)|7.1.1 Connectivity & Network]] |
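
If the tracer is not available, a repeated snapshot of `pg_stat_activity` approximates the same breakdown. A minimal sketch; a single poll is a sample, not a time model, so run it several times and watch which class keeps winning:
```sql
-- Group the currently active backends by wait class.
-- A NULL wait_event_type means the backend is on CPU.
SELECT coalesce(wait_event_type, 'CPU') AS wait_class,
       count(*)                         AS active_sessions
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
GROUP BY 1
ORDER BY active_sessions DESC;
```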
> [!IMPORTANT]
> **The Diagnostic Axis** (one final time): `EXPLAIN (ANALYZE, BUFFERS)` tells you **what** Postgres is doing. `pg_stat_activity` tells you **why** it has stopped. You always need both.
%%
---
| ← Previous | ↑ Table of Contents | Next → |
| :--- | :---: | ---: |
| [[Manuscript/07 - Wait Events & Concurrency/7.8 - Concurrency & Performance Reasoning|7.8 Performance Reasoning]] | [[Manuscript/00 - Introduction/Index|Home]] | [[Manuscript/07 - Wait Events & Concurrency/7.10 - Advanced Concurrency (The Stress Test)|7.10 Stress Test]] |
%%