## 7.10 Advanced Concurrency (The Stress Test)
![[assets/arch_stress_test.png|450]]
If the Performance Audit was about reading a single isolated failure, the **Stress Test** is about analyzing a cascading collapse under extreme load.
Real production databases do not suffer from one clean bottleneck at a time. They suffer from CPU saturation *while* a lock cascade is forming *while* autovacuum is trying to freeze the most actively-written table in the system. This lab recreates those conditions using **`pg_wait_tracer`** to provide clinical proof.
> [!IMPORTANT]
> **Setup**: These challenges require the large dataset. Run `psql -f scripts/seed_large.sql` if you haven't already. To generate the workloads, you will need to open multiple `psql` sessions or use `pgbench`.
---
### Challenge 1: Engineering a Deadlock (and Surviving It)
**The Goal**: Intentionally create a deadlock, observe Postgres resolve it, and learn how to prevent it in application code.
#### The Setup
Open **two separate `psql` sessions** and run these transactions concurrently (interleave them one statement at a time):
```sql
-- === SESSION 1 ===
BEGIN;
-- Step 1: Lock order 1
UPDATE orders SET status = 'Cooking' WHERE id = 1;
```
```sql
-- === SESSION 2 ===
BEGIN;
-- Step 2: Lock order 2
UPDATE orders SET status = 'Cooking' WHERE id = 2;
```
```sql
-- === SESSION 1 ===
-- Step 3: Try to lock order 2 (held by Session 2)
-- Session 1 will now WAIT.
UPDATE orders SET status = 'Served' WHERE id = 2;
```
```sql
-- === SESSION 2 ===
-- Step 4: Try to lock order 1 (held by Session 1)
-- This creates the deadlock cycle. Postgres will detect and kill one.
UPDATE orders SET status = 'Served' WHERE id = 1;
```
After roughly one second (`deadlock_timeout`), Postgres detects the cycle and aborts one of the two transactions; the survivor can proceed and `COMMIT`.
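The losing session aborts with the standard deadlock error (the PIDs and transaction IDs below are illustrative):
```text
ERROR:  deadlock detected
DETAIL:  Process 12345 waits for ShareLock on transaction 901; blocked by process 67890.
Process 67890 waits for ShareLock on transaction 900; blocked by process 12345.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (0,1) in relation "orders"
```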
#### The Forensic Evidence
To catch a deadlock in flight, before the detector breaks it, or to understand the contention pattern in a high-concurrency system, use the **Active view**:
```bash
# Observe the race in real-time
sudo ./pg_wait_tracer --view active --count 1
```
**What you will see**:
```text
PID State Wait Event Wait (ms) DB Time (ms) Backend Type
────── ───────── ─────────────────── ────────── ──────────── ────────────
12345 waiting Lock:transactionid 850.5 4200.0 client
67890 waiting Lock:transactionid 420.2 3800.0 client
```
Both processes are stuck in `Lock:transactionid` simultaneously. In a deadlock, they will remain in this state until the detector fires: it wakes after `deadlock_timeout` (1s by default) and aborts one victim with SQLSTATE `40P01`.
#### The Prevention Pattern
Deadlocks are almost always caused by inconsistent lock ordering. The fix is architectural: **always acquire locks in the same order** across all code paths.
```sql
-- ✅ The Prevention: Always lock in ascending ID order
-- Application code should sort the IDs before issuing updates.
-- In a single transaction, process IDs [2, 1] → sort to [1, 2] first:
BEGIN;
UPDATE orders SET status = 'Cooking' WHERE id = 1; -- Always lock lower ID first
UPDATE orders SET status = 'Cooking' WHERE id = 2; -- Then higher ID
COMMIT;
```
> [!WARNING]
> **The application's responsibility**: Postgres will always resolve deadlocks eventually, but it does so by aborting a transaction—which means lost work and retry overhead for your application. Application-level lock ordering is the only reliable prevention strategy.
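When a transaction must touch an arbitrary set of rows, you can also enforce the ordering inside SQL itself instead of trusting every code path to sort; a minimal sketch (with a simple plan, `FOR UPDATE` acquires the row locks in the sorted order):
```sql
BEGIN;
-- Lock every target row up front, in ascending id order
SELECT id FROM orders
WHERE id IN (2, 1)   -- whatever set the application computed
ORDER BY id          -- consistent acquisition order
FOR UPDATE;

UPDATE orders SET status = 'Cooking' WHERE id IN (1, 2);
COMMIT;
```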
---
### Challenge 2: The Lock Upgrade Trap (Share → Exclusive)
**The Goal**: Discover why reading before writing can create unnecessary contention.
#### The Problem Pattern
A common application pattern is: (1) read a row to check its current state, then (2) update it conditionally. When done naively with two separate statements, this creates a dangerous window.
```sql
-- ❌ The Two-Statement Pattern (creates a lock upgrade gap)
-- Session 1:
BEGIN;
SELECT status FROM orders WHERE id = 42;
-- Application logic checks: "Is it Pending? Yes. Update it."
-- ... but Session 2 could UPDATE id=42 between the SELECT and UPDATE ...
UPDATE orders SET status = 'Cooking' WHERE id = 42 AND status = 'Pending';
COMMIT;
```
Between the `SELECT` and the `UPDATE`, another session can change order 42. Session 1's conditional `UPDATE` then silently matches 0 rows (because `status = 'Pending'` no longer holds), and no error is raised; the application may wrongly assume it succeeded.
#### The Fix: Atomic Check-and-Set
```sql
-- ✅ Pattern 1: Collapse into a single atomic UPDATE
-- No read needed; the WHERE clause is the atomic check.
UPDATE orders
SET status = 'Cooking'
WHERE id = 42 AND status = 'Pending'
RETURNING id, status;
-- If rows = 0, the status was already changed by someone else.
```
```sql
-- ✅ Pattern 2: SELECT FOR UPDATE to hold the lock during the check
BEGIN;
SELECT status FROM orders WHERE id = 42 FOR UPDATE;
-- Now this session holds the strongest row-level lock on id=42.
-- Any other session attempting UPDATE id=42 will block until COMMIT.
UPDATE orders SET status = 'Cooking' WHERE id = 42;
COMMIT;
```
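If blocking is unacceptable, `FOR UPDATE` accepts two standard modifiers that change what happens when the target row is already locked:
```sql
-- Error immediately (SQLSTATE 55P03) instead of waiting
SELECT status FROM orders WHERE id = 42 FOR UPDATE NOWAIT;

-- Skip locked rows entirely: the canonical worker-queue pattern
SELECT id FROM orders
WHERE status = 'Pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
```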
#### The Investigation Trace
When using `SELECT FOR UPDATE` under load, you will see the **Lock** wait class spike in the **Active view**:
```bash
# Capture the upgrade contention
sudo ./pg_wait_tracer --view active --count 1
```
**What you will see**:
```text
PID State Wait Event Wait (ms) DB Time (ms) Backend Type
────── ───────── ─────────────────── ────────── ──────────── ────────────
9001 waiting Lock:transactionid 450.0 8900.5 client
8999 on cpu — — 9200.2 client
```
By correlating the PIDs, you can see that 9001 is waiting on a lock likely held by 8999.
> [!TIP]
> **`RETURNING` is the most efficient pattern** for check-and-set operations in Postgres. It collapses the read, the conditional check, and the write into a single atomic statement with no lock upgrade gap and no extra round-trip.
---
### Challenge 3: The Vacuum Interruption (Autovacuum vs. the Long Transaction)
**The Goal**: Understand why long-running transactions can cause table bloat to explode during high-write periods.
#### The Mechanics
Autovacuum cannot reclaim a dead tuple if any active transaction still holds a snapshot taken before the tuple was deleted (before its `xmax`). This is the **transaction horizon** problem.
```sql
-- Simulate a long-running read that blocks vacuum
-- Session 1: Open a long transaction with an old snapshot
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM animals; -- Holds one snapshot for the whole transaction, pinning the horizon
-- Session 2: In another session, cause heavy writes
-- This creates thousands of dead tuples that Autovacuum cannot reclaim.
DELETE FROM supply_deliveries WHERE delivery_time < '2023-01-01';
-- The autovacuum will start but will be unable to advance past
-- Session 1's snapshot horizon.
```
**Investigation Trace**:
Instead of manual SQL, use the tracer to confirm that Autovacuum has indeed stopped making progress:
```bash
# Look for the Autovacuum worker and its state
sudo ./pg_wait_tracer --view active --count 1
```
**What you will see**:
```text
PID State Wait Event Wait (ms) DB Time (ms) Backend Type
────── ───────── ─────────────────── ────────── ──────────── ────────────
102 on cpu — — 45000.0 autovacuum worker
105 idle — — 1200.5 client
```
If the autovacuum worker is `on cpu` but the table's `n_dead_tup` (in `pg_stat_user_tables`) isn't decreasing, and PID 105 shows a very old transaction start, you have found the horizon wall.
**Diagnosis**: A high `dead_pct` combined with a `last_autovacuum` timestamp that is "stuck" (not advancing) points directly to an old transaction horizon blocking the Housekeepers.
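To identify the pinning session with plain SQL, the standard `pg_stat_activity` view complements the tracer; a quick horizon check:
```sql
-- Oldest transactions first: any long-lived row here can be holding back vacuum
SELECT pid, state, backend_xmin,
       now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_age DESC NULLS LAST
LIMIT 5;
```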
#### The Resolution
```sql
-- Option 1: Commit or rollback the old transaction (Session 1)
-- This releases the snapshot horizon and allows Autovacuum to proceed.
COMMIT; -- or ROLLBACK;
-- Option 2: Force a manual vacuum after clearing the old transaction
VACUUM ANALYZE supply_deliveries;
-- Option 3: For emergency bloat, use VACUUM FULL (locks the table!)
-- Use only when ordinary VACUUM is insufficient.
VACUUM FULL supply_deliveries;
```
> [!CAUTION]
> **`VACUUM FULL` acquires `AccessExclusiveLock`**. It rewrites the entire table and blocks all reads and writes for the duration. It should only be used as an emergency recovery measure — not as a routine maintenance strategy.
---
### Challenge 4: The DDL Landmine (The AccessExclusive Lock Queue)
**The Goal**: Experience how a single DDL statement can create a thundering herd of blocked processes.
#### The Setup
This is the most dangerous scenario in a production system. Run the following in sequence, observing the lock queue:
```sql
-- Session 1: A long-running read (common in OLAP workloads)
BEGIN;
SELECT count(*) FROM orders -- This holds an AccessShare lock
JOIN order_items oi ON oi.order_id = orders.id;
-- Do NOT commit yet.
```
```sql
-- Session 2: A schema migration (ALTER TABLE)
-- This will BLOCK waiting for Session 1 to release its AccessShare lock.
ALTER TABLE orders ADD COLUMN notes TEXT;
-- ↑ Requires AccessExclusive lock. All subsequent reads will also BLOCK here.
```
```sql
-- Session 3: A new application read (arrives AFTER the ALTER)
-- This will also BLOCK, queued behind Session 2's AccessExclusive request!
SELECT * FROM orders WHERE id = 999;
```
**What Happens**: Session 2's `ALTER TABLE` queues behind Session 1. But because its `AccessExclusive` *request* is now sitting in the lock queue, Session 3's `SELECT` — which would normally proceed instantly — must queue behind Session 2. **One long-running read + one DDL = all subsequent reads blocked.**
#### The Investigation Trace
This "Thundering Herd" is clearly visible in the **Active view**, where multiple PIDs are stuck on the same `relation` lock concurrently.
```bash
# Observe the DDL lock queue
sudo ./pg_wait_tracer --view active --count 1
```
**What you will see**:
```text
PID State Wait Event Wait (ms) DB Time (ms) Backend Type
────── ───────── ─────────────────── ────────── ──────────── ────────────
300 waiting Lock:relation 8900.0 9500.5 client
200 waiting Lock:relation 9200.0 9800.2 client
100 on cpu — — 12000.0 client
```
Until you kill or commit PID 100, the entire queue remains frozen.
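If you only have `psql` at hand, Postgres ships a built-in shortcut that performs the waiter-to-blocker inference for you, no recursive CTE required:
```sql
-- Who is blocked, and by whom (pg_blocking_pids is built in since 9.6)
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       left(query, 50) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```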
#### The Resolution Patterns
```sql
-- Option 1: Always set lock_timeout for DDL in production
SET lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN notes TEXT;
-- If it can't get the lock in 3s, it errors out cleanly
-- rather than creating an indefinite queue.
```
```sql
-- Option 2: Cancel the long-running read to unblock the DDL (if the application allows)
SELECT pg_cancel_backend(<session1_pid>);
```
```sql
-- Option 3: For additive-only changes, prefer online-friendly methods.
-- Pre-PG11, adding a column WITH a DEFAULT forced a full table rewrite;
-- in PG11+, a constant DEFAULT is written as metadata only (no rewrite).
-- Adding a nullable column with no DEFAULT has always been instant.
-- Check the pg_repack extension for rebuilding tables without long locks.
```
> [!IMPORTANT]
> **Production Rule of Thumb**: Never issue a `VACUUM FULL`, `ALTER TABLE` (with rewrite), or `REINDEX` during peak traffic hours without `lock_timeout` and a migration plan to cancel if the lock queue grows.
---
### Challenge 5: The Hot Page (LWLock:Buffer_content)
**The Goal**: Witness memory-level contention (Latches) during extreme concurrency.
#### The Setup
We will simulate a "Hot Page" scenario where 30 workers hammer a handful of rows that all live on the same 8KB heap page.
```bash
# 30 concurrent connections updating random rows among ids 1-10 (same heap page)
cat > hot_page.sql <<'EOF'
\set rid random(1, 10)
UPDATE orders SET status = 'Cooking' WHERE id = :rid;
EOF
pgbench -n -c 30 -T 10 -f hot_page.sql -U postgres elephant_cafe  # -n skips pgbench's vacuum step
```
#### The Investigation Trace
```bash
sudo ./pg_wait_tracer --view system_event --interval 5 --count 1
```
**What you will see**:
```text
Wait Class Wait Event AAS % DB Time Wait (ms)
────────── ─────────────── ────── ──────── ────────
LWLock Buffer_content 28.42 94.7% 142100
CPU — 2.58 6.3% 7900
```
**Diagnosis**: This is **Latch Contention** (The Narrow Gate). Unlike row locks (`Lock:transactionid`), where waiters are civil and sleep until notified, `LWLock:Buffer_content` indicates that backends are fighting over the memory-level lock protecting the physical 8KB page in the **Warming Rack**.
The Elephant is spending more time managing the line than serving the coffee.
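When the hot row is really an aggregate (a counter, a running total), the standard mitigation is to shard it so the heat spreads across several rows and pages; a sketch using a hypothetical `order_counters` table:
```sql
-- Hypothetical sharded counter: 8 rows instead of one hot row
CREATE TABLE order_counters (shard int PRIMARY KEY, cooking_count bigint NOT NULL DEFAULT 0);
INSERT INTO order_counters SELECT g, 0 FROM generate_series(0, 7) AS g;

-- Writers pick a random shard, spreading the contention
UPDATE order_counters
SET cooking_count = cooking_count + 1
WHERE shard = floor(random() * 8)::int;

-- Readers aggregate across shards
SELECT sum(cooking_count) FROM order_counters;
```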
---
### Challenge 6: The WAL Jam (LWLock:WALWrite)
**The Goal**: Identify when the Pocket Diary (Disk) cannot keep up with the rate of incoming transactions.
#### The Setup
We will run a write-heavy workload with no batching, forcing a synchronous WAL flush for every tiny transaction.
```bash
# Force maximum WAL pressure with 10 connections doing un-batched updates
# (-N is pgbench's built-in simple-update script; initialize once with `pgbench -i elephant_cafe`)
pgbench -N -c 10 -T 10 -U postgres elephant_cafe
```
#### The Investigation Trace
```bash
sudo ./pg_wait_tracer --view system_event --interval 5 --count 1
```
**What you will see**:
```text
Wait Class Wait Event AAS % DB Time Wait (ms)
────────── ─────────────── ────── ──────── ────────
LWLock WALWrite 8.12 71.2% 35600
IO WALWrite 3.34 23.4% 11700
CPU — 1.54 6.4% 2700
```
**Diagnosis**: This is a **Write Path Throughput Failure**. `IO:WALWrite` is the hardware limit—the disk physically writing the bytes. `LWLock:WALWrite` is the software queue—dozens of backends waiting for whichever process currently holds the lock to finish flushing so they can commit.
If `LWLock:WALWrite` dominates your stack, you don't need "better SQL." You need **faster storage** (NVMe) or **transaction batching**.
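Short of new hardware, two stock Postgres knobs can relieve commit pressure; both are trade-offs rather than free wins:
```sql
-- Per session or per transaction: commits return before the WAL flush.
-- A crash can lose the last few milliseconds of commits (no corruption).
SET synchronous_commit = off;

-- Group commit: a committing backend waits commit_delay microseconds when
-- at least commit_siblings other transactions are in flight, so a single
-- flush can cover many commits.
ALTER SYSTEM SET commit_delay = 1000;
ALTER SYSTEM SET commit_siblings = 5;
SELECT pg_reload_conf();
```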
---
### Challenge 7: The Full Incident Response (Multi-Signal Production Triage)
**The Full Triage Protocol**: When the system is collapsing, follow this exact sequence.
#### Step 1: The Multi-Window Trace (`pg_wait_tracer`)
```bash
# Get a clear picture of what the cluster is doing right now vs the last 5 minutes
sudo ./pg_wait_tracer --view system_event --window 5s,1m,5m --pgdata /var/lib/postgresql/data
```
```text
Wait Event Last 5s (AAS) Last 1m (AAS) Last 5m (AAS)
─────────── ────────────── ────────────── ──────────────
Lock:transactionid 9.42 2.05 1.21
CPU 5.10 4.95 5.02
IO:DataFileRead 1.12 1.15 1.14
```
- **AAS spike in the 5s window but zero in 5m?** → You are looking at a transient lock storm.
- **Dominated by `CPU` across all windows?** → Sustained CPU saturation. Jump to `EXPLAIN`.
- **Many `IO:DataFileRead`?** → Throughput Failure. Check `shared_buffers` and table size.
> [!TIP]
> **Use the Dashboard**. While the CLI is fast, the **`pgwt` Web UI** is the best tool for this step. It allows you to drag across that 5-second spike and instantly see which specific query IDs were responsible for the AAS jump.
#### Step 2: Find the Blocker Tree
Instead of holding a Ph.D. in recursive SQL CTEs to interrogate `pg_locks`, simply use the tracer's built-in waiter→blocker inference.
```bash
# Drill into the specific blocking sessions
sudo ./pg_wait_tracer --view session_event --pgdata /var/lib/postgresql/data
```
```text
PID State Wait Event Blocker Wait (ms) Query ID
──── ────── ───────────────── ─────── ───────── ──────────
105 active Lock:transactionid → 99 12400 882345...
106 active Lock:transactionid → 105 11900 882345...
99 active CPU (Root Blocker) — 0 441298...
```
The output will automatically reconstruct the **lock chains**, showing exactly which PID is the root blocker at the head of the queue, and how many downstream sessions are stalled behind it.
Alternatively, if you are capturing offline traces via the daemon mode (`--daemon`), you can reconstruct the exact timeline retroactively using `pg_wait_tracer --replay` without needing to connect to the database at all!
#### Step 3: The Historical Aggregation (`pg_stat_statements`)
```sql
-- Which queries have consumed the most total time in the last reset window?
SELECT
left(query, 80) AS query,
calls,
round(total_exec_time::numeric / 1000, 2) AS total_seconds,
round(mean_exec_time::numeric, 2) AS avg_ms,
shared_blks_read AS physical_reads,
shared_blks_hit AS buffer_hits
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```
- **High `physical_reads` + high `total_seconds` → I/O bound. Investigate indexes and `shared_buffers`.**
- **High `calls` + low `avg_ms` → OLTP hotspot. Profile for lock contention.**
- **Low `calls` + extremely high `avg_ms` → OLAP monster. Profile with `EXPLAIN ANALYZE`.**
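If that query errors out with a missing relation, the extension is not enabled; it must be preloaded at server start and created once per database:
```sql
-- postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```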
#### Step 4: The Diagnostic Map
Use this table as your definitive clinical reference when reading the `pg_wait_tracer` results:
| Wait Class | Dominant Event | Diagnosis | Fix Reference |
| :--- | :--- | :--- | :--- |
| **CPU** | — | **Complexity Failure** | [[Manuscript/04 - Query Planning & Execution/4.0 - Query Planning & Operations|Query Optimization]] |
| **IO** | `DataFileRead` | **Throughput Failure** | [[Manuscript/06 - Resource Management & Processes/6.2 - Shared Buffers (The Warming Rack)|Tune Memory]] |
| **IO** | `WALWrite` | **Disk Latency** | [[Manuscript/07 - Wait Events & Concurrency/7.2 - WAL Mechanics (The Sequential Diary)|WAL Mechanics]] |
| **Lock** | `transactionid` | **Row Contention** | [[Manuscript/07 - Wait Events & Concurrency/7.5 - Heavyweight Locking (The Narrow Gate)|Heavyweight Locking]] |
| **Lock** | `relation` | **DDL Blocking** | [[Manuscript/07 - Wait Events & Concurrency/7.10 - Advanced Concurrency (The Stress Test)|Challenge 4]] |
| **LWLock** | `Buffer_content` | **Hot Page Latch** | [[Manuscript/07 - Wait Events & Concurrency/7.3 - Storage Mechanics (The Shipping Container)|Storage Mechanics]] |
| **LWLock** | `WALWrite` | **Commit Throughput** | [[Manuscript/07 - Wait Events & Concurrency/7.2 - WAL Mechanics (The Sequential Diary)|WAL Mechanics]] |
| **IPC** | `ExecuteGather` | **Parallel Skew** | [[Manuscript/07 - Wait Events & Concurrency/7.6 - Parallelism & Coordination|Parallelism & Coordination]] |
| **Client** | `ClientRead` | **Idle in Transaction** | [[Manuscript/07 - Wait Events & Concurrency/7.1.1 - Connectivity & Network (The Starvation)|Connectivity & Network]] |
#### Step 5: Terminate if Required
```sql
-- Cancel a query (graceful — allows cleanup)
SELECT pg_cancel_backend(<pid>);
-- Terminate a connection (forceful — use only if pg_cancel_backend fails)
SELECT pg_terminate_backend(<pid>);
```
> [!CAUTION]
> **`pg_terminate_backend` is a last resort.** It forces a hard disconnect: the backend process exits and any in-progress transaction is rolled back, which means lost work and retry pressure for the application. Cancel first; terminate only if the cancel has no effect after ~10 seconds.
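In a live incident you rarely know the victim PID by heart. A sketch that cancels every session stuck on a lock for more than 30 seconds (the threshold is an assumption; tune it to your own tolerance):
```sql
-- Bulk-cancel long lock waiters; review the returned pids afterwards
SELECT pid, pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND state = 'active'
  AND now() - query_start > interval '30 seconds';
```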
%%
---
| ← Previous | ↑ Table of Contents | Next → |
| :--- | :---: | ---: |
| [[Manuscript/07 - Wait Events & Concurrency/7.9 - Performance Audit (Wait Event Exercises)|7.9 Performance Audit]] | [[Manuscript/00 - Introduction/Index|Home]] | [[Manuscript/08 - Distributed Scaling & Clouds/8.0 - Distributed Storage (The Elephant in the Clouds)|8.0 Distributed Storage]] |
%%