# Chapter 7: Wait Events & Concurrency
## 7.0 - Why Slow Queries Lie (The Waiting Game)
<img src="assets/chap_6_modular_wait.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
Postgres is a finely tuned engine of movement, but most of its life is spent in a curious state of **Synchronized Waiting**.
Imagine a query that, according to **[[Operations/_Operations|EXPLAIN ANALYZE]]**, only took 1 second of actual processing time. Yet, the user's stopwatch reported a 5-second hang. Where did the other 4 seconds go?
The missing time was lost to the environment: waiting on disks, locks, and other sessions rather than executing. We call this the **Waiting Game**. To master the engine, you must stop looking solely at what it is *doing* and start measuring what it is *waiting for*.
```bash
# Where did the missing 4 seconds go?
sudo ./pg_wait_tracer --count 1 --interval 5 --view time_model --pgdata /var/lib/postgresql/data
```
```text
Wait Class     AAS  % DB Time   CPU%   Wait%
──────────  ──────  ─────────  ─────  ──────
IO            3.20      64.0%     0%    100%
CPU           1.00      20.0%   100%      0%
Lock          0.80      16.0%     0%    100%
```
> The missing 4 seconds: 64% I/O (waiting for disk), 16% Lock (waiting for another transaction). Only 20% was productive CPU. This is the "Waiting Game" made visible.
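If the tracer is not at hand, you can approximate the same wait-class breakdown by polling `pg_stat_activity` yourself. A minimal sketch using a single snapshot (repeat it on an interval to approach a true time model; labelling a `NULL` wait class as `CPU` is just a display convention, not a real catalog value):
```sql
-- Rough wait-class breakdown from one snapshot of pg_stat_activity.
-- A NULL wait_event_type means the backend is running on a CPU core.
SELECT
    COALESCE(wait_event_type, 'CPU') AS wait_class,
    count(*)                         AS active_sessions,
    round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct_of_active
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_type = 'client backend'
GROUP BY 1
ORDER BY active_sessions DESC;
```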
The rest of this chapter explores the **Anatomy of a Slowdown**. We will move layer by layer—from single queries to cluster-wide congestion—treating each failure not as a list of abstract metrics, but as a forensic investigation.
---
## 7.1 - The Diagnostic Views (Sweat vs. Sigh)
<img src="assets/arch_wait_events.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
To observe the engine's behavior, we must move beyond the "Service Receipt" of `EXPLAIN ANALYZE` and look at the live heartbeat of the system.
Before diving into specific events, you must internalize a fundamental binary for Postgres performance. At any microsecond, a process is in one of two states:
### 1. The Sweat (Active CPU)
The process is actively executing on the CPU. It is hashing, sorting, or joining. In `pg_stat_activity`, it shows no wait event at all (`wait_event` is `NULL`).
When a process is sweating, it is constrained by **logical complexity**. To fix this, you must optimize the algorithm, add an index, or simplify the query logic.
### 2. The Sigh (Sleeping/Waiting)
The process has requested a resource it does not yet have—a disk block, a network packet, or a row lock—and has been put to sleep by the kernel.
When a process is sighing, it is constrained by **environmental throughput or coordination**. To fix this, you must change the physical resources or the concurrency pattern.
Understanding wait events is simply the art of measuring exactly what caused Postgres to Sigh.
### The Diagnostic Trio
When a query is slow, we use three distinct lenses to triangulate the cause:
1. **Activity (`pg_stat_activity`)**: Is the process currently doing anything?
2. **Waiting (`wait_event`)**: If it's active but slow, what resource is it waiting for?
3. **Locking (`pg_locks`)**: If it's waiting for a lock, who exactly is holding it?
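All three lenses can be applied in a single snapshot. A minimal sketch (the column aliases are illustrative):
```sql
-- One snapshot through all three lenses: activity, wait event, and blocker.
SELECT
    a.pid,
    a.state,                                                  -- lens 1: is it doing anything?
    a.wait_event_type || ':' || a.wait_event AS waiting_on,   -- lens 2: what is it waiting for?
    pg_blocking_pids(a.pid)                  AS blocked_by,   -- lens 3: who holds the lock?
    left(a.query, 60)                        AS query
FROM pg_stat_activity AS a
WHERE a.state <> 'idle'
  AND a.pid <> pg_backend_pid();
```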
The rest of this chapter classifies the specific failure modes this framework reveals.
---
## 7.2 - The Investigative Workflow
<img src="assets/arch_kitchen_chaos_v2.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
When the database slows down, the first instinct is often to dive into advanced subsystems or guess at missing indexes. A systems investigator does not guess; they measure.
We measure by asking the engine four specific questions using four specific tools.
### 1. The Historical Ledger: `pg_stat_statements`
If you want to know which queries have consumed the most total resources since the statistics were last reset, you query `pg_stat_statements`. This extension acts as a persistent ledger of every query's performance footprint.
```sql
-- Which queries have consumed the most cumulative execution time?
SELECT
    query,
    calls,
    total_exec_time / 1000 AS total_seconds,
    mean_exec_time AS avg_ms,
    (shared_blks_hit + shared_blks_read) AS total_buffers
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```
If a query has a massive `total_seconds` but a tiny `avg_ms`, it is a high-frequency **OLTP (Diner)** query: a thousand papercuts. If it has a massive `avg_ms` but only one call, it is a heavy **OLAP (Soup Factory)** query: a single heavy boulder causing massive I/O.
### 2. The Physical Reality: `EXPLAIN ANALYZE`
A standard `EXPLAIN` shows the query planner's cost estimates. To see the physical reality of a workload—what the query actually did and how many disk blocks it hit—you must use `EXPLAIN (ANALYZE, BUFFERS)`.
```sql
-- The fundamental diagnostic command
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM ingredients WHERE category = 'Spice';
```
If `EXPLAIN ANALYZE` shows the query completed in 10ms, but the application logged a 5000ms delay, you know the query is waiting in the environment, not spinning on the CPU.
### 3. The Real-Time State: `pg_stat_activity`
If you want to see exactly which processes are currently stuck and why, you query the live tracking view. This is your primary tool for investigating live incidents.
```sql
-- Who is currently blocked, and what are they waiting for?
SELECT pid, usename, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle';
```
If you see **`wait_event_type: IO`**, the specific backend process is stalled at the storage layer. If you see **`wait_event_type: Lock`**, the process is blocked by another query holding a lock.
### 4. The Coordination Truth: `pg_locks`
When `pg_stat_activity` tells you a process is waiting on a lock, `pg_locks` tells you exactly who holds that lock.
```sql
-- A simplified check for granted vs ungranted locks
SELECT locktype, relation::regclass, mode, granted, pid
FROM pg_locks
WHERE relation = 'orders'::regclass;
```
---
### 🧪 Incident: The Stuck Frog (Lock Contention)
Let's put this workflow into practice.
**The Symptom**: "Simple `UPDATE` calls are taking 30+ seconds. The API is queuing up and timing out."
#### The Investigation
First, we query `pg_stat_activity` to find the blockage:
```sql
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```
**Result**:
```text
  pid  |  state  | wait_event_type |  wait_event   |         query
-------+---------+-----------------+---------------+-----------------------
  8551 | active  | Lock            | transactionid | UPDATE orders SET...
```
PID 8551 is waiting for a row lock held by another Transaction ID. Hardware is useless here; the query is blocked by another transaction.
#### Finding the Blocker
Next, we use lock tracking to find the root blocker (the "Stuck Frog") that is holding the lock but not doing anything.
```sql
-- Find the session that is blocking others
SELECT pid, state, query, xact_start
FROM pg_stat_activity
WHERE pid = ANY (pg_blocking_pids(8551));
```
#### Clearing the Jam
You identify that a developer left a `BEGIN` session open in their terminal without a `COMMIT`. By calling `SELECT pg_terminate_backend(blocker_pid);`, you clear the traffic jam and the API immediately resumes its flow.
> [!NOTE]
> **Recap**: `Lock:transactionid` means a transaction is holding a row you need. One forgotten session can stall an entire fleet of workers. The defensive fix is configuration: set a `lock_timeout` in production so blocked sessions fail fast instead of hanging indefinitely.
Now that we have the tools, we will look at the specific failure modes.
---
## 7.3 - CPU Saturation (Active Execution)
<img src="assets/arch_cpu_sweat.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
### 1. The Symptom
"Our nightly report is taking forever. The server's CPU is pinned at 100%, but `pg_stat_activity` shows that the query isn't waiting for any locks or disk I/O."
### 2. The Physical Reality
Sometimes nothing is blocked. The query is simply expensive. When a process is not waiting for a lock or the disk, it is actively executing on the CPU—hashing, sorting, evaluating expressions, or scanning memory.
This represents a bottleneck of **logical complexity**.
### 3. The Wait Signature
In `pg_stat_activity`, this state presents as a `NULL` wait event (the tracer reports it as the `CPU` class), meaning the process is currently scheduled and running on a processor core.
### 4. The Investigation
We capture a high-precision trace using `pg_wait_tracer` to determine if the backend is actively processing or waiting on a resource:
```bash
# Capture the CPU saturation with a 5-second time model
sudo ./pg_wait_tracer --count 1 --interval 5 --view time_model
```
```text
Wait Class     AAS  % DB Time   CPU%   Wait%
──────────  ──────  ─────────  ─────  ──────
CPU           1.00      99.1%   100%      0%
Client        0.02       0.9%     0%    100%
```
The tracer confirms it: for the dominant `CPU` class, `Wait%` is 0%. The engine is spending virtually all of its time computing.
### 5. The Root Cause
We look at the specific query:
```sql
-- Forcing CPU to calculate math on 1M rows
SELECT count(*) FROM orders
WHERE EXTRACT(year FROM order_time) = 2024;
```
This is a **Complexity Failure**. Because the `EXTRACT` function is applied to the column, the engine cannot use a B-Tree index to perform a fast lookup. Instead, it must perform a Sequential Scan, loading every single row into memory and executing the `EXTRACT` mathematical function on the CPU for every row.
### 6. The Strategic Fix
Unwrap the function to allow the engine to use its high-speed indexing machinery (a Sargable predicate).
```sql
-- Fast: Logical range check allows index usage
SELECT count(*) FROM orders
WHERE order_time >= '2024-01-01' AND order_time < '2025-01-01';
```
By removing the computational burden from the CPU, you allow the engine to find the rows instantly via the index. The query time drops from seconds to milliseconds.
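This assumes a plain B-Tree index on `order_time`; if one does not exist yet, it must be created first. A sketch (index names are illustrative, and the expression-index alternative only works if `order_time` is a plain `timestamp`, since `EXTRACT` on `timestamptz` is not immutable):
```sql
-- Assumed: a B-Tree index on the timestamp column, so the range predicate
-- above can be answered by an index scan instead of a Sequential Scan.
CREATE INDEX IF NOT EXISTS idx_orders_order_time ON orders (order_time);

-- Alternative if the EXTRACT form must stay in the query: an expression
-- index that matches the function call exactly (plain timestamp only).
CREATE INDEX IF NOT EXISTS idx_orders_order_year
    ON orders ((EXTRACT(year FROM order_time)));
```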
---
### Diagnostic Reference — Evidence of Active Execution
When your OS monitoring shows 100% CPU utilization, you must verify what kind of active execution is occurring.
#### Pure Computation
These states signify that the backend is currently "on the CPU," actively burning cycles to transform raw Pages into query results. High CPU utilization without accompanying wait events indicates a purely computational workload.
> [!info] Reference: Active Execution
> ![[Workloads/CPU/CPU]]
> ![[Workloads/CPU/TupleProcessing]]
> ![[Workloads/CPU/ExpressionEvaluation]]
#### Internal Algorithms
Wait points during the execution of complex internal algorithms. These represent specialized internal memory operations like sorting, hashing, or memory context management.
> [!info] Reference: Internal Algorithms
> ![[Workloads/CPU/MemoryAlgorithms]]
#### Micro-Friction: Spinlocks
The highest-speed, lowest-level locking mechanism. A **Spinlock** forces a process to loop actively on a CPU core while waiting for a tiny shared-memory flag to change state. If a process is spinning, it is burning 100% CPU without performing productive work.
> [!info] Reference: Spinlocks
> ![[Workloads/CPU/Spinning]]
Spinlocks masquerade as pure computation. Your OS shows 100% CPU, but the tracer reveals that the engine is stuck in a tight loop rather than processing data:
```bash
# What a spinlock storm looks like — 100% CPU but no productive work
sudo ./pg_wait_tracer --view system_event --interval 5 --count 1
```
```text
Wait Class  Wait Event     AAS   % DB Time
──────────  ──────────  ──────  ──────────
CPU         Spinning     14.20       89.0%
CPU         —             1.80       11.0%
```
The fix for a Spinlock storm is **not** query optimization, but rather reducing concurrency via a connection pooler.
---
## 7.4 - Storage Latency (Physical IO Stalls)
<img src="assets/arch_io_friction_beaver.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
### 1. The Symptom
"Queries are taking seconds instead of milliseconds. `top` shows CPU usage is low, but database response times are consistently sluggish."
### 2. The Physical Reality
When Postgres needs a page of data that is not currently in memory (`shared_buffers`), it must issue a read request to the operating system. This is a physical stall. The CPU must stop and wait for the storage controller, the disk platter, or the SSD NAND flash to retrieve the 8KB block.
We call this **The Friction of Retrieval**. If your working set exceeds your RAM, your throughput collapses.
### 3. The Wait Signature
In `pg_stat_activity`, this state presents as an `IO` wait class. The most common event is **`DataFileRead`**, which means the backend is waiting for a block of user data. Another variant is **`BufFileWrite`** or **`BufFileRead`**, indicating the engine has run out of `work_mem` and is "spilling" a sort or hash operation to temporary disk files.
### 4. The Investigation
We trace the I/O throughput to determine if the report is bound by physical reads or memory spills:
```bash
# Capture the IO saturation with a 5-second trace
sudo ./pg_wait_tracer --view system_event --interval 5 --count 1
```
```text
Wait Event          AAS   % DB Time
───────────────  ──────  ──────────
DataFileRead       8.45       84.0%   -- ⚠️ Working set > RAM
DataFileExtend     0.12        1.2%   -- (Normal expansion)
```
The tracer quantifies the stall: 84% of database time is spent waiting on `DataFileRead`.
### 5. The Root Cause
We look at the specific query, a large `SELECT` on `supply_deliveries`:
```sql
SELECT sum(quantity_kg) FROM supply_deliveries;
```
Because there is no index (or the query must visit the whole table), Postgres performs a Sequential Scan. It is a rapid sequence of "Sighs" (I/O) and "Sweats" (CPU). For every 1 millisecond of productive work (`CPU`), the engine spends 8 milliseconds waiting on the disk controller (`DataFileRead`).
### 6. The Strategic Fix
Hardware fixes (faster NVMe drives) are expensive. The architectural fix is to either increase `shared_buffers` to keep more pages in RAM, or use an index to dramatically reduce the number of blocks that need to be read from disk.
If the wait event is `BufFileWrite` (a temporary file spill during a sort), the fix is to increase `work_mem` for the session.
```sql
-- Giving the engine more room to work in memory to avoid BufFileWrite
SET work_mem = '64MB';
SELECT count(*) FROM ingredients GROUP BY category;
```
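To decide between these fixes, check how often reads are already served from memory and how much work is spilling to temporary files. A sketch against `pg_stat_database` (cumulative counters since the last stats reset):
```sql
-- Cache hit ratio and temp-file spill volume for the current database.
-- A low hit ratio points at shared_buffers / missing indexes;
-- large temp_bytes points at work_mem.
SELECT
    datname,
    round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 1) AS cache_hit_pct,
    temp_files,
    pg_size_pretty(temp_bytes) AS temp_spilled
FROM pg_stat_database
WHERE datname = current_database();
```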
---
### Diagnostic Reference — Evidence of I/O Latency
When queries are slow but CPU is low, the storage subsystem is usually the bottleneck. The specific `IO` wait event tells you exactly what kind of file Postgres is waiting for.
#### Data File Access
These events represent the engine reaching out to the filesystem to retrieve or persist actual table and index pages. A high `DataFileRead` AAS indicates your RAM is too small. A high `DataFileExtend` AAS indicates your disk allocation rate is a bottleneck.
> [!info] Reference: Data File I/O
> ![[Workloads/IO/DataFile/DataFileRead]]
> ![[Workloads/IO/DataFile/DataFileWrite]]
> ![[Workloads/IO/DataFile/DataFileExtend]]
> ![[Workloads/IO/DataFile/DataFileTruncate]]
> ![[Workloads/IO/DataFile/DataFileFlush]]
> ![[Workloads/IO/DataFile/DataFileSync]]
> ![[Workloads/IO/DataFile/DataFileImmediateSync]]
> ![[Workloads/IO/DataFile/DataFilePrefetch]]
> ![[Workloads/LWLock/Tablespace/TablespaceCreate]]
#### Temporary Files (Spills)
When a query requires more memory than `work_mem` allows, Postgres creates temporary files on disk. This is significantly slower than in-memory processing.
> [!info] Reference: Buffer Files
> ![[Workloads/IO/BufFile/BufFileRead]]
> ![[Workloads/IO/BufFile/BufFileWrite]]
> ![[Workloads/IO/BufFile/BufFileTruncate]]
#### The Cluster Ledger (Control & Lock Files)
The **Control File** is a fixed 8KB structure that stores the absolute state of the cluster. **Lock Files** ensure process-level isolation.
> [!info] Reference: Cluster Metadata I/O
> ![[Workloads/IO/ControlFile/ControlFileRead]]
> ![[Workloads/IO/ControlFile/ControlFileSync]]
> ![[Workloads/IO/ControlFile/ControlFileSyncUpdate]]
> ![[Workloads/IO/ControlFile/ControlFileWrite]]
> ![[Workloads/IO/ControlFile/ControlFileWriteUpdate]]
> ![[Workloads/IO/LockFile/LockFileCreateRead]]
> ![[Workloads/IO/LockFile/LockFileCreateSync]]
> ![[Workloads/IO/LockFile/LockFileCreateWrite]]
> ![[Workloads/IO/LockFile/LockFileReCheckDataDirRead]]
> ![[Workloads/IO/LockFile/LockFileAddToDataDirRead]]
> ![[Workloads/IO/LockFile/LockFileAddToDataDirWrite]]
> ![[Workloads/IO/LockFile/LockFileAddToDataDirSync]]
#### Transaction State (SLRU & MultiXact)
Simple Least-Recently-Used (SLRU) buffers track the commit status of every transaction. High waits here (`SLRURead`) indicate that your transaction throughput has outpaced the SLRU memory cache.
> [!info] Reference: Transactional State I/O
> ![[Workloads/IO/SLRU/SLRURead]]
> ![[Workloads/IO/SLRU/SLRUWrite]]
> ![[Workloads/IO/SLRU/SLRUSync]]
> ![[Workloads/IO/SLRU/SLRUFlushSync]]
> ![[Workloads/LWLock/MultiXact/MultiXactMemberBuffer]]
> ![[Workloads/LWLock/MultiXact/MultiXactMemberSLRU]]
> ![[Workloads/LWLock/MultiXact/MultiXactOffsetBuffer]]
> ![[Workloads/LWLock/MultiXact/MultiXactOffsetSLRU]]
#### Maintenance & Data Movement
Operations that handle large-scale data transfer, such as physical backups, bulk loading, and cluster upgrades.
> [!info] Reference: Maintenance I/O
> ![[Workloads/IO/BaseBackup/BaseBackupRead]]
> ![[Workloads/IO/BaseBackup/BaseBackupSync]]
> ![[Workloads/IO/BaseBackup/BaseBackupWrite]]
> ![[Workloads/IO/CopyFile/CopyFileCopy]]
> ![[Workloads/IO/CopyFile/CopyFileRead]]
> ![[Workloads/IO/CopyFile/CopyFileWrite]]
> ![[Workloads/IO/VersionFile/VersionFileSync]]
> ![[Workloads/IO/VersionFile/VersionFileWrite]]
#### Logical Decoding & Replication State
If you use CDC tools, the engine must reconstruct transactions from WAL before shipping them. `ReorderBufferWrite` spikes when large transactions exceed `logical_decoding_work_mem`.
> [!info] Reference: Logical Decoding I/O
> ![[Workloads/IO/LogicalRewrite/LogicalRewriteCheckpointSync]]
> ![[Workloads/IO/LogicalRewrite/LogicalRewriteMappingSync]]
> ![[Workloads/IO/LogicalRewrite/LogicalRewriteMappingWrite]]
> ![[Workloads/IO/LogicalRewrite/LogicalRewriteSync]]
> ![[Workloads/IO/LogicalRewrite/LogicalRewriteTruncate]]
> ![[Workloads/IO/LogicalRewrite/LogicalRewriteWrite]]
> ![[Workloads/IO/ReorderBuffer/ReorderLogicalMappingRead]]
#### The Identity Map (Relation Mapping)
> [!info] Reference: Relation Mapping I/O
> ![[Workloads/IO/RelationMap/RelationMapRead]]
> ![[Workloads/IO/RelationMap/RelationMapReplace]]
> ![[Workloads/IO/RelationMap/RelationMapWrite]]
#### Atomic Distributed Commit & AIO
> [!info] Reference: Two-Phase Commit I/O
> ![[Workloads/IO/TwophaseFile/TwophaseFileRead]]
> ![[Workloads/IO/TwophaseFile/TwophaseFileSync]]
> ![[Workloads/IO/TwophaseFile/TwophaseFileWrite]]
> ![[Workloads/IO/AIO/AioIoUringSubmit]]
> ![[Workloads/IO/AIO/AioIoCompletion]]
> ![[Workloads/IO/AIO/AioIoUringExecution]]
---
## 7.5 - WAL Pressure (Durability Bottlenecks)
<img src="assets/arch_pocket_diary.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
### 1. The Symptom
"Our bulk update is crawling. `pg_stat_activity` shows dozens of sessions, but they're all just 'active' and taking 200ms per simple update."
### 2. The Physical Reality
While most read waits are caused by missing pages in RAM, write waits are almost always tied to Postgres's strict durability guarantees. Before a transaction can commit, the backend must ensure its **[[Structures/WAL|Write-Ahead Log (WAL)]]** record has been flushed to physical storage.
Because the WAL is a single, sequential stream, it presents two distinct bottlenecks: **I/O Latency** (how fast we can write to disk) and **Lock Contention** (how fast we can coordinate which process gets to write next).
### 3. The Wait Signature
In `pg_stat_activity`, this state presents as a combination of `LWLock` and `IO` events:
- **`LWLock:WALWrite`**: The process is queued, waiting for another process to finish flushing the shared WAL buffer to disk.
- **`IO:WALSync`**: The process is actively waiting for the hardware to confirm the write and flush its internal caches.
- **`LWLock:WALInsert`**: In high-concurrency workloads, backends may fight for this lock just to reserve their byte offset in the sequential stream.
### 4. The Investigation
We run the tracer while the update script is active to see why the engine is stalled:
```bash
# Quantify the write overhead
sudo ./pg_wait_tracer --view system_event --interval 5 --count 1
```
```text
Wait Class  Wait Event     AAS   % DB Time
──────────  ──────────  ──────  ──────────
LWLock      WALWrite      8.42       80.0%
IO          WALSync       1.05       10.0%
CPU         —             1.00        9.5%
CPU — 1.00 9.5%
```
The tracer reveals that **80% of cluster time** is spent in `WALWrite`.
### 5. The Root Cause
A bulk update script is running hundreds of independent UPDATE statements without grouping them into a single transaction:
```sql
-- Naive: Updating 100 rows, one commit at a time
UPDATE orders SET status = 'Pending' WHERE id = 1;
UPDATE orders SET status = 'Pending' WHERE id = 2;
-- ... repeat 98 more times ...
```
Because each UPDATE is its own transaction, the engine must flush the WAL buffer to disk for every single row. A single 100-row update generates ~830KB of WAL records, but forcing 100 separate physical flushes turns a bandwidth problem into a latency problem.
### 6. The Strategic Fix
Wrap the updates into a single atomic transaction. This allows the engine to buffer all changes and perform a single, efficient flush at the end.
```sql
-- Fast: Atomic batching
BEGIN;
UPDATE orders SET status = 'Pending' WHERE id <= 100;
COMMIT;
```
By moving from 100 flushes to 1 flush, the `WALWrite` wait events disappear, and the database bottleneck shifts from disk latency to pure CPU throughput.
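To verify the improvement, you can measure how much WAL a batch actually produces by diffing the WAL position before and after it. A sketch that assumes you are running in `psql` (the `\gset` meta-command stores the result in a variable):
```sql
-- Capture the current WAL position into a psql variable.
SELECT pg_current_wal_lsn() AS before_lsn \gset

BEGIN;
UPDATE orders SET status = 'Pending' WHERE id <= 100;
COMMIT;

-- How many bytes of WAL did the batch generate?
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), :'before_lsn')
       ) AS wal_generated;
```
On Postgres 14 and later, the `pg_stat_wal` view offers the same accounting (records, bytes, syncs) at the cluster level.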
---
### Diagnostic Reference — Evidence of WAL Durability
When your system is write-bound, the following wait events help you identify exactly where the sequential write stream is bottlenecked.
#### Physical WAL Durability & Throughput
Transaction commits are gated by physical persistence. These events represent the engine handing data to the disk controller and waiting for a hardware acknowledgment.
> [!info] Reference: Physical WAL Durability
> ![[Workloads/IO/WAL/WALWrite]]
> ![[Workloads/IO/WAL/WALSync]]
> ![[Workloads/IO/WAL/WALSyncMethodAssign]]
#### High-Concurrency Hotspots (LWLocks)
Before the physical disk is even accessed, backends must acquire memory-level locks to reserve their position in the sequential WAL stream.
> [!info] Reference: WAL LWLocks
> ![[Workloads/LWLock/WAL/WALInsert]]
> ![[Workloads/LWLock/WAL/WALWrite]]
> ![[Workloads/LWLock/WAL/XLogBuffer]]
> ![[Workloads/LWLock/WAL/WALBufMapping]]
> ![[Workloads/LWLock/WAL/WALSummarizer]]
`WALInsert` contention means too many backends are fighting to *reserve their slot* in the sequential stream simultaneously — the fix is to reduce write concurrency (batch transactions, connection pooling). `WALWrite` contention means backends are queued to *flush* the buffer — the fix is faster disk or a dedicated WAL volume.
#### Background Maintenance & Initialization
Postgres pre-allocates physical 16MB WAL files on disk (`WALInitWrite`) to avoid latency spikes. If you see waits here, the system is burning through WAL segments faster than it can pre-allocate them.
> [!info] Reference: WAL Provisioning
> ![[Workloads/IO/WAL/WALInitSync]]
> ![[Workloads/IO/WAL/WALInitWrite]]
> ![[Workloads/IO/WAL/WALCopyRead]]
> ![[Workloads/IO/WAL/WALCopySync]]
> ![[Workloads/IO/WAL/WALCopyWrite]]
> ![[Workloads/IO/WAL/WALBootstrapSync]]
> ![[Workloads/IO/WAL/WALBootstrapWrite]]
> [!info] Reference: Background Synchronization
> ![[Workloads/Activity/Replication/WalWriterMain]]
> ![[Workloads/Activity/Maintenance/CheckpointerMain]]
> ![[Workloads/IPC/Checkpoint/CheckpointStart]]
> ![[Workloads/IPC/Checkpoint/CheckpointDone]]
#### WAL Reading & Historical Access
Under normal conditions, the WAL is a write-only stream. However, during recovery, standby synchronization, or incremental backups, the engine must read the log to reconstruct state or generate change summaries.
> [!info] Reference: WAL Reading & Backups
> ![[Workloads/IO/WAL/WALRead]]
> ![[Workloads/IO/WAL/WalSummaryRead]]
> ![[Workloads/IO/WAL/WalSummaryWrite]]
> ![[Workloads/IO/WAL/WALSenderTimelineHistoryRead]]
---
## 7.6 - Lock Contention (Concurrency & Blocking)
<img src="assets/arch_heavyweight_locks.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
### 1. The Symptom
"I'm just trying to add a single column to the `ingredients` table. It's a metadata change, but suddenly the entire application has stopped being able to read from that table!"
### 2. The Physical Reality
While LWLocks protect shared-memory structures, **Heavyweight Locks** ensure the logical consistency of data. They coordinate transactional access to tables, rows, and other database objects. When a transaction holds a lock on a resource, concurrent processes requiring a conflicting lock mode must wait until the holder completes.
Unlike LWLocks, which are typically held for microseconds, heavyweight locks are **transactional**. They are held for the entire duration of a transaction, from the first acquisition until `COMMIT` or `ROLLBACK`. The problem is that while one transaction holds an exclusive lock, all other transactions queue up behind it, creating a traffic jam.
### 3. The Wait Signature
In `pg_stat_activity`, this state presents as a `Lock` wait class:
- **`Lock:relation`**: The process is waiting for a table-level lock (often caused by DDL or `VACUUM FULL`).
- **`Lock:transactionid`**: The process is waiting for a row-level lock held by another transaction (often caused by concurrent `UPDATE`s).
### 4. The Investigation
We observe the blocker tree and queue depth using the high-precision tracer:
```bash
# Observe the DDL lock queue
sudo ./pg_wait_tracer --view active --count 1
```
```text
PID     State     Wait Event      Wait (ms)   DB Time (ms)
──────  ────────  ─────────────  ──────────  ─────────────
10555   waiting   Lock:relation      9200.0         9800.2
10501   waiting   Lock:relation      8900.0         9500.5
10502   waiting   Lock:relation      8400.0         9100.1
10500   on cpu    —                       —        12000.0
```
### 5. The Root Cause
The tracer reveals that PID 10500 (a long-running `SELECT`) is the **Root Blocker**. The `ALTER TABLE` command requires an **AccessExclusiveLock**, so it must wait for all existing readers to finish. While it waits in the queue, it blocks *all new* readers behind it.
You have accidentally closed the Narrow Gate for the entire application by running a DDL migration on a busy table without setting a timeout.
```mermaid
graph TD
Resource[(Table: ingredients)]
subgraph "The Lock Queue (FIFO)"
A[PID 10500: SELECT] -->|GRANTED| B{AccessShareLock}
C[PID 10555: ALTER] -.->|"WAITING: #1"| D{AccessExclusiveLock}
E[PID 10501: SELECT] -.->|"WAITING: #2"| F{AccessShareLock}
G[PID 10502: SELECT] -.->|"WAITING: #3"| H{AccessShareLock}
end
B --- Resource
D --- Resource
F -.->|Blocked by 10555| D
H -.->|Blocked by 10555| D
style A fill:#e8f5e9,stroke:#2e7d32
style C fill:#ffebee,stroke:#c62828
style E fill:#fff3e0,stroke:#e65100
style G fill:#fff3e0,stroke:#e65100
```
### 6. The Strategic Fix
Never run DDL on a busy table without a **`lock_timeout`**. This ensures the DDL script fails gracefully rather than bringing down the app.
```sql
-- Setting a 1-second patience limit
SET lock_timeout = '1s';
ALTER TABLE ingredients ADD COLUMN scent_intensity INT;
```
The DDL command will now either acquire its lock promptly or give up after one second and let the traffic continue. You can then retry the migration during a lower-traffic window.
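A useful companion to the timeout is a pre-flight check: before launching the migration, confirm nothing is already sitting on the table with an old transaction. A sketch:
```sql
-- Who currently holds any lock on the target table, and how old is
-- their transaction? Long-lived holders will block the ALTER TABLE.
SELECT a.pid,
       a.state,
       now() - a.xact_start AS xact_age,
       left(a.query, 60)    AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'ingredients'::regclass
  AND l.granted
ORDER BY a.xact_start;
```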
---
### Diagnostic Reference — Evidence of Logical Blocking
When queries are stuck waiting on `Lock` events, the bottleneck is logical coordination, not hardware.
#### Tables & Rows
Heavyweight locks restrict access based on conflicting modes. If Process A attempts to `UPDATE` a row that Process B has modified but not yet committed, Process A will block on **`Lock:transactionid`** until Process B's transaction resolves.
> [!info] Reference: Tables & Rows
> ![[Workloads/Lock/relation]]
> ![[Workloads/Lock/transactionid]]
> ![[Workloads/Lock/tuple]]
#### Database Object Locks
Beyond tables and rows, Postgres must occasionally lock internal physical structures. For example, when a table fills up, a backend must ask the OS to allocate new 8KB blocks. To prevent two backends from concurrently trying to extend the same file, the engine acquires a `Lock:extend`.
> [!info] Reference: Database Objects
> ![[Workloads/Lock/page]]
> ![[Workloads/Lock/extend]]
> ![[Workloads/Lock/object]]
During concurrent bulk loads, `Lock:extend` can become a significant bottleneck. One fix for high-throughput bulk loads is to partition the table so that parallel loaders write to different physical files.
#### Advisory Locks
Postgres allows developers to define their own logical locks using `pg_advisory_lock()`. Think of it as an application reserving an arbitrary name that does not correspond to a physical database object, but is still enforced globally by the engine's Lock Manager.
> [!info] Reference: Advisory Locks
> ![[Workloads/Lock/advisory]]
> ![[Workloads/Lock/userlock]]
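A minimal sketch of the pattern, using the non-blocking variant so a second worker skips the job instead of queuing (the key `42` is an arbitrary, application-chosen number):
```sql
-- Application-level mutual exclusion: only one worker runs the nightly job.
SELECT pg_try_advisory_lock(42) AS got_lock;  -- false means someone else holds it

-- ... run the exclusive work only if got_lock is true ...

SELECT pg_advisory_unlock(42);                -- release when finished
```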
#### Virtual Transaction Identifiers
The most critical event in this group is `Lock:frozenid` — the diagnostic signal for **Anti-Wraparound Vacuum**. An `autovacuum worker` blocked on `Lock:frozenid` means a long-running user query is preventing the engine from advancing its internal transaction horizons. You must terminate the blocker to prevent eventual database lockdown.
> [!info] Reference: Virtual Transaction Identifiers
> ![[Workloads/Lock/virtualxid]]
#### Lock Manager Mechanics
The **Lock Manager** is the shared memory subsystem that tracks heavyweight locks. High table counts can exhaust the **Fast Path** local array and force backends to bottleneck on the global manager partitions.
> [!info] Reference: Lock Manager
> ![[Workloads/LWLock/LockManager/LockManager]]
> ![[Workloads/LWLock/LockManager/LockFastPath]]
---
## 7.7 - Cluster Pressure (Background Workers & Replication)
<img src="assets/arch_background_workers.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
### 1. The Symptom
"Our read-replica is falling behind by over 30 seconds. The application is reading stale data. Is the network dropping packets or is the replica disk too slow?"
### 2. The Physical Reality
In a distributed environment, the primary server must coordinate with standby replicas to ensure data is mirrored correctly. This introduces **The Distributed Sigh**—the delay of information traveling across the network to find a safe harbor on a remote disk.
Even when no user is connected, the engine is not truly silent. It has an **Internal Rhythm**—a heartbeat of maintenance workers (Autovacuum, Checkpointer) that must coordinate with active queries. When these workers wait, or when replication falls behind, the entire cluster experiences pressure.
### 3. The Wait Signature
If the replication lag is a network issue, you will see `Client:WalSenderWriteData` on the primary. If it is a disk issue on the standby, you will see `IO` waits on the standby's own processes: the `walreceiver` stalling on WAL flushes (`WALWrite`/`WALSync`) and the `startup` process stalling on `DataFileRead` while it replays changes.
### 4. The Investigation
You need to determine if this is a *sending* bottleneck or an *applying* bottleneck. Check the WAL sender's state on the **primary** server using the high-precision tracer:
```bash
# On the PRIMARY: Check the WAL sender's time model
sudo ./pg_wait_tracer --view active --count 1
```
```text
PID     State    Wait Event                  Wait (ms)   Backend Type
──────  ───────  ─────────────────────────  ──────────  ─────────────
1234    active   Client:WalSenderWriteData       450.2   walsender
5678    on cpu   —                                   —   client
```
If the WAL sender is stuck in `WalSenderWriteData`, the **network** is the bottleneck—the primary is trying to push data but the pipe is full. If the WAL sender shows `WalSenderMain` (idle), the primary is healthy and the lag is purely an **applying** problem on the standby.
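To cross-check the tracer, `pg_stat_replication` on the primary quantifies the lag in bytes. If the send lag is small but the replay lag is large, the primary is keeping up and the standby is the one falling behind. A sketch:
```sql
-- On the PRIMARY: how far behind is each standby, in bytes of WAL?
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```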
### 5. The Root Cause
Confirm the applying bottleneck on the **standby** server:
```bash
# On the STANDBY: Check the replay worker's throughput
sudo ./pg_wait_tracer --view active --count 1
```
```text
PID     State    Wait Event         Wait (ms)   Backend Type
──────  ───────  ────────────────  ──────────  ─────────────
1050    active   IO:WALSync            8200.0   walreceiver
1051    active   IO:DataFileRead       7500.0   startup
```
The `walreceiver` is waiting on `WALSync` rather than idling in `WalReceiverMain`, so it is not waiting for the network; it is waiting for its own disk to persist the incoming WAL. The `startup` process (which applies the WAL) is stalled on `DataFileRead` while replaying changes.
This is an **Applying Bottleneck**. The primary is sending data just fine, but the standby's storage is too slow to replay the changes at the rate they are arriving.
### 6. The Strategic Fix
You cannot fix this by tweaking network settings. You must either upgrade the IOPS of the replica's storage volume or investigate if the primary is running massive, unoptimized bulk updates that cause a "WAL explosion."
---
### Diagnostic Reference — Evidence of Cluster Pressure
When the cluster itself is under pressure, you will see wait events in the background workers and replication processes rather than in user queries.
#### Background System Activity
These events represent the "internal life" of the database—the background workers responsible for maintenance, logging, and durability. High waits here typically indicate that background processes are competing with user backends for shared resources.
> [!info] Reference: Maintenance Workers
> ![[Workloads/Activity/Maintenance/AutoVacuumMain]]
> ![[Workloads/Activity/Maintenance/CheckpointerMain]]
> ![[Workloads/Activity/Logical/LogicalApplyMain]]
> ![[Workloads/Activity/Logical/LogicalLauncherMain]]
> ![[Workloads/Activity/Maintenance/ArchiverMain]]
> ![[Workloads/Activity/Maintenance/BgWriterMain]]
> ![[Workloads/Activity/Maintenance/SysLoggerMain]]
When the background rhythm breaks, the tracer reveals the starvation. For example, if an `autovacuum worker` is blocked on `BufferPin` for 3 minutes, a user query is pinning a page that vacuum needs to clean.
#### Replication Coordination (IPC)
The "Distributed Sigh" of a cluster. These Inter-Process Communication (IPC) signals measure the efficiency of data transfer and synchronization between nodes.
##### 1. Streaming Replication Control
Signals related to the core physical replication protocol and the synchronization of WAL records to standby replicas.
> [!info] Reference: Streaming Replication
> ![[Workloads/Activity/Replication/WalSenderMain]]
> ![[Workloads/LWLock/Replication/SyncRep]]
> ![[Workloads/Activity/Replication/WalReceiverMain]]
> ![[Workloads/IPC/LogicalRep/WalReceiverExit]]
> ![[Workloads/IPC/LogicalRep/WalReceiverWaitStart]]
##### 2. Logical Replication & Worker State
Coordination between the primary server and the specialized workers that handle logical decoding, table synchronization, and parallel apply operations.
> [!info] Reference: Logical Coordination
> ![[Workloads/IPC/LogicalRep/LogicalApplySendData]]
> ![[Workloads/IPC/LogicalRep/LogicalParallelApplyStateChange]]
> ![[Workloads/IPC/LogicalRep/LogicalSyncData]]
> ![[Workloads/IPC/LogicalRep/LogicalSyncStateChange]]
> ![[Workloads/Activity/Logical/LogicalParallelApplyMain]]
> ![[Workloads/LWLock/Parallel/LogicalRepLauncherDSA]]
> ![[Workloads/LWLock/Parallel/LogicalRepLauncherHash]]
##### 3. Slot & Origin Management
The persistence of replication progress. **Slots** ensure that the primary doesn't delete WAL files before replicas have seen them.
> [!info] Reference: Slot & State Management
> ![[Workloads/IPC/LogicalRep/ReplicationOriginDrop]]
> ![[Workloads/IPC/LogicalRep/ReplicationSlotDrop]]
> ![[Workloads/IO/Replication/ReplicationSlotRead]]
> ![[Workloads/IO/Replication/ReplicationSlotRestoreSync]]
> ![[Workloads/IO/Replication/ReplicationSlotSync]]
> ![[Workloads/IO/Replication/ReplicationSlotWrite]]
> ![[Workloads/IO/Replication/SnapbuildRead]]
> ![[Workloads/IO/Replication/SnapbuildSync]]
> ![[Workloads/IO/Replication/SnapbuildWrite]]
##### 4. Replication History & Timelines
Signals related to the complex "history" of a cluster, specifically when a replica must catch up or reconcile its branch of the transaction log with the primary (e.g., during `pg_rewind`).
> [!info] Reference: Timelines & History
> ![[Workloads/IO/Replication/TimelineHistoryRead]]
> ![[Workloads/IO/Replication/TimelineHistoryWrite]]
> ![[Workloads/IO/Replication/TimelineHistoryFileWrite]]
> ![[Workloads/IO/Replication/TimelineHistorySync]]
> ![[Workloads/IO/Replication/TimelineHistoryFileSync]]
> ![[Workloads/IO/WAL/WALSenderTimelineHistoryRead]]
#### Distribution & Recovery Coordination
Signals related to the complex state transitions of cluster recovery, node promotion, and the persistence of "Prepared Transactions" for Two-Phase Commit (2PC).
> [!info] Reference: Recovery & Distribution
> ![[Workloads/IPC/Recovery/ArchiveCleanupCommand]]
> ![[Workloads/IPC/Recovery/ArchiveCommand]]
> ![[Workloads/IPC/Recovery/BackupWaitWalArchive]]
> ![[Workloads/IPC/Recovery/Promote]]
> ![[Workloads/IPC/Recovery/RecoveryConflictSnapshot]]
> ![[Workloads/IPC/Recovery/RecoveryConflictTablespace]]
> ![[Workloads/IPC/Recovery/RecoveryEndCommand]]
> ![[Workloads/IPC/Recovery/RecoveryPause]]
> ![[Workloads/IPC/Recovery/RestoreCommand]]
> ![[Workloads/LWLock/Xact/TwoPhaseState]]
> ![[Workloads/LWLock/Xact/XactBuffer]]
> ![[Workloads/LWLock/Xact/XactSLRU]]
> ![[Workloads/LWLock/Xact/XactTruncation]]
> ![[Workloads/LWLock/Replication/ReplicationOrigin]]
> ![[Workloads/LWLock/Replication/ReplicationOriginState]]
> ![[Workloads/LWLock/Replication/ReplicationSlotAllocation]]
> ![[Workloads/LWLock/Replication/ReplicationSlotControl]]
> ![[Workloads/LWLock/Replication/ReplicationSlotIO]]
> ![[Workloads/Activity/Replication/WalSummarizerWal]]
> ![[Workloads/Activity/Replication/RecoveryWalStream]]
> ![[Workloads/Activity/Replication/ReplicationSlotsyncMain]]
> ![[Workloads/Activity/Replication/ReplicationSlotsyncShutdown]]
> ![[Workloads/Timeout/WalSummarizerError]]
> ![[Workloads/Timeout/RecoveryRetrieveRetryInterval]]
> ![[Workloads/LWLock/ControlFile/ControlFile]]
> ![[Workloads/IPC/System/WalSummaryReady]]
#### Client Protocol & Replication Handshakes
Wait points occurring at the edge of the engine, where Postgres is negotiating secure connections or waiting for a replica to acknowledge data receipt.
> [!info] Reference: Client Protocol
> ![[Workloads/Client/WalSenderWriteData]]
> ![[Workloads/Client/WaitForStandbyConfirmation]]
> ![[Workloads/Client/SSLOpenServer]]
> ![[Workloads/Client/LibPQWalReceiverConnect]]
> ![[Workloads/Client/LibPQWalReceiverReceive]]
> ![[Workloads/Client/GSSOpenServer]]
> ![[Workloads/Client/WalSenderWaitForWAL]]
#### Maintenance & System Health
Wait points related to the internal "metabolism" of the database. These events coordinate the background workers that keep the cluster healthy without direct user intervention.
> [!info] Reference: System Maintenance
> ![[Workloads/LWLock/Autovacuum/Autovacuum]]
> ![[Workloads/LWLock/Autovacuum/AutovacuumSchedule]]
> ![[Workloads/LWLock/Autovacuum/WrapLimitsVacuum]]
> ![[Workloads/LWLock/Checkpointer/CheckpointerComm]]
> ![[Workloads/Activity/Maintenance/CheckpointerShutdown]]
> ![[Workloads/IPC/Checkpoint/CheckpointDelayStart]]
> ![[Workloads/IPC/Checkpoint/CheckpointDelayComplete]]
> ![[Workloads/Activity/Maintenance/BgWriterHibernate]]
> ![[Workloads/IPC/Process/BackendTermination]]
---
## 7.8 - Internal Coordination (Parallelism & Shared Memory)
<img src="assets/arch_master_keys.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
### 1. The Symptom
"During our Black Friday sale, CPU usage hit 100% and queries slowed down from 1ms to 20ms. We upgraded to a 64-core machine, but it didn't help. We're not seeing any disk I/O or row locks."
### 2. The Physical Reality
Deep within the engine, even memory access must be coordinated. This is **The Bureaucratic Friction**—the microscopic LWLocks and IPC signals that ensure two processes don't try to write to the same memory slot or read the same snapshot simultaneously. These represent the "Master Keys" of the database.
When hundreds of parallel workers or connections attempt to read from the same memory block or update the same global array, they must queue at the microsecond level.
### 3. The Wait Signature
In `pg_stat_activity`, this state presents as an `IPC` or `LWLock` wait class:
- **`LWLock:ProcArray`**: The process is waiting to read or update the global array of active transactions.
- **`LWLock:BufferContent`**: The process is waiting to read or modify the contents of a specific 8KB page in shared memory.
- **`IPC:ExecuteGather`**: The leader of a parallel query is waiting for its workers to finish.
### 4. The Investigation
We check the global wait event distribution using the high-precision tracer:
```bash
# Measure the internal coordination bottleneck
sudo ./pg_wait_tracer --view system_event --interval 5 --count 1
```
```text
Wait Class  Wait Event     AAS   % DB Time   Wait (ms)
──────────  ──────────  ──────  ──────────  ──────────
LWLock      ProcArray    28.50       95.0%      142500
CPU         —             1.50        5.0%        7500
```
95% of cluster time is spent in `LWLock:ProcArray`. Only 5% is productive computation. No amount of CPU upgrade will help—the latch is the ceiling.
### 5. The Root Cause
You are suffering from a **ProcArray Stall**. To ensure MVCC (Multi-Version Concurrency Control), every query must ask, "Which other transactions are currently running?" To answer this, the engine scans the `ProcArray`—a shared memory structure tracking all active connections.
When you have 500 active connections, scanning the array takes longer. Worse, while one backend is updating the array (e.g., to commit), it takes an exclusive `LWLock` on it, forcing the other 499 backends to wait. Upgrading CPU cores often makes this *worse*, as more cores means more aggressive lock contention.
### 6. The Strategic Fix
The database has too many direct connections. You must implement a **Connection Pooler** (like PgBouncer) to multiplex thousands of application connections down to a small, fixed number of database connections (e.g., 50).
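Before (and after) introducing the pooler, quantify the problem: count how many client backends actually exist, and in what state, against the configured ceiling. A sketch:
```sql
-- How many client connections exist, and what are they doing?
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY count(*) DESC;

-- The configured ceiling the pooler should stay well below.
SHOW max_connections;
```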
---
### Diagnostic Reference — Evidence of Internal Friction
When the bottleneck is deep inside the engine's memory or execution fabric, you will see the following LWLocks and IPC signals.
#### Visibility & Snapshot Coordination
These events represent the engine's "Atomic Clock." They manage the global process array and subtransaction logs that determine which rows are visible to which transactions.
> [!info] Reference: Snapshots & Visibility
> ![[Workloads/LWLock/ProcArray/ProcArray]]
> ![[Workloads/LWLock/ProcArray/OldSnapshotTimeMapping]]
> ![[Workloads/LWLock/Subtrans/SubtransBuffer]]
> ![[Workloads/LWLock/Subtrans/SubtransSLRU]]
> ![[Workloads/IPC/System/SafeSnapshot]]
> ![[Workloads/IPC/System/ProcSignalBarrier]]
> ![[Workloads/IPC/Transaction/ProcArrayGroupUpdate]]
> ![[Workloads/IPC/Transaction/MultixactCreation]]
> ![[Workloads/LWLock/MultiXact/MultiXactGen]]
> ![[Workloads/LWLock/MultiXact/MultiXactMemberBuffer]]
> ![[Workloads/LWLock/MultiXact/MultiXactMemberSLRU]]
> ![[Workloads/LWLock/MultiXact/MultiXactOffsetBuffer]]
> ![[Workloads/LWLock/MultiXact/MultiXactOffsetSLRU]]
> ![[Workloads/LWLock/MultiXact/MultiXactTruncation]]
#### The Memory Floor: Buffers & Caches
Low-level memory locks that coordinate access to `shared_buffers` and the Catalog Cache. If you see `LWLock:BufferContent`, multiple backends are fighting over the exact same 8KB page in shared memory simultaneously (a "Hot Page").
> [!info] Reference: Buffer & Cache Coordination
> ![[Workloads/LWLock/Buffers/BufferContent]]
> ![[Workloads/LWLock/Buffers/BufferMapping]]
> ![[Workloads/IPC/Storage/BufferIO]]
> ![[Workloads/BufferPin/BufferPin]]
> ![[Workloads/LWLock/Buffers/SharedTidBitmap]]
> ![[Workloads/LWLock/Buffers/SharedTupleStore]]
> ![[Workloads/LWLock/Buffers/BtreeVacuum]]
> ![[Workloads/LWLock/Catalog/RelationMapping]]
> ![[Workloads/LWLock/Catalog/RelCacheInit]]
> ![[Workloads/LWLock/Catalog/OidGen]]
> ![[Workloads/LWLock/Catalog/SInvalRead]]
> ![[Workloads/LWLock/Catalog/SInvalWrite]]
> ![[Workloads/LWLock/Shmem/ShmemIndex]]
> ![[Workloads/LWLock/Shmem/AddinShmemInit]]
> ![[Workloads/LWLock/SeqScan/SyncScan]]
> ![[Workloads/LWLock/AIO/AioUringCompletion]]
> ![[Workloads/LWLock/AIO/AioWorkerSubmissionQueue]]
#### Parallel Query Coordination
Wait events related to the active execution of a parallel plan. `ExecuteGather` indicates the leader is waiting for workers. If this is high, you have **Parallel Skew**—one worker is stuck holding up the team.
> [!info] Reference: Query Execution Coordination
> ![[Workloads/IPC/Parallel/ExecuteGather]]
> ![[Workloads/IPC/Parallel/ParallelBitmapScan]]
> ![[Workloads/IPC/Parallel/ParallelCreateIndexScan]]
> ![[Workloads/LWLock/Parallel/ParallelBtreeScan]]
> ![[Workloads/LWLock/Parallel/ParallelHashJoin]]
#### Worker Lifecycle & Synchronization
Signals related to the startup, phase changes, and shutdown of parallel workers. `ParallelFinish` spikes when a query completes so quickly that the time spent managing workers exceeds the time spent processing rows.
> [!info] Reference: Lifecycle & Sync
> ![[Workloads/IPC/Parallel/ParallelFinish]]
> ![[Workloads/IPC/Parallel/AppendReady]]
> ![[Workloads/LWLock/Parallel/ParallelAppend]]
#### Shared Memory Management (DSM & DSA)
Parallel workers communicate through Dynamic Shared Memory (DSM) segments.
> [!info] Reference: Shared Memory Infrastructure
> ![[Workloads/IPC/Transaction/XactGroupUpdate]]
> ![[Workloads/LWLock/Parallel/ParallelQueryDSA]]
> ![[Workloads/LWLock/Parallel/ParallelVacuumDSA]]
> ![[Workloads/LWLock/Parallel/PerSessionDSA]]
> ![[Workloads/LWLock/Parallel/DSMRegistry]]
> ![[Workloads/LWLock/Parallel/DSMRegistryDSA]]
> ![[Workloads/LWLock/Parallel/DSMRegistryHash]]
#### Parallel Infrastructure: Hashing & Messaging
Wait points for the "Interconnect" of the parallel engine.
> [!info] Reference: Parallel Fabric
> ![[Workloads/IPC/Hash/HashBatchAllocate]]
> ![[Workloads/IPC/Hash/HashBatchElect]]
> ![[Workloads/IPC/Hash/HashBatchLoad]]
> ![[Workloads/IPC/Hash/HashBuildAllocate]]
> ![[Workloads/IPC/Hash/HashBuildElect]]
> ![[Workloads/IPC/Hash/HashBuildHashInner]]
> ![[Workloads/IPC/Hash/HashBuildHashOuter]]
> ![[Workloads/IPC/Hash/HashGrowBatchesDecide]]
> ![[Workloads/IPC/Hash/HashGrowBatchesElect]]
> ![[Workloads/IPC/MessageQueue/MessageQueueInternal]]
> ![[Workloads/IPC/MessageQueue/MessageQueuePutMessage]]
> ![[Workloads/IPC/MessageQueue/MessageQueueReceive]]
> ![[Workloads/IPC/MessageQueue/MessageQueueSend]]
> ![[Workloads/LWLock/BackgroundWorker/BackgroundWorker]]
> ![[Workloads/IO/DSM/DSMFillZeroWrite]]
> ![[Workloads/IO/DSM/DSMAllocate]]
#### Transaction Identity & Serializability
Wait events related to the generation of Transaction IDs and the enforcement of Serializable Snapshot Isolation (SSI).
> [!info] Reference: Identity & Consistency
> ![[Workloads/LWLock/Xact/XidGen]]
> ![[Workloads/LWLock/SSI/PredicateLockManager]]
> ![[Workloads/LWLock/SSI/SerialControl]]
> ![[Workloads/LWLock/SSI/SerializableFinishedList]]
> ![[Workloads/LWLock/SSI/SerializablePredicateList]]
> ![[Workloads/LWLock/SSI/SerializableXactHash]]
> ![[Workloads/LWLock/Serial/SerialBuffer]]
> ![[Workloads/LWLock/Serial/SerialSLRU]]
> ![[Workloads/LWLock/LockManager/LockManager]]
---
## 7.9 - Human-Caused Stalls (Administrative Blocking)
<img src="assets/arch_starvation.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
### 1. The Symptom
"We have a critical lock contention on the `animals` table, but the blocker PID 5432 shows a `wait_event` of `ClientRead`. How can a process be 'waiting for the client' while also blocking others?"
### 2. The Physical Reality
Not every stall is caused by the database itself. Sometimes Postgres is completely healthy, but the application cannot feed it work fast enough. This is **The Starvation**—the database engine sitting idle, waiting for the user or the application to send the next command. When this idle state occurs *inside* an open transaction, the client effectively holds database resources hostage while it performs external logic (like an API call).
### 3. The Wait Signature
In `pg_stat_activity`, this state presents as a `Client` wait class:
- **`Client:ClientRead`**: Postgres has finished its work and is waiting for the application to send the next query.
- **`Client:ClientWrite`**: Postgres is waiting for the network buffer to clear so it can send more data back to the client.
### 4. The Investigation
We check the state of the blocker and the cascade of victims using the high-precision tracer:
```bash
# Visualize the cascade of victims
sudo ./pg_wait_tracer --view active --count 1
```
```text
PID     State                Wait Event           Wait (ms)   Backend Type
──────  ───────────────────  ──────────────────  ──────────  ─────────────
5432    idle in transaction  Client:ClientRead            —   client
5501    waiting              Lock:transactionid     14200.0   client
5502    waiting              Lock:transactionid     12800.0   client
5503    waiting              Lock:transactionid     11500.0   client
```
The single forgotten `BEGIN` (PID 5432) is visible as a **Root Blocker** with three downstream victims.
### 5. The Root Cause
The state of PID 5432 is **`idle in transaction`**. The application opened a transaction, issued an `UPDATE` (which acquired a lock), and then "went away" to do something else without committing. Postgres is now starving—waiting for the client to finish the transaction while faithfully holding the exclusive lock required by the uncommitted UPDATE. The downstream clients are blocked on `Lock:transactionid`.
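In `pg_stat_activity`, these culprits are easy to enumerate directly; the longest-open transaction is the first one to chase. A sketch:
```sql
-- Find sessions idling inside an open transaction, oldest first.
SELECT pid,
       usename,
       now() - xact_start   AS transaction_age,
       now() - state_change AS idle_for,
       left(query, 60)      AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```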
### 6. The Strategic Fix
Set an **`idle_in_transaction_session_timeout`** at the database level. This ensures that any session that "forgets" to commit will be automatically terminated after a short window.
```sql
-- Terminate any session that stays idle in a transaction for > 1 minute
ALTER SYSTEM SET idle_in_transaction_session_timeout = '1min';
SELECT pg_reload_conf();
```
The database protects itself from "Human-Caused Stalls," ensuring that a single developer's forgotten `BEGIN` block cannot take down the production cluster.
> [!NOTE]
> **Recap**: `ClientRead` in an `active` state is normal. `ClientRead` in an `idle in transaction` state is a bug.
---
### Diagnostic Reference — Evidence of Application Starvation
When wait events are primarily in the `Client` or `Timeout` class, the bottleneck is external to the database engine.
#### Client Interactions
The primary wait points for the communication protocol between the client and the server. High AAS in the `Client` class usually points to application-side performance issues ("Think Time") or network bandwidth bottlenecks.
> [!info] Reference: Client Waits
> ![[Workloads/Client/ClientRead]]
> ![[Workloads/Client/ClientWrite]]
#### Intentional Throttling (Timeouts)
Wait points where the database intentionally pauses a process. This can be at the user's request (e.g., calling `pg_sleep`) or due to background throttling to prevent resource exhaustion during administrative maintenance (like Vacuum or Base Backups).
> [!info] Reference: Timeouts & Throttling
> ![[Workloads/Timeout/PgSleep]]
> ![[Workloads/Timeout/BaseBackupThrottle]]
> ![[Workloads/Timeout/CheckpointerWriteDelay]]
> ![[Workloads/Timeout/RecoveryApplyDelay]]
> ![[Workloads/Timeout/VacuumDelay]]
> ![[Workloads/Timeout/VacuumTruncate]]
---
## 7.10 - High-Precision Lab (pg_wait_tracer)
<img src="assets/arch_performance_audit.png" width="250" style="float: left; margin: 0 20px 20px 0;" />
In the previous sections, we used `pg_wait_tracer` to isolate specific bottlenecks in CPU, I/O, and Locking. But in a production environment, failures rarely travel alone. A slow disk (I/O) causes transactions to stay open longer, which creates row lock contention, which eventually saturates the connection pool.
This lab is the **Capstone Synthesis**. We will use high-precision diagnostics not just to find a bottleneck, but to untangle a **Cascading Failure** and execute a post-mortem replay.
---
### 🧪 Synthesis Challenge: The Cascading Collapse
**The Request**: "The whole site is slow. CPU is at 80%, Disk I/O is high, and there are hundreds of waiting connections. We don't know where to start."
#### The Naive Approach
Looking at a single signal. The "Sweat" (CPU) is high, so the developer assumes they need more cores. But the "Sigh" (I/O) is also high.
#### The Evidence (System Level)
We run the tracer with the **Time Model** view to see the primary signal across the entire cluster:
```bash
sudo ./pg_wait_tracer --view time_model --interval 10 --count 1
```
```text
Wait Class     AAS  % DB Time   CPU%   Wait%
──────────  ──────  ─────────  ─────  ──────
IO           15.20      45.0%     0%    100%
Lock         12.40      36.0%     0%    100%
CPU           6.40      19.0%   100%      0%
```
#### The Evidence (Process Level)
The Time Model shows I/O (45%) and Locking (36%) are fighting for dominance. We switch to the **Active view** to find the **Root Blocker**:
```bash
sudo ./pg_wait_tracer --view active --count 1
```
```text
PID     State    Wait Event           Wait (ms)   Query
──────  ───────  ──────────────────  ──────────  ────────────────────────
8500    active   IO:DataFileRead         8200.0   SELECT * FROM orders...
8501    waiting  Lock:transactionid      7500.0   UPDATE orders SET...
8502    waiting  Lock:transactionid      7100.0   UPDATE orders SET...
```
#### The Interpretation
The tracer exposes the **Chain of Causality**:
1. **Cause**: PID 8500 is running a massive, unoptimized Sequential Scan.
2. **Evidence (I/O)**: This scan is saturating the disk bandwidth (`IO:DataFileRead`).
3. **Evidence (Locking)**: Because the scan is so slow, the transaction it belongs to has stayed open, holding its row locks for 8 seconds.
4. **Evidence (Contention)**: PIDs 8501 and 8502 are `waiting` on those locks.
The Locking bottleneck is a *symptom* of the I/O bottleneck. No amount of CPU or Locking optimization will help. The only fix is to resolve the I/O stall.
#### The Strategic Fix
Add the missing index to the `orders` table.
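A sketch of that fix, assuming (hypothetically) that the runaway scan filters `orders` by a `customer_id` column; substitute the predicate from the real query. `CONCURRENTLY` builds the index without taking the very table lock we are trying to avoid:
```sql
-- Hypothetical: the slow scan filters orders by customer_id.
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id
    ON orders (customer_id);
```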
#### The Payoff
The I/O waits vanish, the root query completes in 5ms, the locks are released, and the 36% "Locking" wait time disappears instantly.
---
### 🔍 Post-Mortem: Replaying the Crash
One of the most powerful features of high-precision tracing is the ability to analyze a failure **after it has finished**. If `pg_wait_tracer` was running in **daemon mode** during the collapse, you can reconstruct the timeline of the crash.
**The Trigger**: The database crashed 10 minutes ago, and you need to prove what happened.
```bash
# Replaying the last 5 minutes of sampled data
sudo ./pg_wait_tracer --replay --window 5m
```
**The Interpretation**:
By replaying the trace, you can see the AAS "climb." If `CPU` rose first followed by `Lock:relation`, you had a code-path failure that turned into a locking failure. If `IO` rose first, you had a hardware or volume-saturation failure. The replay turns a mystery into a **Diagnostic Movie**.
---
### The Final Triage Protocol
When the system collapses, use this definitive sequence to restore order:
1. **Quantify the AAS**: Is the total AAS higher than your CPU core count? If yes, the database is "Over-Saturated."
2. **Identify the Class**:
* **CPU** → Complexity Failure (The query is working too hard).
* **IO** → Throughput Failure (The disk/memory floor is too low).
* **Lock** → Coordination Failure (The transaction window is too long).
* **LWLock** → Infrastructure Failure (The internal latches are jammed).
3. **Find the Root**: Use `--view active` to identify the PID at the head of the chain.
4. **Apply the Fix**:
* **CPU/IO** → Add Index / Optimize SQL.
* **Lock** → Terminate Blocker / Set Timeouts.
* **LWLock** → Connection Pooling / Tune Config.
> [!TIP]
> **The Golden Rule of Diagnostics**: Always solve for the "Sigh" (Wait Events) before you solve for the "Sweat" (CPU). A process that is waiting is a process that is failing to provide value.
---
## 7.11 - Summary (Wait Events & Concurrency)
> Latency is never a mystery; it is a strict accounting of time. A query is either burning cycles on the processor, or it is waiting—for a lock, for a disk block, or for the network. To optimize is simply to read the receipt.
<div style="page-break-after: always;"></div>