# CLI Redesign: Multi-Window, Event Hierarchy, ASH-like Query Analysis
## Context
The current CLI is TUI-only (screen clearing) and cumulative-only; time_model shows
wait classes but not individual events, and there is no way to investigate query↔event
relationships the way Oracle ASH does. It needs to support real DBA investigation workflows.
## Decisions Made
- Auto-detect TTY: terminal → `tui`, pipe → `text`
- 3 configurable time windows (not just delta+cumulative)
- `--count N` flag for exact snapshot control
- Prometheus: skip for now, `--format json` first
- CLI-first, daemon mode later (future phase)
- Auto-discover PG instance when only one cluster is running
- Semi-interactive "top-like" active sessions view (--sort flag, no ncurses)
- No query text initially — cmdline parsing only for backend info
- Query text from shared memory: planned (Phase E.2 — `st_activity` via `process_vm_readv`)
- Plan identifier from shared memory: planned (Phase E.3 — `st_plan_id`, PG18+ only)
---
## 1. Time Windows (`--window`)
3 configurable time windows. The DBA sees NOW, RECENTLY, and OVERALL.
```
--window 5s,1m,5m # default (first = --interval)
--window 10s,5m,30m # custom
--window 5s,1m,1h # longer history
```
First window always equals `--interval`. Format: `Ns`, `Nm`, `Nh`.
**Implementation**: Ring buffer of snapshots, one per tick. Size = largest_window / interval.
Each snapshot: time_model (88B) + compacted system_events (~5KB) + compacted query_events (~10KB).
For 1h at 5s: 720 * 15KB = 10.8MB. Acceptable.
Delta for window W = current_snapshot - snapshot_from_(W/interval)_ticks_ago.
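The window math above can be sketched in C. `pgwt_parse_window` is an illustrative name; `pgwt_ring_delta` matches the helper name referenced later in this document, though this body is a sketch, not the actual implementation:

```c
#include <stdlib.h>

/* Parse "Ns"/"Nm"/"Nh" into seconds; -1 on malformed input. */
static long pgwt_parse_window(const char *spec)
{
    char *end;
    long n = strtol(spec, &end, 10), mult;
    if (end == spec || n <= 0)
        return -1;
    switch (*end) {
    case 's': mult = 1;    break;
    case 'm': mult = 60;   break;
    case 'h': mult = 3600; break;
    default:  return -1;
    }
    return end[1] == '\0' ? n * mult : -1;
}

/* Ring of cumulative per-tick counters; head is the newest entry. */
struct pgwt_ring {
    long  *vals;
    size_t size;   /* largest_window / interval */
    size_t head;
    size_t count;  /* entries filled so far (assumed >= 1 here) */
};

/* Delta for window W: newest value minus the value (W / interval) ticks ago.
 * If the ring has not filled that far yet, fall back to the oldest entry. */
static long pgwt_ring_delta(const struct pgwt_ring *r, long window_s, long interval_s)
{
    size_t back = (size_t)(window_s / interval_s);
    if (back >= r->count)
        back = r->count - 1;
    return r->vals[r->head] - r->vals[(r->head + r->size - back) % r->size];
}
```

The same delta function serves every window, which is what makes replay reuse (section 14) cheap.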
---
## 2. Enhanced time_model: Event Hierarchy
Current time_model shows only class-level totals (Wait: IO, Wait: LWLock). The DBA
needs to see which specific events drive each class. Show top events per class as
indented subcategories:
```
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Time Model Backends: 12 Uptime: 32m 15s
════════════════════════════════════════════════════════════════════════════════
Stat Name Last 5s % DB Last 1m % DB Last 5m % DB
──────────────────────────────────────────────────────────────────────────────────────
DB Time 5088.6 100.0% 62340.1 100.0% 312450.8 100.0%
CPU 1498.6 29.4% 13712.3 22.0% 72312.1 23.1%
IO 862.3 16.9% 12340.5 19.8% 98234.2 31.4%
IO:DataFileRead 621.2 12.2% 9823.1 15.8% 82123.4 26.3%
IO:DataFileWrite 198.4 3.9% 2012.3 3.2% 12345.6 4.0%
IO:WALSync 42.7 0.8% 505.1 0.8% 3765.2 1.2%
LWLock 423.1 8.3% 4923.4 7.9% 21234.5 6.8%
LWLock:WALInsert 312.3 6.1% 3812.1 6.1% 16234.2 5.2%
LWLock:BufferContent 110.8 2.2% 1111.3 1.8% 5000.3 1.6%
Lock 312.4 6.1% 2123.1 3.4% 8921.3 2.9%
Lock:Transaction 301.2 5.9% 2023.4 3.2% 8512.1 2.7%
Client 88.2 1.7% 1512.3 2.4% 6234.1 2.0%
Client:ClientRead 88.2 1.7% 1512.3 2.4% 6234.1 2.0%
(Activity/Idle) 12560.4 — 62340.1 — 312450.8 —
```
This is the "one view to rule them all" — a DBA opens it and immediately knows:
- Which classes are hot (IO 16.9%)
- Which specific events drive each class (DataFileRead 12.2%)
- How load shifts across windows (DataFileRead averaged 26.3% over the last 5m vs 12.2% in the last 5s → improving)
**Design choices:**
- Show the top events per class (configurable via `--top N`, default 3)
- Only show events contributing >= 1% of DB Time (avoid clutter)
- Classes with 0 time are hidden (current behavior)
- CPU has no sub-events (it's not a wait class), shown as single line
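The per-class selection described by these design choices can be sketched as below (hypothetical names; assumes events have already been grouped by class and `db_time_ms` is the window's DB Time):

```c
#include <stdlib.h>
#include <string.h>

struct ev { const char *name; double ms; };

static int ev_cmp_desc(const void *a, const void *b)
{
    double d = ((const struct ev *)b)->ms - ((const struct ev *)a)->ms;
    return (d > 0) - (d < 0);
}

/* Keep the top `top_n` events contributing at least min_pct of DB Time.
 * Returns how many entries of `evs` survive (sorted in place, descending). */
static size_t top_events(struct ev *evs, size_t n, size_t top_n,
                         double db_time_ms, double min_pct)
{
    qsort(evs, n, sizeof(*evs), ev_cmp_desc);
    size_t kept = 0;
    for (size_t i = 0; i < n && kept < top_n; i++) {
        if (db_time_ms > 0 && 100.0 * evs[i].ms / db_time_ms >= min_pct)
            kept++;
        else
            break;   /* sorted, so everything after is smaller */
    }
    return kept;
}
```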
---
## 3. ASH-like Query Event Analysis
Oracle ASH lets DBAs answer two questions:
1. "Which queries are causing this wait event?" (event → queries)
2. "What is this query waiting on?" (query → events)
### New `--query-id` filter flag
```
--query-id <ID> Filter query_event view to one query
```
### query_event view modes
**Mode A: Default — top query-event combinations (current behavior)**
```bash
sudo pg_wait_tracer --pid 12345 --view query_event
```
```
query_id Wait Event Waits Total (ms) Avg (us) Max (us) % DB
5678234567890123 IO:DataFileRead 2340 1024.5 437.8 45623.1 4.1%
1234567890123456 Lock:Transaction 145 892.3 6153.8 892100.5 3.6%
5678234567890123 LWLock:WALInsert 890 334.2 375.5 8934.2 1.3%
```
**Mode B: Filter by event — top queries for a specific wait event**
```bash
sudo pg_wait_tracer --pid 12345 --view query_event --event IO:DataFileRead
```
```
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Top Queries for IO:DataFileRead Backends: 12
════════════════════════════════════════════════════════════════════════════════
query_id Waits Total (ms) Avg (us) Max (us) % Event % DB
5678234567890123 2340 1024.5 437.8 45623.1 31.4% 4.1%
9876543210987654 456 201.8 442.5 5432.1 6.2% 0.8%
1111222233334444 123 89.2 725.2 3214.5 2.7% 0.4%
```
Shows "% Event" = fraction of this event's total time. DBA sees: "query 5678... is responsible
for 31.4% of all DataFileRead time."
**Mode C: Filter by query_id — all events for one query**
```bash
sudo pg_wait_tracer --pid 12345 --view query_event --query-id 5678234567890123
```
```
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Wait Profile for query_id 5678234567890123 Backends: 12
════════════════════════════════════════════════════════════════════════════════
Wait Event Waits Total (ms) Avg (us) Max (us) % Query % DB
CPU 5432 1892.3 348.5 12340.1 58.2% 7.5%
IO:DataFileRead 2340 1024.5 437.8 45623.1 31.5% 4.1%
LWLock:WALInsert 890 334.2 375.5 8934.2 10.3% 1.3%
```
Shows "% Query" = fraction of this query's total time. DBA sees: "this query spends
58.2% of its time on CPU and 31.5% on DataFileRead."
### query_event with time windows
query_event should also support 3 windows when `--window` is set:
```
──── Last 5s ────────────────────────────────────────────────
query_id Wait Event Waits Total (ms) % DB
5678234567890123 IO:DataFileRead 23 10.2 0.2%
──── Last 1m ────────────────────────────────────────────────
query_id Wait Event Waits Total (ms) % DB
5678234567890123 IO:DataFileRead 234 102.5 0.2%
1234567890123456 Lock:Transaction 12 89.3 0.1%
```
---
## 4. Histogram Windows
Show 3 latency distributions side-by-side (one per window):
```bash
sudo pg_wait_tracer --pid 12345 --view histogram --event IO:DataFileRead --window 5s,1m,5m
```
```
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Histogram: IO:DataFileRead Uptime: 32m 15s
════════════════════════════════════════════════════════════════════════════════
Bucket(us) Last 5s % Last 1m % Last 5m %
──────────────────────────────────────────────────────────────────────────
<1 12 1.5% 123 1.3% 612 0.7%
1- 2 45 5.5% 512 5.3% 4312 5.1%
2- 4 183 22.3% 2132 22.0% 18234 21.5%
4- 8 210 25.6% 2505 25.8% 22123 26.1%
8- 16 153 18.6% 1843 19.0% 16234 19.2%
16- 32 98 11.9% 1234 12.7% 10812 12.8%
32- 64 54 6.6% 623 6.4% 5234 6.2%
64-128 31 3.8% 378 3.9% 3123 3.7%
128-256 17 2.1% 198 2.0% 1812 2.1%
256-512 9 1.1% 95 1.0% 812 1.0%
512-1K 5 0.6% 42 0.4% 412 0.5%
1K- 2K 3 0.4% 18 0.2% 198 0.2%
>=16K 1 0.1% 1 0.0% 12 0.0%
```
DBA reads: "Distribution is stable across windows — no latency shift."
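The power-of-two bucketing behind these rows can be sketched as follows (assumption: the tool uses floor-log2 microsecond buckets, as the bucket labels suggest):

```c
#include <stdint.h>

/* Map a latency in microseconds to its power-of-two bucket index:
 * bucket 0 = <1us, bucket 1 = 1-2us, bucket 2 = 2-4us, ...
 * Capped at max_bucket (the ">=" overflow row). Sketch only; the
 * real tool's bucket layout may differ. */
static unsigned lat_bucket(uint64_t us, unsigned max_bucket)
{
    unsigned b = 0;
    while (us) { us >>= 1; b++; }   /* b = floor(log2(us)) + 1 for us >= 1 */
    return b > max_bucket ? max_bucket : b;
}
```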
---
## 5. Auto-Discovery of PostgreSQL Instance
Currently `--pid` or `--pgdata` is required. For ad-hoc DBA use, auto-detect the
PostgreSQL postmaster when neither is specified:
```
# Auto-detect: finds the single running postmaster
sudo pg_wait_tracer
# Explicit PID (unchanged)
sudo pg_wait_tracer --pid 12345
# Via PGDATA (unchanged)
sudo pg_wait_tracer --pgdata /var/lib/pgsql/17/data
```
**Algorithm** (port from `tests/testutil.sh:find_postmaster()`):
1. `pgrep -x postgres` → list candidate PIDs
2. Filter children: skip PIDs whose parent comm is also "postgres"
3. If exactly 1 postmaster → use it automatically
4. If multiple postmasters → list them with version + PGDATA and FATAL:
```
Multiple PostgreSQL instances found:
PID 1234 PG17 /var/lib/pgsql/17/data
PID 5678 PG18 /var/lib/pgsql/18/data
Use --pid <PID> or --pgdata <DIR> to select one.
```
5. If none found → FATAL: "No running PostgreSQL instance found"
**Version detection**: extract from exe path (`/usr/pgsql-17/bin/postgres` or
`/usr/lib/postgresql/17/bin/postgres`) via `readlink /proc/PID/exe`.
**PGDATA detection**: read `/proc/PID/environ` for `PGDATA=`, or parse cmdline
for `-D` flag.
**Do we need both `--pid` and `--pgdata`?** Keep both:
- `--pid` for multi-instance hosts (direct, unambiguous)
- `--pgdata` for systemd integration and scripts (reads `postmaster.pid`)
- Auto-detect for the common single-instance ad-hoc case
---
## 6. Active Sessions View (`--view active`)
A "top-like" refreshing view of currently active backends. This is what a DBA
opens first to see what's happening right now.
```bash
sudo pg_wait_tracer --view active
sudo pg_wait_tracer --view active --sort wait_time
sudo pg_wait_tracer --view active --sort db_time
```
```
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Active Sessions Backends: 12/100 Uptime: 32m 15s
════════════════════════════════════════════════════════════════════════════════
PID State Wait Event Wait (ms) DB Time (ms) Backend Type
────────────────────────────────────────────────────────────────────────────────
34521 waiting Lock:Transaction 8923.1 12450.3 client backend
34587 waiting IO:DataFileRead 3.2 8234.1 client backend
34602 on cpu — — 5123.4 client backend
34534 waiting LWLock:WALInsert 0.8 4892.1 client backend
34498 waiting Client:ClientRead 1234.5 3421.2 client backend
34612 idle — — — client backend
34701 active — — — autovacuum worker
34702 active — — — wal writer
```
**Columns:**
- **PID**: Backend OS PID
- **State**: `on cpu` | `waiting` | `idle` | `active` (from BPF tracing state)
- **Wait Event**: Current wait event (if waiting), `—` otherwise
- **Wait (ms)**: How long in current wait state (from BPF timestamp delta)
- **DB Time (ms)**: Total DB Time for this backend (cumulative)
- **Backend Type**: From cmdline parsing (`client backend`, `autovacuum worker`,
`wal writer`, `checkpointer`, `bgwriter`, `walreceiver`, etc.)
**Sorting** (`--sort` flag):
```
--sort wait_time Sort by current wait duration (default)
--sort db_time Sort by cumulative DB Time
--sort pid Sort by PID
--sort event Sort by wait event name
```
**Semi-interactive**: Not full ncurses. Refreshes with screen clear (TUI mode),
supports `--sort` flag on command line. No runtime key bindings for now.
**Backend info from cmdline**: PostgreSQL writes the backend type into
`/proc/PID/cmdline` (e.g., `postgres: autovacuum worker`). Parse this for the
Backend Type column. No query text — just type identification.
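The classification can be sketched like this (the title table is illustrative; exact process titles vary across PG versions, and the caller is assumed to have already read and NUL-joined `/proc/PID/cmdline`):

```c
#include <string.h>

/* Classify a backend from its process title ("postgres: <type> ...",
 * as PostgreSQL writes into argv). Sketch only. */
static const char *backend_type(const char *title)
{
    static const char *types[] = {
        "autovacuum worker", "checkpointer", "background writer",
        "walwriter", "wal writer", "walreceiver", "logical replication",
    };
    const char *p = strstr(title, "postgres: ");
    if (!p)
        return "unknown";
    p += strlen("postgres: ");
    for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++)
        if (strncmp(p, types[i], strlen(types[i])) == 0)
            return types[i];
    return "client backend";   /* "postgres: user db host ..." */
}
```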
**Future**: Read query text from shared memory (`PgBackendStatus.st_activity`)
for an additional column showing the current SQL statement.
---
## 7. Session Event Windowing
**Summary mode** (default `--view session_event`): Shows only the latest interval.
No time windows — the per-backend snapshot is already interval-sized and useful as-is.
**Detail mode** (`--view session_event --pid-filter <PID>`): Shows time windows
for a single backend. This is for deep-diving one specific problematic session.
```bash
sudo pg_wait_tracer --view session_event --pid-filter 34521 --window 5s,1m,5m
```
```
════════════════════════════════════════════════════════════════════════════════
pg_wait_tracer — Session Detail: PID 34521 Backend: client backend
════════════════════════════════════════════════════════════════════════════════
Wait Event Last 5s % Last 1m % Last 5m %
──────────────────────────────────────────────────────────────────────────
Lock:Transaction 4921.3 96.8% 8923.1 51.2% 12450.3 28.1%
IO:DataFileRead 98.2 1.9% 5234.1 30.0% 18234.1 41.2%
CPU 64.1 1.3% 3271.8 18.8% 13621.2 30.7%
```
DBA reads: "PID 34521 is currently stuck on Lock:Transaction (96.8% of last 5s),
but historically it was mostly doing DataFileRead (41.2% of last 5m)."
---
## 8. Filtering
### `--class` filter
```
--class IO Show only IO events in system_event
--class IO,LWLock Show only IO and LWLock events
```
### `--min-pct` threshold
```
--min-pct 1 Hide events below 1% of DB Time (reduce clutter)
```
### `--top N` for time_model subcategories
```
--top 5 Show top 5 events per class in time_model (default: 3)
```
---
## 9. `--format` flag
```
--format tui Screen-clearing interactive (default for terminal)
--format text No screen clear, timestamp per interval (default for pipes)
--format json JSONL — one JSON object per interval
--format csv Flat rows, one per event per interval
```
Auto-detect: `isatty(stdout)` → tui, else → text.
**JSON includes everything** — all windows, event hierarchy, query details.
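The auto-detect itself is a one-liner; a sketch with hypothetical names (the real enum and function names may differ):

```c
#include <unistd.h>

enum pgwt_format { PGWT_FMT_TUI, PGWT_FMT_TEXT, PGWT_FMT_JSON, PGWT_FMT_CSV };

/* Default format for an output fd when -f is not given: a terminal gets
 * the screen-clearing TUI, anything else (pipe, file, cron) gets text.
 * Call as pgwt_default_format(STDOUT_FILENO). */
static enum pgwt_format pgwt_default_format(int fd)
{
    return isatty(fd) ? PGWT_FMT_TUI : PGWT_FMT_TEXT;
}
```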
---
## 10. `--count N` flag
```
--count 1 One-shot: collect for one interval, print, exit
--count 10 Print 10 intervals then exit
```
---
## 11. Full CLI Summary
```
Usage: pg_wait_tracer [OPTIONS]
Target (auto-detected when omitted and exactly one instance is running):
-p, --pid <PID> Postmaster PID
-D, --pgdata <DIR> PGDATA directory (reads postmaster.pid)
Views:
-V, --view <VIEW> time_model | system_event | session_event |
histogram | query_event | active
Output control:
-f, --format <FMT> tui | text | json | csv (default: auto-detect)
-i, --interval <SEC> Refresh interval (default: 5)
-d, --duration <SEC> Stop after N seconds
-n, --count <N> Stop after N intervals
  -w, --window <W1,W2,W3>  Time windows (default: 5s,1m,5m; first = --interval)
Filters:
-e, --event <NAME> Event filter (histogram: required; query_event: filter by event)
-P, --pid-filter <PID> Show detail for specific backend (session_event)
-Q, --query-id <ID> Filter query_event to one query
-C, --class <CLASS> Filter by wait class (system_event)
--min-pct <N> Hide events below N% of DB Time
--top <N> Top events per class in time_model (default: 3)
--sort <COL> Sort column for active view (wait_time|db_time|pid|event)
Other:
-v, --verbose Verbose output to stderr
-h, --help Show this help
```
---
## 12. Implementation Order
| Phase | What | Effort |
|-------|------|--------|
| **Phase 1** | Auto-discovery + `--count` + `--format` infrastructure + TTY auto-detect | Medium |
| **Phase 2** | Snapshot ring buffer + `--window` parsing + delta computation | Medium |
| **Phase 3** | Enhanced time_model with event hierarchy (subcategories) | Medium |
| **Phase 4** | Multi-window time_model (side-by-side columns) | Medium |
| **Phase 5** | Multi-window system_event (3 sections) | Small |
| **Phase 6** | ASH-like query_event (--event filter, --query-id filter) | Medium |
| **Phase 7** | Multi-window histogram | Small |
| **Phase 8** | Active sessions view (--view active, --sort, cmdline parsing) | Medium |
| **Phase 9** | Session event windowing (--pid-filter with windows) | Small |
| **Phase 10** | Text format (no screen clear, timestamps) | Small |
| **Phase 11** | JSON format | Medium |
| **Phase 12** | CSV format | Small |
| **Phase 13** | Filtering (--class, --min-pct, --top) | Small |
| **Phase 14** | Update README + tests | Medium |
| **Phase 15** | Recording & replay (`--record`, `--replay`, `--from`, `--to`) | Medium |
| **Phase 16** | SQL query text exposure (shared memory via `process_vm_readv`, per section 15) | Medium-Large |
---
## 13. Files to Modify
| File | Changes |
|------|---------|
| `src/pg_wait_tracer.h` | Format/window/sort enums, new CLI field structs, auto-detect flag |
| `src/pg_wait_tracer.c` | Parse all new flags, TTY auto-detect, postmaster auto-discovery |
| `src/daemon.h` | format, count, windows[], ring_buffer, query_id filter, sort to pgwt_daemon |
| `src/daemon.c` | Snapshot per tick, count exit, format dispatch, active sessions collect |
| `src/map_reader.h` | Snapshot struct, ring buffer, window delta functions |
| `src/map_reader.c` | Snapshot save, window delta computation |
| `src/output.h` | Format-aware signatures, active sessions output |
| `src/output.c` | Event hierarchy in time_model, multi-window columns/sections, ASH query modes, histogram windows, active sessions view, session detail windows, conditional screen clear, text format |
| `src/cmdline.c` | **New** — Parse /proc/PID/cmdline for backend type |
| `src/cmdline.h` | **New** — Backend type enum and parser |
| `src/output_json.c` | **New** — JSON formatter |
| `src/output_csv.c` | **New** — CSV formatter |
| `src/recording.c` | **New** — Snapshot recording and replay |
| `src/recording.h` | **New** — Recording file format and API |
| `src/query_text.c` | **New** — Query text reader (shmem or BPF, Phase 16) |
| `src/query_text.h` | **New** — Query text API |
| `Makefile` | Add new .c files |
---
## 14. Future: Recording & Replay (SAR-like Time Travel)
Record snapshots to disk for offline analysis. Like `sar -o`/`sar -f`, the DBA can
capture a performance recording during a problem window, then analyze it later — or
share it with another DBA for review.
### Recording Mode
```bash
# Record snapshots to a binary file (runs like normal, also writes to disk)
sudo pg_wait_tracer --record perf_issue.pgwt --interval 5 --duration 3600
# Record with specific views' data (all snapshot data is always recorded)
sudo pg_wait_tracer --record overnight.pgwt --interval 5
```
Each tick writes one `pgwt_snapshot` (time_model + system_events + query_events) plus
a timestamp header to the file. File format:
```
[file header: magic, version, interval, PG version, start time]
[snapshot 0: timestamp + pgwt_snapshot]
[snapshot 1: timestamp + pgwt_snapshot]
...
```
File size: ~15KB per snapshot. At 5s interval: 720 snapshots/hour = ~10.8MB/hour.
24 hours = ~260MB. Acceptable for investigation recordings.
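A sketch of the on-disk layout described above; the field set, the magic value, and `rec_write` are illustrative, not the final format:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PGWT_REC_MAGIC 0x54574750u   /* "PGWT" little-endian */

struct pgwt_file_header {
    uint32_t magic;
    uint32_t version;        /* file format version */
    uint32_t interval_s;
    uint32_t pg_version;     /* e.g. 170004 */
    uint64_t start_time_ns;
};

struct pgwt_rec_header {
    uint64_t timestamp_ns;
    uint32_t snapshot_len;   /* bytes of pgwt_snapshot that follow */
    uint32_t reserved;
};

/* Append one timestamped snapshot record; 0 on success. */
static int rec_write(FILE *f, uint64_t ts_ns, const void *snap, uint32_t len)
{
    struct pgwt_rec_header h = { ts_ns, len, 0 };
    if (fwrite(&h, sizeof(h), 1, f) != 1)
        return -1;
    return fwrite(snap, 1, len, f) == len ? 0 : -1;
}
```

Keeping per-record lengths in the header lets a future version grow `pgwt_snapshot` without breaking old readers.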
### Replay Mode
```bash
# Replay entire recording with default view
sudo pg_wait_tracer --replay perf_issue.pgwt
# Replay a specific time range
sudo pg_wait_tracer --replay perf_issue.pgwt --from "2025-01-15 14:00" --to "14:05"
# Replay with specific view and windows
sudo pg_wait_tracer --replay perf_issue.pgwt --from "14:00" --to "14:30" \
--view query_event --window 5s,1m,5m
# One-shot summary of a time range
sudo pg_wait_tracer --replay perf_issue.pgwt --from "14:00" --to "14:05" --count 1
```
Replay loads snapshots from file into the ring buffer, then renders views using the
same delta logic as live mode. `--from`/`--to` select the time range. Without them,
replays from start to end.
### Key Design Points
- **Same views, same code**: Replay populates the ring buffer from file instead of
from BPF maps. All output functions (`pgwt_print_*`) work unchanged.
- **No BPF needed for replay**: Replay is read-only, no root required for viewing.
- **Delta computation reuse**: `pgwt_ring_delta()` already computes arbitrary deltas.
Replay just needs to load the right snapshots into the ring.
- **Active sessions not replayable**: The active view shows real-time per-backend state
from BPF state_map, which is not captured in snapshots. Recording could optionally
store per-backend state too (future extension).
### New CLI Flags
```
--record <FILE> Write snapshots to file while tracing
--replay <FILE> Replay from recorded file (no BPF, no root needed for viewing)
--from <TIME> Start time for replay (ISO 8601 or HH:MM)
--to <TIME> End time for replay
```
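`--from`/`--to` parsing might look like this sketch (the accepted formats are an assumption based on the examples above; `struct hhmm` and the function name are hypothetical):

```c
#include <stdio.h>
#include <string.h>

/* Parsed --from/--to timestamp; year == 0 means bare "HH:MM" and the
 * caller fills in the date from the recording's start time. */
struct hhmm { int year, mon, day, hour, min; };

static int parse_replay_time(const char *s, struct hhmm *t)
{
    memset(t, 0, sizeof(*t));
    if (sscanf(s, "%d-%d-%d %d:%d",
               &t->year, &t->mon, &t->day, &t->hour, &t->min) == 5)
        return (t->hour < 24 && t->min < 60) ? 0 : -1;
    memset(t, 0, sizeof(*t));   /* first attempt may have written year */
    if (sscanf(s, "%d:%d", &t->hour, &t->min) == 2)
        return (t->hour < 24 && t->min < 60) ? 0 : -1;
    return -1;
}
```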
### Implementation Approach
1. Define binary file format with header (magic, version, interval, metadata)
2. In `handle_timer()`, after ring push, also write snapshot to file if recording
3. New `pgwt_replay_load()` that reads file, populates ring buffer for time range
4. Replay main loop: step through snapshots, render view per tick (or single summary)
5. File ~200-300 lines: `src/recording.c` / `src/recording.h`
---
## 15. SQL Query Text Exposure
**Decision made**: Option A (shared memory) — read `PgBackendStatus.st_activity`
via `process_vm_readv()`. Captures the actual running SQL (not normalized).
See [ROADMAP.md](ROADMAP.md) Phase E.2 for full implementation plan.
Goal: show the currently executing SQL statement for each backend. Two implementation
approaches (Options A and B below) were evaluated before settling on A; both are kept
here for the record.
### Where Query Text Appears
- **Active sessions view**: New `Query` column (truncated to ~60 chars) showing the
current statement for each backend
- **query_event view**: Optionally map `query_id` → query text for display
- **session_event detail**: Show current query for the filtered PID
Example active view with query text:
```
PID State Wait Event Wait (ms) DB Time (ms) Query
────────────────────────────────────────────────────────────────────────────────
34521 waiting Lock:Transaction 8923.1 12450.3 UPDATE accounts SET ba...
34587 waiting IO:DataFileRead 3.2 8234.1 SELECT * FROM orders W...
34602 on cpu — — 5123.4 INSERT INTO logs (ts, ...
34612 idle — — — —
```
### Option A: Shared Memory (`PgBackendStatus.st_activity`)
PostgreSQL maintains `PgBackendStatus` in shared memory for each backend. The
`st_activity` field contains the current query text (up to `track_activity_query_size`,
default 1024 bytes). This is what `pg_stat_activity.query` reads.
**How it works:**
1. Find the `BackendStatusArray` symbol in the postgres binary (ELF lookup, like we
do for `my_wait_event_info`)
2. For each backend PID, read `PgBackendStatus.st_activity` from `/proc/PID/mem`
or via `process_vm_readv()`
3. Display truncated text in output
**Pros:**
- Simpler implementation (~100-150 lines)
- No additional BPF programs needed
- Reads the same data `pg_stat_activity` shows — well-understood semantics
- Works for all backend types that set `st_activity`
**Cons:**
- Shared memory layout is PG-version-dependent (struct offsets change between major
versions). Need per-version offset tables or runtime discovery via DWARF/debuginfo.
- `BackendStatusArray` is a static variable — need to resolve its address per version.
May require debuginfo packages on some distros.
- Read is a point-in-time snapshot — slight race with backend updating the field.
Acceptable for display purposes (same race `pg_stat_activity` has).
- `track_activity_query_size` limits text length (default 1024, configurable).
**Implementation sketch:**
```c
/* BackendStatusArray is a pointer variable: resolve its address from
 * ELF symbols, then read the pointer value (it points into shmem). */
uintptr_t sym_addr = pgwt_resolve_symbol("BackendStatusArray");
uintptr_t status_array;
struct iovec pl = { &status_array, sizeof(status_array) };
struct iovec pr = { (void *)sym_addr, sizeof(status_array) };
process_vm_readv(pid, &pl, 1, &pr, 1, 0);

/* For each backend slot, read st_activity; NUL-terminate defensively */
struct iovec local  = { buf, 1024 };
struct iovec remote = { (void *)(status_array + idx * sizeof_entry + activity_offset),
                        1024 };
if (process_vm_readv(pid, &local, 1, &remote, 1, 0) > 0)
    buf[1023] = '\0';
```
### Option B: eBPF Uprobe on Query Execution
Attach a uprobe to a query execution function (e.g., `exec_simple_query()`,
`PortalRun()`, or `pgstat_report_activity()`) and capture the query string via
`bpf_probe_read_user()`.
**How it works:**
1. Attach uprobe to `exec_simple_query(const char *query_string)` — called for
every simple query protocol message
2. In the BPF program, read the query string argument and store it in a per-PID
hash map (`query_text_map`)
3. Userspace reads `query_text_map` alongside other maps during each tick
**Pros:**
- No shared memory layout dependency — reads function arguments directly
- Version-resilient: `exec_simple_query` signature has been stable for 15+ years
- Natural fit with existing BPF infrastructure (same attach/read pattern)
- Can also hook extended query protocol (`exec_parse_message`, `exec_bind_message`)
**Cons:**
- `bpf_probe_read_user()` has a size limit in BPF programs (typically ~256 bytes per
read, can do multiple reads but adds complexity). Long queries get truncated.
- BPF hash map value size must be fixed at compile time. Storing 1024 bytes per PID
in a BPF map is expensive: 1024 * max_backends. With 1000 backends = ~1MB map.
- Extended query protocol: prepared statements execute via `PortalRun()` where the
query text isn't a direct argument — need to hook `exec_parse_message` to capture
the text at parse time and correlate with portal execution.
- `pgstat_report_activity()` is called more often but has the text as an argument —
could be a simpler single hook point that covers both protocols.
- Additional uprobe = additional overhead (though likely negligible — one read per
query start, not per wait event).
**Implementation sketch:**
```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Attached to exec_simple_query(const char *query_string) */
SEC("uprobe/exec_simple_query")
int handle_query(struct pt_regs *ctx)
{
    const char *query = (const char *)PT_REGS_PARM1(ctx);
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    char buf[256] = {};   /* zero-fill: the verifier rejects map updates
                           * from partially initialized stack memory */
    bpf_probe_read_user_str(buf, sizeof(buf), query);
    bpf_map_update_elem(&query_text_map, &pid, buf, BPF_ANY);
    return 0;
}
```
### Decision Criteria
| Factor | Option A (shmem) | Option B (eBPF) |
|--------|-------------------|-----------------|
| Implementation complexity | Medium | Medium-Large |
| PG version sensitivity | High (struct layout) | Low (function signature stable) |
| Debuginfo dependency | Likely needed | Not needed |
| Text length | Full (1024 default) | Limited (~256 per read) |
| Overhead | Minimal (read on tick) | Minimal (one write per query) |
| Extended query protocol | Handled (always in st_activity) | Needs extra hooks |
| Consistency with tool | Different pattern (shmem) | Same pattern (BPF) |
**Decision**: Option A chosen. `st_activity` gives the actual running SQL with real
parameter values, which is more useful for debugging than normalized queries. We
deduplicate by `query_id` and keep the first-seen text as representative.
---
## 16. Future: Daemon Mode (not in Phase 1-16)
After the CLI is feature-complete, add a daemon mode for continuous monitoring:
**Architecture:**
```
pg_wait_tracer --daemon # background process, BPF tracing + ring buffer
pg_wait_tracer --view time_model # CLI client, connects to daemon via Unix socket
```
- Daemon runs as a long-lived process, maintains the ring buffer and BPF programs
- CLI client connects to daemon, requests specific views with filters
- Avoids re-attaching BPF probes for every ad-hoc query
- Multiple CLI clients can connect simultaneously
- PG extension wrapper: a thin SQL interface that talks to the daemon
**Benefits:**
- Starts alongside PostgreSQL (systemd companion service)
- No BPF attach/detach overhead for each investigation
- Foundation for Prometheus exporter (daemon exposes metrics endpoint)
- PG extension can expose views like `pg_wait_tracer_time_model()`
**Protocol**: Simple request/response over Unix domain socket. Client sends view
name + filters as JSON, daemon responds with JSON data. CLI client formats output.
**Not in scope for Phase 1-16** — this is a separate future project after the CLI
redesign is complete and validated.
---
## 17. Query Plan Identification
**Decision made**: Capture `st_plan_id` from `PgBackendStatus` in shared memory (PG18+).
See [ROADMAP.md](ROADMAP.md) Phase E.3 for full implementation plan.
**Key finding**: PostgreSQL 18 added `int64 st_plan_id` to `PgBackendStatus` right
after `st_query_id`. It's readable from shared memory via the same `process_vm_readv()`
technique — essentially free (extend existing read by 8 bytes).
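Assuming `st_query_id`'s offset within the entry is known (from the per-version offset tables of section 15), the extended read is a single call. `read_query_plan_ids` and its parameters are hypothetical:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read st_query_id and the adjacent st_plan_id (PG18+) in one call.
 * entry_addr and query_id_off are illustrative inputs, resolved by
 * the symbol/offset machinery described in section 15. */
static int read_query_plan_ids(pid_t pid, uintptr_t entry_addr,
                               size_t query_id_off,
                               int64_t *query_id, int64_t *plan_id)
{
    int64_t ids[2];
    struct iovec local  = { ids, sizeof(ids) };
    struct iovec remote = { (void *)(entry_addr + query_id_off), sizeof(ids) };
    if (process_vm_readv(pid, &local, 1, &remote, 1, 0) != (ssize_t)sizeof(ids))
        return -1;
    *query_id = ids[0];
    *plan_id  = ids[1];
    return 0;
}
```

On PG17 the caller would pass a length of 8 instead, since `st_plan_id` does not exist there.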
**Limitation**: Core PG does **not** compute `planId` automatically. Users need a
`planner_hook` extension (e.g., `pg_store_plans`, `pg_stat_sql_plans`). Without one,
`st_plan_id` is always 0. On PG17, the field does not exist.
**Full execution plans are not feasible** to capture passively:
- Plan tree is in backend-local memory (complex pointer-chasing structure)
- Serializing from eBPF is impractical (dozens of node types)
- `auto_explain` captures actual plans but only to PG log (complex parsing)
- `pg_store_plans` stores plan text indexed by (queryid, planid) — best option
if users want the actual plan text alongside the plan hash
**Practical investigation flow with plan_id**:
1. See wait time spike in AAS chart
2. Drill to query causing the spike
3. See `plan_id` changed at the spike time → plan regression confirmed
4. Use `pg_store_plans` or `EXPLAIN` to examine the specific plan
---
## 18. Verification
```bash
# Auto-detect single instance
sudo ./pg_wait_tracer
# Current behavior unchanged
sudo ./pg_wait_tracer --pid 12345
# Enhanced time_model with event subcategories
sudo ./pg_wait_tracer --pid 12345 --view time_model
# Multi-window
sudo ./pg_wait_tracer --pid 12345 --window 5s,1m,5m
# Active sessions (top-like)
sudo ./pg_wait_tracer --view active
sudo ./pg_wait_tracer --view active --sort db_time
# Session detail with windows
sudo ./pg_wait_tracer --view session_event --pid-filter 34521 --window 5s,1m,5m
# ASH-like: which queries cause DataFileRead?
sudo ./pg_wait_tracer --pid 12345 --view query_event --event IO:DataFileRead
# ASH-like: what does this query wait on?
sudo ./pg_wait_tracer --pid 12345 --view query_event --query-id 5678234567890123
# Histogram comparison across windows
sudo ./pg_wait_tracer --pid 12345 --view histogram --event IO:DataFileRead --window 5s,1m,5m
# One-shot JSON
sudo ./pg_wait_tracer --pid 12345 --format json --count 1 --interval 10
# Pipe to file (auto text format)
sudo ./pg_wait_tracer --pid 12345 --count 5 > output.log
# Multiple instances: auto-detect fails, shows list
sudo ./pg_wait_tracer
# → "Multiple PostgreSQL instances found: ..."
# Recording & replay (SAR-like)
sudo ./pg_wait_tracer --record /tmp/perf.pgwt --interval 5 --count 10
./pg_wait_tracer --replay /tmp/perf.pgwt
./pg_wait_tracer --replay /tmp/perf.pgwt --from "14:00" --to "14:05" --view query_event
# Active sessions with query text (Phase 16)
sudo ./pg_wait_tracer --view active --show-query
# Existing tests
sudo tests/run_all.sh
```