# Chapter 8: Distributed Scaling & Clouds ## 8.0 - Distributed Storage (The Elephant in the Clouds) <img src="assets/chap_7_cloud_scales.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> For a long time, the PostgreSQL architecture was strictly monolithic. Postgres lived on a single server, managing its own local memory, local CPU, and a physical inventory of local disks. Scaling was **Vertical**: you simply bought a larger machine. ### What You'll Learn - The difference between **Vertical** and **Horizontal** scaling in Postgres - How **Read Replicas** use WAL streaming to distribute read workloads - How **Compute/Storage Decoupling** enables cloud-native features (instant branching, auto-scaling) - The **Replication Lag** trade-off: scale vs. immediate global consistency But as global data volumes increased, the architectural boundary of the single server became a bottleneck. In this chapter, we explore how the durability of the **Write-Ahead Log (WAL)** allows Postgres to transcend the limits of a single machine. By decoupling Postgres's "Brain" (Compute) from its "Memory" (Storage), we can build distributed architectures that span multiple availability zones and scale independently of physical disk constraints. ### The Axis of Scale Traditional scaling involves two primary directions: - **Vertical Scaling (Up)**: Increasing the resources of a single machine (CPU, RAM). This is bounded by the current limits of physical hardware and often requires downtime for upgrades. - **Horizontal Scaling (Out)**: Adding more nodes to the system. In Postgres, this is achieved by broadcasting the WAL stream to secondary nodes, creating a team of **Read Replicas** that share the read-only workload. ### Compute/Storage Decoupling The most radical evolution in modern Postgres is the complete separation of the database engine from the underlying storage layer. By intercepting Postgres's WAL stream and shipping it across the network to specialized storage nodes, cloud-native architectures (like **Amazon Aurora**, **Google AlloyDB**, and **Neon**) can perform architectural miracles: - **Instant Branching**: Cloning a 10TB database in seconds by creating a virtual pointer to a shared storage layer. - **Auto-Scaling Storage**: A database that grows dynamically across a distributed swarm of disks without ever requiring manual disk expansion. - **High Availability**: If the primary compute node fails, a new one can be "re-hydrated" from the distributed storage layer almost instantly. The database engine has effectively reached a new state of architectural maturity. It still processes queries and makes plans, but it has delegated the mechanical labor of data persistence to an infinite, resilient, and distributed storage layer. > [!WARNING] > **The Price of Distribution**: When you scale out, you must manage **Replication Lag**. Because data takes time to travel the network, secondary nodes may be milliseconds (or seconds) behind the primary. This introduced the concept of **Eventual Consistency**β€”a technical tradeoff between scale and immediate global visibility. --- ## 8.1 - Read Replicas (The Shouting Gallery) <img src="assets/arch_read_replicas_v2.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> To scale a database horizontally, you can add **Read Replicas**. These are additional Postgres instances that replay the primary's writes to serve read-only traffic. This architecture distributes read load while maintaining a single authoritative primary for write operations. However, it introduces the architectural challenge of **Replication Lag**, where data takes time to travel over the network and be applied to the replica's storage. ### The Mechanism: WAL Shipping Postgres implements **Physical Replication** through **WAL Shipping**. Every modification to the database is first recorded in the **[[Manuscript/05 - Durability & Transactions/5.1 - WAL & fsync (The Durable Ledger)|Write-Ahead Log (WAL)]]**. The primary's **Walsender** process transmits these records over the network to the replica's **Walreceiver** process. The replica operates in **Hot Standby** mode. It has no authority to perform write operations; it strictly replays the primary's WAL into its local data pages to maintain consistency. ### Replication Modes: Sync vs. Async The coordination between the Primary and Standby is governed by the `synchronous_commit` and `synchronous_standby_names` settings. - **Asynchronous (Default)**: The Primary commits the transaction and returns success to the application immediately after writing its local WAL. It transmits the WAL stream to replicas in the background. If the Primary fails before the record is received by a standby, data may be lost. - **Synchronous**: The Primary waits for one or more replicas to confirm receipt of the WAL record before returning success to the application. This ensures zero data loss (RPO = 0) but increases transaction latency by the duration of the network round-trip. ### The Read-Only Constraint Replicas are strictly **Read-Only**. Any attempt to perform DML (`INSERT`, `UPDATE`, `DELETE`) or DDL (`CREATE TABLE`) on a replica will result in an error. The replica is a deterministic follower whose state is entirely driven by the primary's WAL stream. ### Monitoring Health: Replication Lag **Lag** is the distance between the Primary's state and the Replica's state, measured in bytes (LSN distance) or time. You can monitor this from the primary: ```sql SELECT application_name, pg_size_pretty(sent_lsn - replay_lsn) AS lag_bytes FROM pg_stat_replication; ``` Postgres tracks the replication gap through four LSN (Log Sequence Number) markers: 1. **`sent_lsn`**: The Primary sent the WAL record over the network. 2. **`write_lsn`**: The Standby received the WAL in its OS buffer. 3. **`flush_lsn`**: The Standby persisted the WAL to disk via `fsync`. 4. **`replay_lsn`**: The Standby applied the WAL to its data pages, making the change visible to local queries. On the replica node itself, you can measure the "time" delay: ```sql -- On the REPLICA: How old is the most recent data? SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay; ``` --- > [!WARNING] > **Read-Your-Writes Consistency**: If an application writes to the Primary and immediately reads from a Standby, it may find the data missing due to replication lag. Applications requiring immediate consistency should route reads to the Primary or use synchronous replication for critical paths. --- ## 8.1.1 - synchronous_commit (The Durability Gradient) <img src="assets/arch_durability_gradient.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> In a distributed cluster, the definition of a "Permanent Record" is not a binary state. Postgres allows you to tune the exact moment a transaction is considered successfully committed, trading off **Durability** for **Performance**. This is controlled by the `synchronous_commit` setting. ### The Spectrum of Certainty When an application sends a `COMMIT` command, the primary node must decide how many "echoes" it needs to hear from its replicas before it tells the user "Success." 1. **`off` (Async/Local)**: The primary returns success as soon as the WAL is in the buffer, without waiting for the local disk `fsync`. This is the fastest setting but carries a risk of losing the last few milliseconds of data if the primary crashes. 2. **`local`**: The primary waits for its own local disk to confirm the `fsync` has finished. Replicas are completely ignored for the commit decision. 3. **`remote_write` (Waits for `write_lsn`)**: The primary waits for the replica to acknowledge that it has received the WAL and written it to its **operating system memory** (but not necessarily to its physical disk). 4. **`on` (Waits for `flush_lsn`)**: The primary waits for the replica to confirm the WAL has been **bolted down to disk** (`fsync` performed on the replica). This is the default setting for synchronous replication. 5. **`remote_apply` (Waits for `replay_lsn`)**: The primary waits for the replica to confirm the WAL has been **replayed**. This ensures that anyone reading from the replica immediately after the commit will see the change (Zero "Read-Your-Writes" lag). ### The Cost of the Echo Every level of certainty adds **Latency**. If you use `synchronous_commit = on`, every single transaction must pay the **Network Round-Trip Time (RTT)** to the replica. If your replica is in a different data center 10ms away, your minimum transaction time is now 10ms, regardless of how fast your CPU is. > [!IMPORTANT] > **Wait Event Analysis**: In `pg_stat_activity`, a process waiting for a synchronous commit will show the wait event **`SyncRep`**. If you see this event dominating your workload, your network is the bottleneck, not your database. ### Choosing Your Safety | Level | Risk | Performance | Use Case | | :------------- | :---------------------------- | :---------- | :-------------------------------------------- | | `off` | Loss of ~10-100ms on crash | Extreme | Non-critical logs, transient state. | | `local` | Loss of data if Primary dies | High | Standard async replication; focus on speed. | | `on` | Data is safe if Primary dies | Medium | Financial transactions; high-value records. | | `remote_apply` | Zero "Read-Your-Writes" lag | Low | When replicas MUST be identical to primary. | By understanding the durability gradient, you can ensure that your database is only as slow as your data's value requires it to be. ### πŸ§ͺ Lab Challenge: The Durability Gradient (Sync vs. Async) **The Request**: "Our database is extremely slow during peak hours. Every simple `INSERT` is taking 25ms, even though the disk is an NVMe." #### The Investigation Check the wait events in `pg_stat_activity` for active writers: ```sql SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state = 'active' AND query LIKE 'INSERT%'; ``` **Result**: ```text pid | wait_event_type | wait_event | query -------+-----------------+------------+------------------ 105 | IPC | SyncRep | INSERT INTO... ``` #### The Diagnosis The process is stalled on **`SyncRep`**. This means your database is technically fast, but it is waiting for a **Network Echo** from a synchronous replica. The 25ms latency isn't the diskβ€”it's the distance between your primary and your replica. #### The Lazy Fix If the data is non-critical (like transient logs or analytics), lower the durability requirements for that specific session: ```sql -- Switch to Async mode for this session only SET synchronous_commit = 'off'; INSERT INTO logs (...) VALUES (...); ``` #### The Reward The transaction returns in under 1ms. By moving from a **Synchronous Flush** (waiting for the replica) to a **Local Buffer** (writing to memory and returning), you have bypassed the network bottleneck. > [!CAUTION] > **The Trade-off**: With `synchronous_commit = off`, you risk losing the last ~100ms of data if the primary crashes. Only use this for workloads where speed is more valuable than absolute persistence. --- ## 8.1.2 - xmin Horizon (Hot Standby Feedback) <img src="assets/arch_hot_standby_xmin.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> Physical replication creates a unique conflict. The primary node is responsible for cleaning up "dead" tuples (**Vacuum**), but it has no way of knowing if a replica node is currently reading those exact same tuples in a long-running query. This leads to the **Recovery Conflict**. ### The Conflict of Interest As we learned in **[[Manuscript/02 - Physical Storage & MVCC/2.5 - MVCC (The Sharpie Ledger)|Chapter 2.5]]**, tuples are marked as "dead" but not immediately removed. Eventually, the **[[Manuscript/06 - Resource Management & Processes/6.4 - Vacuum & Freezing (The Housekeepers)|Vacuum]]** process reclaims those tuples to reuse the space. On a replica: 1. The primary reclaims a tuple and records the operation in the WAL. 2. The replica is executing a reporting query that requires that specific tuple version. 3. The replica receives the WAL instruction to remove the tuple. If the replica removes the tuple immediately, the query fails with: `ERROR: canceled on conflict with recovery`. ### The Solution: Hot Standby Feedback To prevent this, the replica can send a continuous status report back to the primary. This is enabled via the `hot_standby_feedback = on` setting. The replica communicates its **`xmin` horizon**β€”the ID of the oldest transaction it is currently viewing. The primary monitors this report and instructs its own Vacuum process to defer reclamation of any tuples newer than this horizon. ### The Trade-off: Bloat Propagation While feedback prevents query failures on replicas, it introduces a risk of **Table Bloat**. If a replica has a long-running query that pins the `xmin` horizon to a very old transaction, the primary's Vacuum cannot reclaim space. Dead tuples begin to accumulate on the **primary** node. > [!CAUTION] > **The Bloat Anchor**: A single long-running query on a **replica** can cause your **primary** server to run out of disk space. You must monitor `pg_stat_replication` for old `backend_xmin` values to ensure a replica isn't inadvertently paralyzing primary maintenance. By monitoring the horizon, you balance the needs of your long-running reporting queries against the physical health of the primary server's disk. --- ## 8.2 - Cloud Storage (Compute and Storage Separation) <img src="assets/arch_cloud_wal.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> In a traditional database server, the **[[Manuscript/06 - Resource Management & Processes/6.0 - Memory & Disk (The Hierarchy of Inertia)|Storage Layers]]** are physical entities bolted to the hardware. If the server burns down, the local machine's context is gone. The cloud rejects this physical dependency. In a traditional Postgres deployment, the **[[Manuscript/02 - Physical Storage & MVCC/2.3 - The Page (The Shipping Container)|Filesystem]]** is a local resource. Data must move through the OS kernel and local I/O controller to reach physical disk. Modern cloud-native solutions (such as **Amazon Aurora**, **Google AlloyDB**, and **Neon**) dismantle this architecture by implementing **Compute/Storage Separation**. ### The Architecture of Separation Instead of managing a local disk, Postgres (Compute) communicates over the network with a distributed storage fleet. The fundamental innovation of these systems is that the storage layer of the cloud **speaks the language of WAL**. When a change is made: 1. The Compute node generates a **WAL record** with a specific **LSN (Log Sequence Number)**. 2. Instead of writing this to a local block device, the compute node ships the WAL record to multiple storage nodes simultaneously. 3. The storage node receives the WAL, acknowledges the write, and then asynchronously applies the WAL to its own versions of the physical data pages. In this model, the **Log is the Database**. The physical data pages (page) are merely a cache of the applied WAL records. ### The Cloud-Native Pioneers By decoupling Compute from Storage, cloud providers have engineered specialized architectures that transcend traditional hardware limits. #### 1. Amazon Aurora: Log-Structured Fault Tolerance In a traditional architecture, Postgres must write both the WAL and the actual data pages to disk, causing heavy Checkpoint and Full-Page Write I/O. Aurora completely eliminates this. Aurora's storage layer spans three Availability Zones (AZs). Every single piece of data is sliced into 10GB segments, and each segment is replicated 6 times (2 copies per AZ). The compute node only sends WAL records over the network. The intelligent storage nodes asynchronously "materialize" the database pages in the background, drastically reducing network I/O and entirely eliminating the traditional Checkpoint bottleneck. #### 2. Neon: Scale-to-Zero and Git-Like Branching Because Neon's compute nodes (the Postgres executors) are entirely stateless and store absolutely zero durable data, they can be treated as ephemeral workers. If a database receives no traffic, Neon automatically spins the compute instance completely down to zero, saving costs. Furthermore, because the underlying storage layer strictly tracks the WAL as the definitive sequence of state over time, Neon supports instantaneous **Copy-on-Write Database Branching**. You can instantly "fork" a production database via API. Just like a Git branch, the new fork merely points to the existing storage history without duplicating it. #### 3. Google AlloyDB: HTAP and the Columnar Layer Standard Postgres stores data exclusively in a row-based format, which is excellent for transactional inserts (OLTP) but terribly inefficient for scanning large datasets (OLAP). While AlloyDB shares the decoupled storage philosophy, its crowning innovation is its **intelligent, self-managing caching hierarchy**. It uses active machine learning models to analyze query patterns in real-time, automatically identifying data frequently accessed for analytical reporting. It then fluidly converts those specific columns into a highly optimized, in-memory **Columnar Layer**, allowing AlloyDB to seamlessly handle Hybrid Transactional and Analytical Processing (HTAP) workloads without requiring manual ETL pipelines. By embracing the decoupling of the log from the heap, Postgres transitions from a single-node engine into a distributed architecture capable of independent scaling, instantaneous branching, and near-infinite durability. --- ## 8.3 - Table Partitioning (Splitting the Table) <img src="assets/arch_partitioning_diverse_v2_secretary_bird_1776815795478.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> As a table grows into the multi-terabyte range, maintenance operations like index rebuilding and `VACUUM` become increasingly expensive. **Declarative Partitioning** addresses these challenges by splitting a single logical table into multiple physical child tables. > [!TIP] > **The Search Speed Myth**: Partitioning is primarily a **Management** tool, not a performance booster for simple lookups. A well-designed B-Tree index is already logarithmic (`O(log n)`) and remains fast even at scale. Partitioning only helps if you can "Prune" or if the table is so massive that the maintenance of a single index causes critical resource exhaustion. ### The Partitioning Architecture Partitioning uses a "Parent" table as a logical routing layer. The actual data resides in smaller "Child" tables, or partitions. Postgres automatically routes inserts and queries to the correct partition based on the **Partition Key**. ### Partition Pruning The primary performance benefit of partitioning is **Partition Pruning**. The planner examines the constraints of each partition and immediately excludes those that cannot contain the requested data. This appears in the **[[Operations/_Operations|Execution Plan]]** as excluded subplans. Instead of scanning a massive unified index, the engine only initializes the indexes for the relevant partitions. ### Technical Constraints and Tradeoffs #### 1. Partition Key Selection Selecting an appropriate partition key is critical for performance and maintenance. - **Range Partitioning**: Often used for time-series data (`delivery_time`). It enables lock-free data archival via `ALTER TABLE ... DETACH PARTITION CONCURRENTLY`, allowing old data to be unhooked without blocking live inserts. - **List Partitioning**: Used for categorical data, such as `region_id` or `status`. - **Hash Partitioning**: Distributes rows across a fixed number of partitions to balance write load when natural range or list keys are absent. #### 2. The Fan-out Penalty If a query does not filter on the partition key, the engine must scan every partition. This "Fan-out" incurs overhead for initializing multiple plan nodes and can be slower than scanning a single large table. #### 3. Co-location of Joins Joining partitioned tables is efficient only if they share the same key and alignment. Misaligned partitions force cross-partition joins, which can degrade performance into a combinatorial bottleneck. #### 4. Indexing and Constraints Indexes in Postgres are local to each partition. There is no global index spanning all partitions. Crucially, any **Unique Constraint** (including Primary Keys) must include the partition key to ensure the engine can enforce uniqueness across the entire partitioned set. ### Implementation: Partitioning supply_deliveries We can migrate an existing table to a partitioned structure by creating a parent table and attaching child partitions. ```sql -- 1. Create the Parent Table (The Logical Router) CREATE TABLE supply_deliveries_partitioned ( id BIGINT GENERATED ALWAYS AS IDENTITY, supplier_id INT NOT NULL, ingredient_id INT NOT NULL, delivery_time TIMESTAMPTZ NOT NULL, quantity_kg NUMERIC(8,2) NOT NULL, PRIMARY KEY (id, delivery_time) ) PARTITION BY RANGE (delivery_time); -- 2. Create the Partitions (The Storage Children) CREATE TABLE supply_deliveries_2024 PARTITION OF supply_deliveries_partitioned FOR VALUES FROM ('2024-01-01') TO ('2025-01-01'); -- 3. Transition the data: The Swap BEGIN; ALTER TABLE supply_deliveries RENAME TO supply_deliveries_old; ALTER TABLE supply_deliveries_partitioned RENAME TO supply_deliveries; COMMIT; ``` Postgres routes records transparently to the appropriate partition based on `delivery_time`. This approach replaces a single massive index with smaller, more efficient ones. Maintenance operations like `VACUUM` are now localized, preventing background tasks from exhausting system-wide resources. --- ## 8.4 - Connection Pooling (The Waiting Room) <img src="assets/arch_crowded_hallway.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> When an application scales, the number of database connections can increase rapidly. While application servers often handle thousands of concurrent users, Postgres is constrained by its **Process-per-Connection architecture**. Unlike databases that use lightweight threads, Postgres performs a Unix `fork()` to create a dedicated OS process for every client. This design provides excellent memory isolation and crash containment. However, it imposes significant overhead in memory usage and process management during traffic spikes. > [!NOTE] > **In PostgreSQL Terms** > * **Connection**: A physical OS process spawned by Postgres to serve a client. > * **Pooler**: A proxy (like PgBouncer) that multiplexes many lightweight client connections onto a few heavy database connections. > * **ProcArray**: The shared memory structure tracking all active backends. --- ### The Process-per-Connection Cost When a client connects, the `postmaster` process creates a dedicated backend. Each backend consumes physical resources: 1. **Memory Overhead**: Every process requires private RAM for bookkeeping, sorting, and metadata. 2. **Fork Latency**: Creating a process is a kernel-level operation that takes milliseconds rather than microseconds. 3. **Context Switching**: High process counts force the OS to spend more resources swapping backends in and out of the CPU than executing queries. --- ### The ProcArray Bottleneck High connection churn β€” frequently opening and closing connections β€” can bottleneck the engine regardless of available RAM. Every backend must register itself in a global shared-memory list called the **`ProcArray`**. Postgres acquires an exclusive lock on this list whenever a connection state changes. Because every active query must read the `ProcArray` to build its MVCC visibility snapshot, connection churn stalls all running queries. The visibility logic becomes stuck behind the same global lock used for connection management. --- ### The Cost of Idleness The most resource-inefficient state for a Postgres server is having thousands of connections in the **`idle`** state. An idle connection performs no work but continues to hold its memory, its `ProcArray` slot, and a share of the `max_connections` limit. ```sql -- How many idle vs. active backends right now? SELECT state, count(*) FROM pg_stat_activity GROUP BY state; ``` The `idle in transaction` state is particularly hazardous. These backends hold open visibility snapshots that prevent `VACUUM` from reclaiming old tuples, leading to table bloat. ```sql -- Identify long-lived idle-in-transaction backends SELECT pid, state, now() - xact_start AS xact_duration, query FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_duration DESC; ``` If 90% of your connections are `idle`, you're paying a tax in memory and coordination for processes doing nothing. --- ### The Proxy Layer: Connection Pooling To manage high connection counts, we use a **Connection Pooler** such as **PgBouncer**. The pooler acts as a multiplexer between application clients and the database. It maintains a lobby of client connections but only assigns them to a Postgres backend when a query is active. This decouples **Client State** from **Server State**, allowing thousands of idle clients to be served by a small number of backend processes. #### Pooling Modes | Mode | Behavior | Best For... | | :--- | :--- | :--- | | **Session** | The backend is assigned for the duration of the client session. | Apps requiring session-local state (e.g., temp tables). | | **Transaction** | The backend is assigned only for the duration of a transaction. | **Standard recommendation**. Maximizes connection reuse. | | **Statement** | The backend is assigned for a single statement. | High-volume workloads with zero session-state needs. | --- ### Layered Defense: Application vs. Proxy Professional systems use two layers of pooling to maintain stability: 1. **Application-Side Pooling**: Libraries like `HikariCP` maintain a small pool within your app instance. This eliminates the initial connection overhead for that specific instance. 2. **Proxy-Side Pooling**: A global proxy like **PgBouncer** sits in front of the database. It can accept 10,000 connections from your app nodes and map them to just 100 actual Postgres processes using **Transaction Pooling**. --- ### Recap > [!CAUTION] > **The Hazard of Hidden State: Prepared Statements** > > In `transaction` pooling mode, the pooler can swap your session to a different backend between queries, and any **session-local state** is lost. If you execute `PREPARE my_query AS ...` in one transaction and `EXECUTE my_query` in the next, you may receive a "prepared statement does not exist" error β€” the second transaction was routed to a different backend that was never taught your statement. > > To use Transaction Pooling safely, either: > 1. Use a pooler with statement tracking (e.g., Odyssey). > 2. Disable server-side prepared statements in your driver (e.g., `?prepareThreshold=0` in JDBC, or `pgbouncer=true` for some Node drivers). > [!NOTE] > - **Connections are expensive** because they are full OS processes. > - **Idle connections** are not free; they consume memory and `ProcArray` coordination. > - **`idle in transaction`** is worse: it pins snapshots and blocks Vacuum. > - **Transaction pooling** is the standard way to scale to thousands of application nodes. --- ## 8.5 - High Availability (Failover and Consensus) <img src="assets/arch_ha_governor.png" width="250" style="float: left; margin: 0 20px 20px 0;" /> Horizontal scaling solves for **Throughput** (Read Replicas) and **Management** (Partitioning). But it does not solve for **Durability of Service**. ### The Election (Automatic Failover) In a high-availability cluster, the Primary node is constantly sending a **Heartbeat** to its replicas and a distributed consensus store (typically `etcd`, `Consul`, or `ZooKeeper`). When the primary stops beating, the cluster goes through a deterministic sequence: 1. **Detection**: replicas and the orchestrator (e.g., **Patroni**) notice that the heartbeat lease in `etcd` has not been renewed within `ttl` seconds (default 30s). 2. **Liveness loss**: the consensus store revokes the primary's leader key. No node holds the leader role for the duration of the election. 3. **Candidate selection**: each replica reports its `pg_last_wal_replay_lsn()` to the orchestrator. The candidate with the highest replayed LSN is selected to be promoted; ties are broken by `priority` configuration. 4. **Fencing of the old primary**: before promoting, the orchestrator asserts that the old primary cannot still be writing (more on this in the next section). 5. **Promotion**: the chosen replica is sent `pg_promote()`, which writes a `XLOG_PROMOTE` record, exits recovery, and starts accepting writes. 6. **Routing**: the VIP, the load balancer, or the connection pooler is updated to send new traffic to the new primary. With **HAProxy** + a Patroni REST endpoint, this is automatic β€” HAProxy polls each candidate's `GET /primary` and only routes to the node that returns 200. 7. **Old primary, when it returns**, is restarted as a replica using `pg_rewind` to discard the unreplicated tail of its WAL and re-attach below the new primary. Most of this completes in 30–60 seconds for a well-tuned Patroni cluster. The three knobs that dominate that latency are the consensus store's lease TTL, the orchestrator's loop interval, and how aggressively the load balancer health-checks the candidates. > [!CAUTION] > **The HA Data-Loss Trap**: automated failover inherits the replication semantics from **[[Manuscript/08 - Distributed Scaling & Clouds/8.1.1 - synchronous_commit (The Durability Gradient)|Chapter 8.1.1]]**. With the default asynchronous replication and a hardware-loss event, the replica with the "highest LSN" is still missing the milliseconds of WAL the primary committed and `fsync`'d but had not yet shipped. Promoting that replica recovers service but **acknowledges the loss of those committed transactions**. If zero data loss is required, HA must be paired with `synchronous_commit = on` and `synchronous_standby_names`. Anything weaker is an availability trade β€” be explicit about the bound. ### The Terror of the Split-Brain The most dangerous scenario in a distributed database is a **Split-Brain**: a network partition splits the cluster into two halves, both halves declare themselves leader, and both accept writes. The cluster now holds two divergent histories that cannot be merged automatically β€” you have to pick one and manually replay or discard transactions from the other. The mechanism that prevents split-brain is **Fencing** the former primary before any new primary is promoted. There are three commonly-deployed flavors: | Flavor | How it works | Where you see it | | :--------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------- | | **STONITH** | Power-cycle or kernel-panic the old primary via IPMI / iLO / DRAC. Brutal but conclusive β€” if the box is off, it cannot accept writes. | Bare-metal HA, traditional pacemaker clusters | | **Storage fencing** | Revoke the old primary's lease on the underlying storage volume (Ceph RBD watch lock, AWS EBS detach, SAN reservation). | Cloud-storage HA, Patroni + Spilo on K8s | | **Application-side fencing** | The pooler or load balancer refuses to route traffic to any node that can't prove it holds the consensus-store leader key. | PgBouncer + Patroni REST; HAProxy + check | Patroni's default model is application-side fencing: it relies on **`pg_rewind` + the consensus key** to make a re-attached former primary safe. The promoted replica is the only node holding the `etcd` leader key; the former primary, even if it briefly returns thinking it's still in charge, finds its key gone, demotes itself, and waits to be re-attached. ### Quorum: The Rule of Three A consensus algorithm (Raft for `etcd`, ZAB for ZooKeeper) needs a strict majority β€” `floor(N/2) + 1` β€” to elect a leader and to commit a key. With `N=2`, a single network partition means *neither* half has a majority and the cluster halts. With `N=3`, one node can fail and the remaining two still form a majority. The practical implications: - **Three nodes minimum** for the consensus tier (`etcd`/`Consul`). Five if the cluster spans regions and you can afford the cross-region write latency. - **Postgres tier β‰  consensus tier**. The 3-of-5 etcd cluster does not have to match the topology of the Postgres cluster (which is often 1 primary + 2 replicas). - **Witness nodes** are a budget option: a third "witness" node runs only `etcd` (no Postgres) to break ties between two Postgres servers. This is a common pattern in two-data-center setups where running a third primary-capable node is too expensive. ### A Worked Topology A typical 3-region production deployment looks like this: ``` [ App tier, multiple AZs ] | [ HAProxy / VIP ] | +---------------------------+---------------------------+ | | | us-east-1a us-east-1b us-east-1c β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Patroni β”‚ β”‚ Patroni β”‚ β”‚ Patroni β”‚ β”‚ + Postgresβ”‚ ◄─async WAL──│ + Postgresβ”‚ ◄─async WAL──│ + Postgresβ”‚ β”‚ (Primary) β”‚ β”‚ (Replica) β”‚ β”‚ (Replica) β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ └───── etcd member ─────────┴────── etcd member β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ etcd member ``` `etcd` runs as a 3-node cluster co-located with the Patroni nodes, but with its own data directory and its own peer-to-peer port. HAProxy in front of the Postgres tier polls each Patroni's `GET /primary` HTTP endpoint every two seconds; only the node returning 200 gets traffic. When the AZ holding the primary loses power: 1. The other two `etcd` nodes still form a majority and revoke the primary's lease after the TTL (≀30s). 2. Patroni on the two surviving Postgres nodes compares LSNs over its REST API. 3. The candidate with the higher LSN runs `pg_promote()`. 4. HAProxy's next health check (within 2s) starts routing to the new primary. 5. When the dead AZ recovers, the former primary's Patroni notices its leader key is gone, runs `pg_rewind` against the new primary, and resumes life as a replica. End-to-end recovery time is dominated by the etcd TTL plus the load balancer's health-check interval β€” typically 30–60 seconds. By implementing this stack, you move from a brittle system dependent on a single physical server to a resilient, distributed architecture that survives the total loss of a node β€” and, with care, an entire data center β€” through automated coordination and consensus. --- ## 8.6 - Summary (Distributed Scaling & Clouds) > Scale is the deliberate introduction of distance. To break the limits of a single machine, you must accept the latency of coordination. You can duplicate the data or you can divide it, but you cannot outrun the speed of light. <div style="page-break-after: always;"></div>