# 3.2.3 Aggregations

![The Prep Station](assets/arch_prep_station.png)

Once raw suitcases have been fetched and joined, the final stage of the Query Algebra is to summarize and sort the resulting stream. In Postgres, this is the responsibility of the **Aggregate Nodes**. These nodes are the "summarizers" of the database—responsible for the `SUM`, `COUNT`, and `GROUP BY` logic. They work by iteratively applying a **Transition Function ($S_{func}$)** to a **Transition State ($S_{trans}$)** held in memory until every tuple in the group has been processed. Let's see how the Prep Station squashes our Cafe's data for the Maitre D'.

#### 1. HashAggregate: The Memory-Intensive Lookup

If the incoming data is unsorted, Postgres uses a **HashAggregate** node. It builds a **Hash Table** in memory where every unique grouping key becomes a bucket, and each bucket stores the current **Transition State** for that group. As suitcases stream through, $S_{func}$ is executed to update the state in the corresponding bucket.

```sql
-- Squashing 10 million individual deliveries into totals per ingredient
EXPLAIN SELECT ingredient_id, sum(quantity_kg)
FROM supply_deliveries
GROUP BY ingredient_id;

-- Result (The Bucket Toss):
-- HashAggregate  (cost=105.30..106.55 rows=100 width=36)
--   Group Key: ingredient_id
```

> [!TIP]
> **The Memory Limit**: The success of a HashAggregate depends entirely on **[[Chapter 5/5.3 - The Private Desk (Work Mem)|work_mem]]**. If the number of unique groups is so large that the Hash Table exceeds the allocated memory, the meticulous database engine must "spill to disk"—writing intermediate buckets to the Frozen Pantry, which dramatically increases execution time.

#### 2. GroupAggregate: The Streaming Summary

If the data arrives already sorted (either via a previous Sort node or an Index Scan), Postgres uses a **GroupAggregate**. This node is far more memory-efficient than a HashAggregate because it only needs to keep track of the *current* group.
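Before looking at the sorted case, the HashAggregate loop described above can be sketched in a few lines of Python. This is a simplified illustration of the strategy, not the actual Postgres implementation; the function and variable names (`s_func`, `hash_aggregate`, the `(ingredient_id, quantity_kg)` tuples) are hypothetical.

```python
# Hypothetical sketch of the HashAggregate strategy: one transition
# state per grouping key, updated in place as tuples stream through.
# (Illustrative only -- not the actual Postgres implementation.)

def s_func(state, quantity_kg):
    """Transition function for sum(): fold one tuple into the state."""
    return state + quantity_kg

def hash_aggregate(tuples):
    """tuples: iterable of (ingredient_id, quantity_kg) in ANY order."""
    buckets = {}  # the in-memory hash table: group key -> transition state
    for ingredient_id, quantity_kg in tuples:
        state = buckets.get(ingredient_id, 0)  # initial state for sum() is 0
        buckets[ingredient_id] = s_func(state, quantity_kg)
    return buckets  # groups can only be emitted after ALL input is consumed

deliveries = [(1, 2.5), (2, 1.0), (1, 4.0), (3, 0.5), (2, 2.0)]
print(hash_aggregate(deliveries))  # {1: 6.5, 2: 3.0, 3: 0.5}
```

Note that `buckets` grows with the number of *unique* groups, which is exactly why a HashAggregate is bounded by `work_mem`.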
When the value of the grouping column changes, it knows the previous group is finished and can emit the result immediately.

```sql
-- The ingredients arrive already prepped (sorted)
CREATE INDEX idx_supply_ingredient ON supply_deliveries(ingredient_id);

EXPLAIN SELECT ingredient_id, sum(quantity_kg)
FROM supply_deliveries
GROUP BY ingredient_id;

-- Result (The Clicker Pass):
-- GroupAggregate  (cost=0.28..110.19 rows=100 width=36)
--   Group Key: ingredient_id
```

> [!TIP]
> **Streaming Efficiency**: A **GroupAggregate** is the ultimate expression of the Lazy Elephant. It doesn't need to build a massive table in memory; it simply watches the sorted stream pass by, updating a single running state. It is a "streaming" operator that consumes almost no memory.

#### 3. The Execution Stop (Limit)

This is one of Postgres's most effective methods of resource conservation. If the query only requires a subset of the data, the **Limit** node halts the upward flow of tuples once the threshold is met.

```sql
EXPLAIN SELECT * FROM supply_deliveries LIMIT 1;

-- Result:
-- Limit  (cost=0.00..0.02 rows=1 width=44)
--   ->  Seq Scan on supply_deliveries  (cost=0.00..1845.00 rows=10000000)
```

Notice the **Estimated Cost**. Even though the child node (the Sequential Scan) has a total cost of 1,845 points, the **Limit** node knows it only needs the first tuple. Because of the **Volcano Model**, Postgres pulls exactly one record and then stops issuing requests to its child, resulting in a negligible execution cost.

---

| ← Previous | ↑ Table of Contents | Next → |
| :--- | :---: | ---: |
| [[Chapter 3/3.2.2 - Joins\|3.2.2 The Matchmakers (Joins)]] | [[Learn You a Postgres for Great Good\|Home]] | [[Chapter 3/3.3 - The Art of Not Opening Every Box (Sargability)\|3.3 The Art of Not Opening Every Box (Sargability)]] |
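The two ideas from this section—GroupAggregate's one-group-at-a-time streaming and Limit's pull-based early termination—compose naturally, and Python generators make a convenient sketch of that composition. This is a simplified illustration under assumed names (`group_aggregate`, `limit`), not Postgres internals.

```python
# Hypothetical sketch: GroupAggregate over a SORTED stream, plus Limit's
# early termination under the pull-based (Volcano) model.
# (Illustrative only -- not the actual Postgres implementation.)

def group_aggregate(sorted_tuples):
    """Yield (key, sum) per group, keeping only ONE transition state
    in memory. Requires input already sorted by the grouping key."""
    current_key, state = None, None
    for key, value in sorted_tuples:
        if key != current_key:
            if current_key is not None:
                yield current_key, state  # group finished: emit immediately
            current_key, state = key, 0
        state += value                    # the transition function for sum()
    if current_key is not None:
        yield current_key, state          # emit the final group

def limit(child, n):
    """Pull at most n tuples from the child node, then stop asking."""
    for i, row in enumerate(child):
        if i >= n:
            break
        yield row

sorted_deliveries = [(1, 2.5), (1, 4.0), (2, 1.0), (2, 2.0), (3, 0.5)]
# Limit pulls exactly one finished group; later groups are never computed.
print(list(limit(group_aggregate(sorted_deliveries), 1)))  # [(1, 6.5)]
```

Because generators are lazy, `group_aggregate` only does as much work as `limit` pulls from it—the same reason the Limit node in the plan above costs 0.02 instead of 1,845.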