MergeJoin - Aesa's Notes

> [!NOTE] Merge Join > <table> > <tr> > <td width="25%"><img src="assets/ex_mergejoin.png"></td> > <td>A join strategy used when both input sets are already sorted by the join keys. The engine steps through both relations simultaneously, like merging two sorted lists. Efficient for large joins where sorting is already performed or required for the final output.</td> > </tr> > </table> > > ```sql > -- Joining two tables using pre-sorted indexes > SET enable_hashjoin = off; > SET enable_nestloop = off; > > EXPLAIN (ANALYZE, COSTS, BUFFERS, VERBOSE) > SELECT * FROM animals a > JOIN species s ON a.species_id = s.id > ORDER BY a.species_id; > ``` > > ```text > Merge Join (cost=1.42..1224.87 rows=20000 width=42) (actual time=1.010..3.539 rows=20000 loops=1) > Output: a.id, a.name, a.species_id, a.created_at, s.id, s.name, s.diet_type > Inner Unique: true > Merge Cond: (a.species_id = s.id) > Buffers: shared hit=763 > -> Index Scan using idx_animals_species_id on public.animals a (...) > -> Index Scan using species_pkey on public.species s (...) > ``` > > <table> > <tr> > <td rowspan="5" width="25%"><img src="assets/ex_merge_semi_join.svg"></td> > <td><b>Performance</b></td><td>High performance for pre-sorted inputs or when indexes cover the join keys.</td> > </tr> > <tr><td><b>Factors</b></td><td>Size of input datasets, existence of supporting indexes, and CPU costs for sorting.</td></tr> > <tr><td><b>Cost</b></td><td><code>sort cost + cpu_operator_cost * rows joined</code></td></tr> > <tr><td><b>Operates on</b></td><td><a href="Structures/Result Set">Result Set</a></td></tr> > <tr><td><b>Workloads</b></td><td><a href="Workloads/IO/DataFile/DataFileRead">IO: DataFileRead</a>, <a href="Workloads/IO/BufFile/BufFileRead">IO: BufFileRead</a>, <a href="Workloads/IO/BufFile/BufFileWrite">IO: BufFileWrite</a>, <a href="Workloads/LWLock/Buffers/BufferContent">LWLock: BufferContent</a></td></tr> > <tr><td colspan="3"><b>Description</b>: Joins tables using merge join algorithm.</td></tr> > </table>