# Chapter 1: Foundations & Data Modeling

## 1.0 Relations & Normalization

![[assets/elephant_cafe_architecture_layout.png|450]]

Postgres is fundamentally meticulous, and it achieves this through rigorous organization. It does not just store "data"; it manages a structured collection of facts about its world. Before we dive into the internal mechanics of Postgres, we must understand why it is defined as a **Relational** database. This rigor rests on two core concepts: the **Relation** and **Normalization**.

### Facts and Relations

In the eyes of the meticulous database engine, every record in your database is a **Fact**.

- "Babu is a Capybara" is a fact.
- "Saffron costs $400 per kg" is a fact.
- "Supplier #5 delivered 50kg of Apples" is a fact.

A collection of related facts that all share the same shape (or schema) is called a **Relation**. In common parlance, we call this a **Table**. A relation is not just a list; it is a mathematical set. Every "fact" (or **Tuple**) within that set must be unique and identifiable.

### The Discipline of Normalization

If we were setting up the Cafe in a simple spreadsheet, we might be tempted to record every detail about an animal in a single, massive row. This is the **Naive Model**.

| id  | name | species  | diet      |
| :-- | :--- | :------- | :-------- |
| 101 | Babu | Capybara | Herbivore |
| 102 | Pip  | Capybara | Herbivore |
| 103 | ...  | ...      | ...       |

Storing the string "Herbivore" ten thousand times is more than just a waste of space—it is a **Truth Hazard**. If we decide to rename "Herbivore" to "Plant-Based," and the word is redundantly copied into every single animal's record, Postgres must perform a massive, exhausting sweep to find and update every copy. If it misses even one, the database has committed the ultimate sin: it now disagrees with itself. In technical terms, we have created an **Update Anomaly**.
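Under the one-table layout above, the rename becomes a mass rewrite. A hedged sketch (the column names follow the naive table and are illustrative only):

```sql
-- Naive model: the fact "Capybaras are Herbivores" is copied
-- into every row, so renaming the category touches them all.
UPDATE animals SET diet = 'Plant-Based' WHERE diet = 'Herbivore';

-- Any copy the predicate misses (a stray spelling, a NULL,
-- trailing whitespace) survives as a contradiction that the
-- database cannot detect on its own.
```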
Postgres avoids this through **Normalization**—the calm, methodical process of ensuring that every fact is stored in exactly **one place**. Instead of copying "Herbivore" into every animal record, the wise database engine stores it once in a central **`species`** ledger. Every animal record in the **`animals`** ledger then carries a tiny, 4-byte pointer—a **Foreign Key**—linking it back to that single source of truth.

#### The Anatomy of the Link

Observe how a simple integer placeholder spares the engine from redundancy:

**The `species` Ledger (Fact Blueprint)**

| id    | name     | diet_type   |
| :---- | :------- | :---------- |
| **1** | Capybara | Herbivore   |
| **2** | Aardvark | Insectivore |

**The `animals` Ledger (Set of Facts)**

| id  | name   | species_id |
| :-- | :----- | :--------- |
| 101 | Babu   | 1          |
| 102 | Pip    | 1          |
| 103 | Arthur | 2          |

When Babu walks in, Postgres uses the **`species_id`** (1) to perform a **Join** against the `species` table, reconstructing his profile in memory. The fact that Babu is a Capybara is not stored twice; it is **linked** once. This is the heart of the **Relational** model: it minimizes redundancy and ensures that there is only ever one version of the truth.

> [!TIP]
> The mathematical machinery behind this model — why a table is a **set of typed tuples**, why SQL can be rearranged freely by the planner — is the subject of **[[Manuscript/01 - Foundations & Data Modeling/1.2 - Relational Model (The Academic Foundation)|1.2 The Relational Model]]**. It is the academic foundation of our Cafe.

---

#### The Blueprint of the Cafe

Before a single order can be processed, Postgres must define the schema of its world using **DDL (Data Definition Language)**. These blueprints — `CREATE TABLE` statements — are the rigid contracts that every piece of data must conform to.

> [!NOTE]
> You can find the full, literal blueprints for the Elephant Cafe in the **[[scripts/init.sql|Architectural Inventory]]**.
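A minimal sketch of what such a blueprint might look like for the species/animals pair — the column names and types here are illustrative assumptions, not the canonical definitions in the Architectural Inventory:

```sql
-- Hypothetical blueprint: each species fact is stored once.
CREATE TABLE species (
    id        integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name      text NOT NULL UNIQUE,
    diet_type text NOT NULL
);

CREATE TABLE animals (
    id         integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    -- The tiny 4-byte pointer back to the single source of truth:
    species_id integer NOT NULL REFERENCES species (id)
);
```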
```mermaid
erDiagram
    species ||--o{ animals : ""
    suppliers ||--o{ supply_deliveries : ""
    ingredients ||--o{ supply_deliveries : ""
    ingredients ||--|| flavors : ""
    animals ||--o{ animal_favorites : ""
    ingredients ||--o{ animal_favorites : ""
    dishes ||--o{ dish_ingredients : ""
    ingredients ||--o{ dish_ingredients : ""
    animals ||--o{ orders : ""
    orders ||--o{ order_items : ""
    dishes ||--o{ order_items : ""
```

---

### The Normalized Tiers of the Cafe

We can divide our Cafe's data model into three distinct logical areas, each serving a specific architectural purpose.

##### 1. The Core Entities (`species` & `animals`)

This is a standard **normalized hierarchy**. Species-level attributes are stored once, and individual patron records point back to them. Changing a species name here propagates to every animal instantly, without requiring a single row migration in the `animals` table.

##### 2. The High-Precision Pantry (`ingredients` & `flavors`)

Here, we store the "Scent Science." Ingredients are linked to high-dimensional **Flavor Vectors** and scent notes. This is where we will exercise **[[Manuscript/03 - Access Paths & Indexing/3.4 - Vector Search (The Scent Tracker)|Vector Indexes]]** and complex metadata linking.

##### 3. The Transactional Hustle (`orders`, `dishes`, & `supply_deliveries`)

These tables capture the movement of data—the **Facts of Action**.

- **Many-to-Many Links**: Tables like `dish_ingredients` and `order_items` use pairs of foreign keys to link two disparate facts together (e.g., "This Order includes this Dish").
- **The Infinite Ledger**: `supply_deliveries` records every shipment. This table will grow to billions of rows, making it our primary candidate for **[[Manuscript/08 - Distributed Scaling & Clouds/8.3 - Table Partitioning (Dividing the Dining Room)|Table Partitioning]]**.
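The many-to-many pattern above is typically realized as a link table whose primary key is the pair of foreign keys. A hypothetical shape (not the canonical `init.sql` definition):

```sql
-- Hypothetical link table: each row records the single fact
-- "this order includes this dish, N times."
CREATE TABLE order_items (
    order_id integer NOT NULL REFERENCES orders (id),
    dish_id  integer NOT NULL REFERENCES dishes (id),
    quantity integer NOT NULL DEFAULT 1,
    PRIMARY KEY (order_id, dish_id)  -- one fact per (order, dish) pair
);
```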
---

In the next section, we will look at **[[Manuscript/01 - Foundations & Data Modeling/1.1 - SQL (The Language of the Cafe)|SQL]]** — the dialect Postgres speaks, and why it is deliberately vague about physical details.

---

## 1.1 The Language of the Cafe (SQL)

![[assets/arch_sql_waiter_pad.png|450]]

When you talk to Postgres, you do not manage the physical storage details manually. You don't tell the database which file to open, what byte offset to seek to, or how to iterate through a specific buffer in memory. Managing these low-level operations directly would be inefficient and difficult to optimize. Instead, you use a declarative interface called **SQL (Structured Query Language)**.

As a **Declarative** language, SQL allows you to describe the *result set* you require—for example, "Bring me all Capybaras that ordered Peanuts"—without specifying the *execution path* required to find them. SQL acts as a high-level abstraction layer—the **Waiter's Pad**. By submitting a declarative request, you grant the **Query Planner** (the Query Optimizer) the freedom to analyze statistics and rearrange the physical execution steps to find the most efficient (lowest-cost) path to your data.

> **Why the Waiter's Pad?**
> Think of it like visiting a restaurant: you tell the waiter **what** you want to eat (a Capybara Salad), not **how** to prepare it (the exact heat of the pan, the order of the ingredients, or which knife to use). You provide the specification; the kitchen manages the execution.
>
> By delegating the "how" to the Query Optimizer (the **Query Planner**), you trust its professional expertise to pick the most efficient preparation method based on what is currently in the pantry. If the kitchen has a pre-chopped container of ingredients ready (an **Index Scan**), the Chef will use it; if they have to search the entire pantry for the right tomato (a **Sequential Scan**), they will do that too.
> You grant Postgres this freedom so it can optimize for performance without you needing to rewrite your order every time the pantry changes.
>
> If you were to specify exactly which index to use or which join algorithm to apply (an **Imperative** approach), Postgres would be unable to adapt its execution to current table statistics.

By using SQL, you abstract away the complexities of **buffer management**, **disk I/O**, and **concurrency control**, delegating the algebraic optimization to the Postgres planner.

### The Three Dialects of the Cafe

The dialect Postgres speaks is divided into three primary functional areas, depending on whether you are defining schema, retrieving data, or modifying state.

### Defining the Schema (DDL)

**Data Definition Language** is used to define the schema—the persistent structure of the database. Before any data can be inserted, the schema definitions (the blueprints) must exist for Postgres to validate it against.

- **`CREATE TABLE`**: "Build a new pantry section for Spices." (Initializes a new relation and its physical storage files).
- **`ALTER TABLE`**: "Add a shelf for Saffron to the Spice section." (Modifies the metadata of an existing relation).
- **`DROP TABLE`**: "Tear down the entire Vegetable section. We are a carnivore cafe now!" (Destroys the relation and unlinks its physical storage).

### Reading Data (DQL)

**Data Query Language** is used for **reading** from the database—querying Postgres about the current state of the system.

- **`SELECT`**: "Retrieve all 'Herbivore' records associated with the 'North Wing'."

```sql
-- Asking for our favorite regular
SELECT * FROM animals WHERE name = 'Babu';
```

- **The Declarative Contract**: When you submit a `SELECT` statement, you are describing the *desired output*, not the *procedural steps* required to fetch it.

Compare these two ways of asking for data:

| The "Procedural" Chef (Imperative) | The "Meticulous" Database (SQL) |
| :--- | :--- |
| **Walk** to the North Wing. | `SELECT * FROM animals` |
| **Look** at every guest. | `WHERE wing = 'North'` |
| **If** they are an Herbivore, **Keep** them. | `AND diet = 'Herbivore';` |

In the procedural version, you must manage memory pointers and iteration logic manually. In the SQL version, you provide a high-level description, and the Postgres cost-based optimizer decides whether to perform a **Sequential Scan** or use an **[[Manuscript/03 - Access Paths & Indexing/3.1 - B-Tree (The Balanced Bookshelf)|Index Scan]]**.

### Writing Data (DML)

**Data Manipulation Language** is used for **writing** to the database—asking Postgres to manipulate the data in its underlying physical storage.

- **`INSERT`**: "Deliver a new crate to the Loading Dock." (Creates a new row/tuple and persists it to a data page).
- **`UPDATE`**: "Cross out the old manifest and write a fresh one elsewhere." (Adheres to the **MVCC** protocol—it marks the old tuple as obsolete and writes a new version to a fresh location).
- **`DELETE`**: "Mark the crate as spoiled and hide it from view." (Marks a tuple as invisible to future transactions, leaving it for eventual cleanup by **Autovacuum**).

SQL serves as the **Logical Interface** for the database. When you execute a query, you are abstracting away the **[[Manuscript/04 - Query Planning & Execution/4.1 - The Query Optimizer's Menu (Query Planning)|Execution Plan]]**. It is a rigorous, mathematical layer that separates your application logic from the physical reality of bytes on disk.

---

## 1.2 The Relational Model

![[assets/arch_relational_model_set.png|450]]

SQL is not the foundation of Postgres. It is a surface dialect — a human-readable layer resting on top of a much older, more rigorous framework called **Relational Algebra**. Understanding the difference between the two changes how you read query plans.
The core insight is simple: if you define a database table not as a spreadsheet, but as a **mathematical set of typed tuples**, you gain a clean set of operations that compose predictably. The engine can then rearrange those operations freely, provided the result is algebraically equivalent. Most of the Query Planner's "magic" is just this rule applied recursively.

---

### What a Relation Actually Is

A **Relation** is a set of tuples sharing a common attribute header. This sounds like a table, but two important constraints fall out of the mathematical definition that distinguish it from a spreadsheet:

**1. No guaranteed ordering.** A set has no inherent sequence. Postgres does not promise that rows come back in the order they were inserted, or in any order at all, unless you explicitly ask for one with `ORDER BY`. This trips up a surprising number of engineers in production.

```sql
-- This returns rows in no guaranteed order.
SELECT * FROM animals;

-- This returns rows in a guaranteed order.
SELECT * FROM animals ORDER BY name;
```

**2. No duplicate rows (by default, unenforced).** A true mathematical set prohibits duplicate elements. Postgres does not enforce this constraint automatically — you can insert two identical rows — but a `UNIQUE` or `PRIMARY KEY` constraint brings a table into formal compliance. The decision not to enforce uniqueness by default was deliberate: enforcing it has a cost, and not every table needs it.

---

### The Five Primitive Operations

Every `SELECT` statement you have ever written decomposes into combinations of five core algebraic operations.
These five are sufficient to express the core of any query over a relational schema:

| Symbol | Operation      | What It Does                                            | SQL Equivalent                                    |
| :----- | :------------- | :------------------------------------------------------ | :------------------------------------------------ |
| `σ`    | **Selection**  | Filter the set — keep only tuples matching a predicate  | `WHERE diet_type = 'Herbivore'`                   |
| `π`    | **Projection** | Trim the header — keep only specified attributes        | `SELECT name, diet_type`                          |
| `⋈`    | **Join**       | Combine two relations on a shared attribute             | `JOIN species ON animals.species_id = species.id` |
| `∪`    | **Union**      | Merge two compatible sets, removing duplicates          | `UNION`                                           |
| `−`    | **Difference** | Return tuples in the first set absent from the second   | `EXCEPT`                                          |

Applied to the Elephant Cafe:

```sql
-- A Selection (σ) followed by a Projection (π):
-- "Give me only the names of Herbivores."
SELECT name                                        -- π (project onto 'name')
FROM animals
JOIN species ON animals.species_id = species.id    -- ⋈ (join)
WHERE species.diet_type = 'Herbivore';             -- σ (select)
```

Under the hood, Postgres sees this as a composition: a `σ` and a `⋈` applied to the two base relations `animals` and `species`, then projected with `π(name)`.

---

### Why Composability Matters: The Planner's Freedom

Consider a world where this property did not exist. Suppose filtering a table returned something that was *not* a table — maybe a list, or a stream, or some other structure. In that world, you could not take the result of a `WHERE` and feed it into a `JOIN`, because the types would not match. Every operation would need to know about every other operation's output format. The planner could not rearrange anything, because rearranging would break the pipeline. You would be stuck executing queries in exactly the order you wrote them.

This is, in fact, how most imperative data processing works. And it is precisely the limitation that relational algebra was designed to eliminate.
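Because every operation returns a relation, the output of one query can stand anywhere a base table could. A sketch against the Cafe schema, nesting a filtered relation inside a join:

```sql
-- The inner SELECT is a relation in its own right, so it can
-- serve as a join operand exactly like a base table would.
SELECT a.name
FROM animals a
JOIN (SELECT id FROM species WHERE diet_type = 'Herbivore') AS herb
  ON a.species_id = herb.id;
```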
### Closure: The One Rule That Makes Everything Else Possible

The critical property is **closure**: every relational operation takes a Relation as input and returns a Relation as output. Not a different kind of thing. The same kind of thing — a set of typed tuples with a header.

In programming or mathematics, a set of operations is **closed** under a type if applying them to values of that type always results in a value of the same type. Think of it as a function that promises never to leave the building. If you add two integers, the result is still an integer; you haven't accidentally produced a string or a piece of fruit.

Because every relational operation is closed under the "Relation" type, they are **freely composable**. You are not building a pipeline of different materials that might leak or break; you are building a circuit where every component speaks the exact same voltage. The output of any operation is always a valid input to any other.

This uniformity holds across the entire algebraic catalog:

- A `Selection` (`σ`) returns a Relation.
- A `Projection` (`π`) returns a Relation.
- A `Join` (`⋈`) returns a Relation.
- A `Union` (`∪`) returns a Relation.

### What the Planner Does With This Freedom

Because the algebra is closed, the **Query Planner** (covered in detail in **[[Manuscript/04 - Query Planning & Execution/4.1 - The Query Optimizer's Menu (Query Planning)|Chapter 4]]**) can apply algebraic identities to rewrite your query into a cheaper equivalent. The most common rewrite is **predicate pushdown**: moving a filter earlier in the pipeline so it reduces the data volume before an expensive operation.

Consider this query:

```sql
-- What you wrote:
SELECT *
FROM orders
JOIN animals ON orders.animal_id = animals.id
WHERE animals.name = 'Babu';
```

As written, this says: "Join the entire `orders` table to the entire `animals` table, then filter the result to just Babu."
If `orders` has 10 million rows and `animals` has 50,000 rows, that join produces an enormous intermediate result — most of which is immediately thrown away by the `WHERE`. The planner sees the algebraic equivalence:

```
σ(⋈(orders, animals)) = ⋈(orders, σ(animals))
```

The filter `name = 'Babu'` depends only on `animals`, so the planner can legally push it *before* the join. Now Postgres filters `animals` down to a single row first, then joins that one row against `orders`. The result is mathematically identical. The cost is dramatically lower.

```sql
-- What the planner actually executes:
-- 1. Filter animals to just 'Babu' (1 row)
-- 2. Join that 1 row against orders
-- The result is identical. The work is not.
```

This is not a special optimization. It is a direct consequence of closure. Every rewrite the planner applies — predicate pushdown, join reordering, projection pruning — is just an algebraic identity that happens to be cheaper to execute.

> [!NOTE]
> **You are not writing instructions. You are writing a specification.**
> SQL is declarative precisely *because* relational algebra is closed. Postgres is free to completely ignore the order in which you wrote your clauses and substitute any algebraically equivalent plan. Whether it uses a Sequential Scan, an Index Scan, or a Hash Join is determined by cost estimates, not by your syntax.

This is the contract the relational model guarantees. Postgres honors it. But for the Query Planner to rewrite your world with such confidence, it must first be certain that the truth is stored in only one place.

---

### A Note on Normalization: From Context to Structural Truth

The relational model defines a hierarchy of **Normal Forms**—a set of rules for moving information from the user's mind into the database's physical structure.

In a de-normalized system, you rely on **Outside Context**. You have to "know" that all Capybaras share a diet, and you have to "remember" to update every animal record if that diet changes.
In a normalized system, that knowledge is baked into the **Shape** of the data. The truth no longer requires context; it **follows from** the relationship itself.

Think of it as the weight of your luggage. "Context" is heavy; it requires carrying redundant facts in every physical **[[Manuscript/02 - Physical Storage & MVCC/2.2 - Tuple (The Physical Suitcase)|Suitcase]]**. "Structural Truth" is light; it uses a single, 4-byte link to ensure that every suitcase points to a single source of truth.

### 1NF: The Rule of Atomic Truth

In **First Normal Form (1NF)**, every cell must contain exactly one "Atomic" fact. If you pack a comma-separated list into a column, you have created a **Tangled Suitcase**.

A 1NF violation is purely structural. You don't need to know what the data *means* to see that it's a mess. The rule **follows from** the mathematical definition of a set: every element must be distinct and well-defined.

```mermaid
graph TD
    subgraph "Violation: The Tangled Suitcase"
        V1[Animal: Babu / Diet: 'Herbivore, Omnivore']
    end
    subgraph "1NF: Atomic Facts"
        A1[Animal: Babu / Diet: 'Herbivore']
        A2[Animal: Babu / Diet: 'Omnivore']
    end
    V1 -- "Unpacking" --> A1
    V1 -- "Unpacking" --> A2
```

### 2NF: The Anchor Rule

**Second Normal Form (2NF)** moves us from structure to logic. It demands full **Functional Dependency**: every non-key attribute must depend on the *whole* key, not just part of it. To spot a 2NF violation, you need a sliver of "Outside Context"—you need to know that a Supplier belongs to an Item, not to an individual Order.

By normalizing to 2NF, we "anchor" that fact to the correct entity. Now the supplier's contact info **follows from** the Item ID, rather than being a redundant "Shadow Fact" that must be updated in thousands of different orders.
```mermaid
graph TD
    subgraph "Violation: The Partial Fact"
        K["Key: Order #101 + Item 'Apple'"]
        D1[Price: $3.00]
        S1[Supplier: 'Old McDonald']
        K --> D1
        K --> S1
        Note["Supplier depends only on Item, not Order!"]
    end
    subgraph "2NF: Anchored Truth"
        K2["Key: Item 'Apple'"]
        S2[Supplier: 'Old McDonald']
        K2 --> S2
    end
```

### 3NF: The Source of Truth

**Third Normal Form (3NF)** eliminates the **Transitive Chain**. This is where we remove the final need for outside context. If we store "Herbivore" inside the animal's suitcase, we have to carry that context forever. If we move it to a `species` table and link to it, the diet of an animal **follows from** its species membership. The schema itself now "knows" the business rule, and Postgres enforces it for us.

```mermaid
graph LR
    subgraph "Violation: The Transitive Chain"
        A[Animal: Babu] -- "Points to" --> S[Species: Capybara]
        A -- "Shadow Fact" --> D[Diet: Herbivore]
        Note["Diet depends on Species, not Babu!"]
    end
    subgraph "3NF: The Source"
        S2[Species: Capybara] -- "Source of Truth" --> D2[Diet: Herbivore]
        A2[Animal: Babu] -- "Link" --> S2
    end
```

| Normal Form | Rule                                                                 | Example Violation                                            |
| :---------- | :------------------------------------------------------------------- | :----------------------------------------------------------- |
| **1NF**     | Every attribute is atomic — no nested arrays or comma lists          | `diet_type = "Herbivore, Omnivore"` (two values in one cell) |
| **2NF**     | No partial dependency on a composite key                             | Storing `item_supplier` in an `order_items` table            |
| **3NF**     | No transitive dependency — non-key attributes depend only on the key | Storing `diet_category` in the `animals` table               |

The Cafe enforces 3NF by keeping species attributes in `species` and pointing to them via foreign key. This is not pedantry — it is what allows a single `UPDATE species SET diet_type = 'Plant-Based' WHERE id = 1` to propagate immediately across every related animal record without touching the `animals` table at all.
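That propagation can be sketched end to end — a hypothetical session against the normalized schema (keying the update by name rather than id, for readability):

```sql
-- One write to the single source of truth...
UPDATE species SET diet_type = 'Plant-Based' WHERE name = 'Capybara';

-- ...and every linked animal reflects it on the next read,
-- because the diet is joined in at query time, never copied:
SELECT a.name, s.diet_type
FROM animals a
JOIN species s ON a.species_id = s.id
WHERE a.name = 'Babu';
```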
> [!TIP]
> Denormalization (intentionally breaking these rules for performance) is a real and sometimes correct choice. We will encounter it in **[[Manuscript/06 - Resource Management & Processes/6.2 - The Private Desk (Work Mem)|Chapter 6]]** when discussing materialized aggregations, and in **[[Manuscript/08 - Distributed Scaling & Clouds/8.3 - Table Partitioning (Dividing the Dining Room)|Chapter 8]]** when discussing partition strategies.