# Chapter 1: Foundations & Data Modeling

## 1.0 Relations & Normalization

![[assets/elephant_cafe_architecture_layout.png|450]]

Postgres is fundamentally meticulous, and it achieves this through rigorous organization. It does not just store "data"; it manages a structured collection of facts about its world. Before we dive into the internal mechanics of Postgres, we must understand why it is defined as a **Relational** database. This rigor rests on two core concepts: the **Relation** and **Normalization**.

### Facts and Relations

In the eyes of the meticulous database engine, every record in your database is a **Fact**.

- "Babu is a Capybara" is a fact.
- "Saffron costs $400 per kg" is a fact.
- "Supplier #5 delivered 50kg of Apples" is a fact.

A collection of related facts that all share the same shape (or schema) is called a **Relation**. In common parlance, we call this a **Table**. A relation is not just a list; it is a mathematical set. Every "fact" (or **Tuple**) within that set must be unique and identifiable.

### The Discipline of Normalization

If we were setting up the Cafe in a simple spreadsheet, we might be tempted to record every detail about an animal in a single, massive row. This is the **Naive Model**.

| id  | name | species  | diet      |
| :-- | :--- | :------- | :-------- |
| 101 | Babu | Capybara | Herbivore |
| 102 | Pip  | Capybara | Herbivore |
| 103 | ...  | ...      | ...       |

Storing the string "Herbivore" ten thousand times is more than just a waste of space—it is a **Truth Hazard**. If we decide to rename "Herbivore" to "Plant-Based," and the word is redundantly copied into every single animal's record, Postgres must perform a massive, exhausting sweep to find and update every copy. If it misses even one, the database has committed the ultimate sin: it now disagrees with itself. In technical terms, we have created an **Update Anomaly**.
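Under the one-table layout above, the rename becomes a mass rewrite. A hedged sketch (the column names follow the naive table and are illustrative only):

```sql
-- Naive model: the fact "Capybaras are Herbivores" is copied
-- into every row, so renaming the category touches them all.
UPDATE animals SET diet = 'Plant-Based' WHERE diet = 'Herbivore';

-- Any copy the predicate misses (a stray spelling, a NULL,
-- trailing whitespace) survives as a contradiction that the
-- database cannot detect on its own.
```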
Postgres avoids this through **Normalization**—the calm, methodical process of ensuring that every fact is stored in exactly **one place**. Instead of copying "Herbivore" into every animal record, the wise database engine stores it once in a central **`species`** ledger. Every animal record in the **`animals`** ledger then carries a tiny, 4-byte pointer—a **Foreign Key**—linking it back to that single source of truth.

#### The Anatomy of the Link

Observe how a simple integer placeholder spares the engine from redundancy:

**The `species` Ledger (Fact Blueprint)**

| id    | name     | diet_type   |
| :---- | :------- | :---------- |
| **1** | Capybara | Herbivore   |
| **2** | Aardvark | Insectivore |

**The `animals` Ledger (Set of Facts)**

| id  | name   | species_id |
| :-- | :----- | :--------- |
| 101 | Babu   | 1          |
| 102 | Pip    | 1          |
| 103 | Arthur | 2          |

When Babu walks in, Postgres uses the **`species_id`** (1) to perform a **Join** against the `species` table, reconstructing his profile in memory. The fact that Babu is a Capybara is not stored twice; it is **linked** once. This is the heart of the **Relational** model: it minimizes redundancy and ensures that there is only ever one version of the truth.

> [!TIP]
> The mathematical machinery behind this model — why a table is a **set of typed tuples**, why SQL can be rearranged freely by the planner — is the subject of **[[Manuscript/01 - Foundations & Data Modeling/1.2 - Relational Model (The Academic Foundation)|1.2 The Relational Model]]**. It is the academic foundation of our Cafe.

---

#### The Blueprint of the Cafe

Before a single order can be processed, Postgres must define the schema of its world using **DDL (Data Definition Language)**. These blueprints — `CREATE TABLE` statements — are the rigid contracts that every piece of data must conform to.

> [!NOTE]
> You can find the full, literal blueprints for the Elephant Cafe in the **[[scripts/init.sql|Architectural Inventory]]**.
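A minimal sketch of what such a blueprint might look like for the species/animals pair — the column names and types here are illustrative assumptions, not the canonical definitions in the Architectural Inventory:

```sql
-- Hypothetical blueprint: each species fact is stored once.
CREATE TABLE species (
    id        integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name      text NOT NULL UNIQUE,
    diet_type text NOT NULL
);

CREATE TABLE animals (
    id         integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    -- The tiny 4-byte pointer back to the single source of truth:
    species_id integer NOT NULL REFERENCES species (id)
);
```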
```mermaid
erDiagram
    species ||--o{ animals : ""
    suppliers ||--o{ supply_deliveries : ""
    ingredients ||--o{ supply_deliveries : ""
    ingredients ||--|| flavors : ""
    animals ||--o{ animal_favorites : ""
    ingredients ||--o{ animal_favorites : ""
    dishes ||--o{ dish_ingredients : ""
    ingredients ||--o{ dish_ingredients : ""
    animals ||--o{ orders : ""
    orders ||--o{ order_items : ""
    dishes ||--o{ order_items : ""
```

---

### The Normalized Tiers of the Cafe

We can divide our Cafe's data model into three distinct logical areas, each serving a specific architectural purpose.

##### 1. The Core Entities (`species` & `animals`)

This is a standard **normalized hierarchy**. Species-level attributes are stored once, and individual patron records point back to them. Changing a species name here propagates to every animal instantly, without requiring a single row migration in the `animals` table.

##### 2. The High-Precision Pantry (`ingredients` & `flavors`)

Here, we store the "Scent Science." Ingredients are linked to high-dimensional **Flavor Vectors** and scent notes. This is where we will exercise **[[Manuscript/03 - Access Paths & Indexing/3.4 - Vector Search (The Scent Tracker)|Vector Indexes]]** and complex metadata linking.

##### 3. The Transactional Hustle (`orders`, `dishes`, & `supply_deliveries`)

These tables capture the movement of data—the **Facts of Action**.

- **Many-to-Many Links**: Tables like `dish_ingredients` and `order_items` use pairs of foreign keys to link two disparate facts together (e.g., "This Order includes this Dish").
- **The Infinite Ledger**: `supply_deliveries` records every shipment. This table will grow to billions of rows, making it our primary candidate for **[[Manuscript/08 - Distributed Scaling & Clouds/8.3 - Table Partitioning (Dividing the Dining Room)|Table Partitioning]]**.
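The many-to-many pattern above is typically realized as a link table whose primary key is the pair of foreign keys. A hypothetical shape (not the canonical `init.sql` definition):

```sql
-- Hypothetical link table: each row records the single fact
-- "this order includes this dish, N times."
CREATE TABLE order_items (
    order_id integer NOT NULL REFERENCES orders (id),
    dish_id  integer NOT NULL REFERENCES dishes (id),
    quantity integer NOT NULL DEFAULT 1,
    PRIMARY KEY (order_id, dish_id)  -- one fact per (order, dish) pair
);
```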
---

In the next section, we will look at **[[Manuscript/01 - Foundations & Data Modeling/1.1 - SQL (The Language of the Cafe)|SQL]]** — the dialect Postgres speaks, and why it is deliberately vague about physical details.

---

## 1.1 The Language of the Cafe (SQL)

![[assets/arch_sql_waiter_pad.png|450]]

When you talk to Postgres, you do not manage the physical storage details manually. You don't tell the database which file to open, what byte offset to seek to, or how to iterate through a specific buffer in memory. Managing these low-level operations directly would be inefficient and difficult to optimize. Instead, you use a declarative interface called **SQL (Structured Query Language)**.

As a **Declarative** language, SQL allows you to describe the *result set* you require—for example, "Bring me all Capybaras that ordered Peanuts"—without specifying the *execution path* required to find them. SQL acts as a high-level abstraction layer—the **Waiter's Pad**. By submitting a declarative request, you grant the **Query Planner** (the Query Optimizer) the freedom to analyze statistics and rearrange the physical execution steps to find the most efficient (lowest-cost) path to your data.

> **Why the Waiter's Pad?**
> Think of it like visiting a restaurant: you tell the waiter **what** you want to eat (a Capybara Salad), not **how** to prepare it (the exact heat of the pan, the order of the ingredients, or which knife to use). You provide the specification; the kitchen manages the execution.
>
> By delegating the "how" to the Query Optimizer (the **Query Planner**), you trust its professional expertise to pick the most efficient preparation method based on what is currently in the pantry. If the kitchen has a pre-chopped container of ingredients ready (an **Index Scan**), the Chef will use it; if they have to search the entire pantry for the right tomato (a **Sequential Scan**), they will do that too.
> You grant Postgres this freedom so it can optimize for performance without you needing to rewrite your order every time the pantry changes.
>
> If you were to specify exactly which index to use or which join algorithm to apply (an **Imperative** approach), Postgres would be unable to adapt its execution to current table statistics.

By using SQL, you abstract away the complexities of **buffer management**, **disk I/O**, and **concurrency control**, delegating the algebraic optimization to the Postgres planner.

### The Three Dialects of the Cafe

The dialect Postgres speaks is divided into three primary functional areas, depending on whether you are defining schema, retrieving data, or modifying state.

### Defining the Schema (DDL)

**Data Definition Language** is used to define the schema—the persistent structure of the database. Before any data can be inserted, the schema definitions (the blueprints) must exist for Postgres to validate it against.

- **`CREATE TABLE`**: "Build a new pantry section for Spices." (Initializes a new relation and its physical storage files).
- **`ALTER TABLE`**: "Add a shelf for Saffron to the Spice section." (Modifies the metadata of an existing relation).
- **`DROP TABLE`**: "Tear down the entire Vegetable section. We are a carnivore cafe now!" (Destroys the relation and unlinks its physical storage).

### Reading Data (DQL)

**Data Query Language** is used for **reading** from the database—querying Postgres about the current state of the system.

- **`SELECT`**: "Retrieve all 'Herbivore' records associated with the 'North Wing'."

```sql
-- Asking for our favorite regular
SELECT * FROM animals WHERE name = 'Babu';
```

- **The Declarative Contract**: When you submit a `SELECT` statement, you are describing the *desired output*, not the *procedural steps* required to fetch it.

Compare these two ways of asking for data:

| The "Procedural" Chef (Imperative) | The "Meticulous" Database (SQL) |
| :--- | :--- |
| **Walk** to the North Wing. | `SELECT * FROM animals` |
| **Look** at every guest. | `WHERE wing = 'North'` |
| **If** they are an Herbivore, **Keep** them. | `AND diet = 'Herbivore';` |

In the procedural version, you must manage memory pointers and iteration logic manually. In the SQL version, you provide a high-level description, and the Postgres cost-based optimizer decides whether to perform a **Sequential Scan** or use an **[[Manuscript/03 - Access Paths & Indexing/3.1 - B-Tree (The Balanced Bookshelf)|Index Scan]]**.

### Writing Data (DML)

**Data Manipulation Language** is used for **writing** to the database—asking Postgres to manipulate the data in its underlying physical storage.

- **`INSERT`**: "Deliver a new crate to the Loading Dock." (Creates a new row/tuple and persists it to a data page).
- **`UPDATE`**: "Cross out the old manifest and write a fresh one elsewhere." (Adheres to the **MVCC** protocol—it marks the old tuple as obsolete and writes a new version to a fresh location).
- **`DELETE`**: "Mark the crate as spoiled and hide it from view." (Marks a tuple as invisible to future transactions, leaving it for eventual cleanup by **Autovacuum**).

SQL serves as the **Logical Interface** for the database. When you execute a query, you are abstracting away the **[[Manuscript/04 - Query Planning & Execution/4.1 - The Query Optimizer's Menu (Query Planning)|Execution Plan]]**. It is a rigorous, mathematical layer that separates your application logic from the physical reality of bytes on disk.

---

## 1.2 The Relational Model

![[assets/arch_relational_model_set.png|450]]

SQL is not the foundation of Postgres. It is a surface dialect — a human-readable layer resting on top of a much older, more rigorous framework called **Relational Algebra**. Understanding the difference between the two changes how you read query plans.
The core insight is simple: if you define a database table not as a spreadsheet, but as a **mathematical set of typed tuples**, you gain a clean set of operations that compose predictably. The engine can then rearrange those operations freely, provided the result is algebraically equivalent. Most of the Query Planner's "magic" is just this rule applied recursively.

---

### What a Relation Actually Is

A **Relation** is a set of tuples sharing a common attribute header. This sounds like a table, but two important constraints fall out of the mathematical definition that distinguish it from a spreadsheet:

**1. No guaranteed ordering.** A set has no inherent sequence. Postgres does not promise that rows come back in the order they were inserted, or in any order at all, unless you explicitly ask for one with `ORDER BY`. This trips up a surprising number of engineers in production.

```sql
-- This returns rows in no guaranteed order.
SELECT * FROM animals;

-- This returns rows in a guaranteed order.
SELECT * FROM animals ORDER BY name;
```

**2. No duplicate rows (by default, unenforced).** A true mathematical set prohibits duplicate elements. Postgres does not enforce this constraint automatically — you can insert two identical rows — but a `UNIQUE` or `PRIMARY KEY` constraint brings a table into formal compliance. The decision not to enforce uniqueness by default was deliberate: enforcing it has a cost, and not every table needs it.

---

### The Five Primitive Operations

Every `SELECT` statement you have ever written decomposes into combinations of five core algebraic operations.
These five are sufficient to express the core of any query over a relational schema:

| Symbol | Operation      | What It Does                                            | SQL Equivalent                                    |
| :----- | :------------- | :------------------------------------------------------ | :------------------------------------------------ |
| `σ`    | **Selection**  | Filter the set — keep only tuples matching a predicate  | `WHERE diet_type = 'Herbivore'`                   |
| `π`    | **Projection** | Trim the header — keep only specified attributes        | `SELECT name, diet_type`                          |
| `⋈`    | **Join**       | Combine two relations on a shared attribute             | `JOIN species ON animals.species_id = species.id` |
| `∪`    | **Union**      | Merge two compatible sets, removing duplicates          | `UNION`                                           |
| `−`    | **Difference** | Return tuples in the first set absent from the second   | `EXCEPT`                                          |

Applied to the Elephant Cafe:

```sql
-- A Selection (σ) followed by a Projection (π):
-- "Give me only the names of Herbivores."
SELECT name                                        -- π (project onto 'name')
FROM animals
JOIN species ON animals.species_id = species.id    -- ⋈ (join)
WHERE species.diet_type = 'Herbivore';             -- σ (select)
```

Under the hood, Postgres sees this as a composition: a `σ` and a `⋈` applied to the two base relations `animals` and `species`, then projected with `π(name)`.

---

### Why Composability Matters: The Planner's Freedom

Consider a world where this property did not exist. Suppose filtering a table returned something that was *not* a table — maybe a list, or a stream, or some other structure. In that world, you could not take the result of a `WHERE` and feed it into a `JOIN`, because the types would not match. Every operation would need to know about every other operation's output format. The planner could not rearrange anything, because rearranging would break the pipeline. You would be stuck executing queries in exactly the order you wrote them.

This is, in fact, how most imperative data processing works. And it is precisely the limitation that relational algebra was designed to eliminate.
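Because every operation returns a relation, the output of one query can stand anywhere a base table could. A sketch against the Cafe schema, nesting a filtered relation inside a join:

```sql
-- The inner SELECT is a relation in its own right, so it can
-- serve as a join operand exactly like a base table would.
SELECT a.name
FROM animals a
JOIN (SELECT id FROM species WHERE diet_type = 'Herbivore') AS herb
  ON a.species_id = herb.id;
```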
### Closure: The One Rule That Makes Everything Else Possible

The critical property is **closure**: every relational operation takes a Relation as input and returns a Relation as output. Not a different kind of thing. The same kind of thing — a set of typed tuples with a header.

In programming or mathematics, a set of operations is **closed** under a type if applying them to values of that type always results in a value of the same type. Think of it as a function that promises never to leave the building. If you add two integers, the result is still an integer; you haven't accidentally produced a string or a piece of fruit.

Because every relational operation is closed under the "Relation" type, they are **freely composable**. You are not building a pipeline of different materials that might leak or break; you are building a circuit where every component speaks the exact same voltage. The output of any operation is always a valid input to any other.

This uniformity holds across the entire algebraic catalog:

- A `Selection` (`σ`) returns a Relation.
- A `Projection` (`π`) returns a Relation.
- A `Join` (`⋈`) returns a Relation.
- A `Union` (`∪`) returns a Relation.

### What the Planner Does With This Freedom

Because the algebra is closed, the **Query Planner** (covered in detail in **[[Manuscript/04 - Query Planning & Execution/4.1 - The Query Optimizer's Menu (Query Planning)|Chapter 4]]**) can apply algebraic identities to rewrite your query into a cheaper equivalent. The most common rewrite is **predicate pushdown**: moving a filter earlier in the pipeline so it reduces the data volume before an expensive operation.

Consider this query:

```sql
-- What you wrote:
SELECT *
FROM orders
JOIN animals ON orders.animal_id = animals.id
WHERE animals.name = 'Babu';
```

As written, this says: "Join the entire `orders` table to the entire `animals` table, then filter the result to just Babu."
If `orders` has 10 million rows and `animals` has 50,000 rows, that join produces an enormous intermediate result — most of which is immediately thrown away by the `WHERE`. The planner sees the algebraic equivalence:

```
σ(⋈(orders, animals)) = ⋈(orders, σ(animals))
```

The filter `name = 'Babu'` depends only on `animals`, so the planner can legally push it *before* the join. Now Postgres filters `animals` down to a single row first, then joins that one row against `orders`. The result is mathematically identical. The cost is dramatically lower.

```sql
-- What the planner actually executes:
-- 1. Filter animals to just 'Babu' (1 row)
-- 2. Join that 1 row against orders
-- The result is identical. The work is not.
```

This is not a special optimization. It is a direct consequence of closure. Every rewrite the planner applies — predicate pushdown, join reordering, projection pruning — is just an algebraic identity that happens to be cheaper to execute.

> [!NOTE]
> **You are not writing instructions. You are writing a specification.**
> SQL is declarative precisely *because* relational algebra is closed. Postgres is free to completely ignore the order in which you wrote your clauses and substitute any algebraically equivalent plan. Whether it uses a Sequential Scan, an Index Scan, or a Hash Join is determined by cost estimates, not by your syntax.

This is the contract the relational model guarantees. Postgres honors it. But for the Query Planner to rewrite your world with such confidence, it must first be certain that the truth is stored in only one place.

---

### A Note on Normalization: From Context to Structural Truth

The relational model defines a hierarchy of **Normal Forms**—a set of rules for moving information from the user's mind into the database's physical structure.

In a de-normalized system, you rely on **Outside Context**. You have to "know" that all Capybaras share a diet, and you have to "remember" to update every animal record if that diet changes.
In a normalized system, that knowledge is baked into the **Shape** of the data. The truth no longer requires context; it **follows from** the relationship itself.

Think of it as the weight of your luggage. "Context" is heavy; it requires carrying redundant facts in every physical **[[Manuscript/02 - Physical Storage & MVCC/2.2 - Tuple (The Physical Suitcase)|Suitcase]]**. "Structural Truth" is light; it uses a single, 4-byte link to ensure that every suitcase points to a single source of truth.

### 1NF: The Rule of Atomic Truth

In **First Normal Form (1NF)**, every cell must contain exactly one "Atomic" fact. If you pack a comma-separated list into a column, you have created a **Tangled Suitcase**.

A 1NF violation is purely structural. You don't need to know what the data *means* to see that it's a mess. The rule **follows from** the mathematical definition of a set: every element must be distinct and well-defined.

```mermaid
graph TD
    subgraph "Violation: The Tangled Suitcase"
        V1[Animal: Babu / Diet: 'Herbivore, Omnivore']
    end
    subgraph "1NF: Atomic Facts"
        A1[Animal: Babu / Diet: 'Herbivore']
        A2[Animal: Babu / Diet: 'Omnivore']
    end
    V1 -- "Unpacking" --> A1
    V1 -- "Unpacking" --> A2
```

### 2NF: The Anchor Rule

**Second Normal Form (2NF)** moves us from structure to logic. It demands full **Functional Dependency**: every non-key attribute must depend on the *whole* key, not just part of it. To spot a 2NF violation, you need a sliver of "Outside Context"—you need to know that a Supplier belongs to an Item, not to an individual Order.

By normalizing to 2NF, we "anchor" that fact to the correct entity. Now the supplier's contact info **follows from** the Item ID, rather than being a redundant "Shadow Fact" that must be updated in thousands of different orders.
```mermaid
graph TD
    subgraph "Violation: The Partial Fact"
        K["Key: Order #101 + Item 'Apple'"]
        D1[Price: $3.00]
        S1[Supplier: 'Old McDonald']
        K --> D1
        K --> S1
        Note["Supplier depends only on Item, not Order!"]
    end
    subgraph "2NF: Anchored Truth"
        K2["Key: Item 'Apple'"]
        S2[Supplier: 'Old McDonald']
        K2 --> S2
    end
```

### 3NF: The Source of Truth

**Third Normal Form (3NF)** eliminates the **Transitive Chain**. This is where we remove the final need for outside context. If we store "Herbivore" inside the animal's suitcase, we have to carry that context forever. If we move it to a `species` table and link to it, the diet of an animal **follows from** its species membership. The schema itself now "knows" the business rule, and Postgres enforces it for us.

```mermaid
graph LR
    subgraph "Violation: The Transitive Chain"
        A[Animal: Babu] -- "Points to" --> S[Species: Capybara]
        A -- "Shadow Fact" --> D[Diet: Herbivore]
        Note["Diet depends on Species, not Babu!"]
    end
    subgraph "3NF: The Source"
        S2[Species: Capybara] -- "Source of Truth" --> D2[Diet: Herbivore]
        A2[Animal: Babu] -- "Link" --> S2
    end
```

| Normal Form | Rule                                                                 | Example Violation                                            |
| :---------- | :------------------------------------------------------------------- | :----------------------------------------------------------- |
| **1NF**     | Every attribute is atomic — no nested arrays or comma lists          | `diet_type = "Herbivore, Omnivore"` (two values in one cell) |
| **2NF**     | No partial dependency on a composite key                             | Storing `item_supplier` in an `order_items` table            |
| **3NF**     | No transitive dependency — non-key attributes depend only on the key | Storing `diet_category` in the `animals` table               |

The Cafe enforces 3NF by keeping species attributes in `species` and pointing to them via foreign key. This is not pedantry — it is what allows a single `UPDATE species SET diet_type = 'Plant-Based' WHERE id = 1` to propagate immediately across every related animal record without touching the `animals` table at all.
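That propagation can be sketched end to end — a hypothetical session against the normalized schema (keying the update by name rather than id, for readability):

```sql
-- One write to the single source of truth...
UPDATE species SET diet_type = 'Plant-Based' WHERE name = 'Capybara';

-- ...and every linked animal reflects it on the next read,
-- because the diet is joined in at query time, never copied:
SELECT a.name, s.diet_type
FROM animals a
JOIN species s ON a.species_id = s.id
WHERE a.name = 'Babu';
```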
> [!TIP]
> Denormalization (intentionally breaking these rules for performance) is a real and sometimes correct choice. We will encounter it in **[[Manuscript/06 - Resource Management & Processes/6.2 - The Private Desk (Work Mem)|Chapter 6]]** when discussing materialized aggregations, and in **[[Manuscript/08 - Distributed Scaling & Clouds/8.3 - Table Partitioning (Dividing the Dining Room)|Chapter 8]]** when discussing partition strategies.