![[Tensorflow Playground Decomposition.png]]

A neural network hierarchically decomposes complex input data into related features, then recomposes them into meaningful abstractions. Each layer maps its inputs to activations for certain features, which can reduce the complexity of the input by projecting it into a higher-dimensional feature space.

## Multi Head Attention

Let's break down how Multi-Head Attention could help interpret the sentence: "A cat wanders into a room and kneels down to drink from a bowl of milk."

### Setting the Stage

Suppose our model has 3 attention heads. Each head looks at the same sentence but focuses on different relationships between the words.

1. **Head 1** might specialize in capturing subject-verb-object relationships.
2. **Head 2** could look for contextual or semantic meaning, such as understanding that "room" and "bowl" are places or containers.
3. **Head 3** might focus on the actions, identifying sequential or cause-and-effect relationships.

### Computing Attention

For simplicity, let's focus on the word "drink" and see how each head attends to the rest of the sentence (a minimal code sketch of this computation appears at the end of this section).

1. **Head 1 (Subject-Verb-Object)** - The query for "drink" yields high attention scores for "cat" (subject) and "milk" (object), as they are directly related to the action of drinking.
2. **Head 2 (Context/Semantics)** - This head might assign higher weights to "room" and "bowl," understanding that these are locations where the action takes place.
3. **Head 3 (Actions/Sequences)** - This head may link "wanders," "kneels down," and "drink" as a sequence of actions, recognizing that "kneels down" likely precedes "drink."

### Aggregating Outputs

After each head computes its weighted sum over all the words for "drink," the outputs are concatenated and passed through a linear layer to integrate the insights from all heads.

### Intuitive Understanding

Imagine three detectives investigating what the word "drink" is doing in this sentence:

1. **Detective 1** notes who is doing the drinking and what is being drunk: "It's the cat, and it's milk."
2. **Detective 2** identifies where this action could be relevant: "Hmm, it's in a room, and there's a bowl involved."
3. **Detective 3** looks at the sequence of actions: "First, the cat wanders, then kneels, and finally drinks."

All their observations are then combined to give a thorough understanding of the role "drink" plays in the sentence.

By attending to different aspects of the sentence, Multi-Head Attention gives the model a richer understanding of the text, which is especially useful for tasks like translation, summarization, and question answering. In effect, the mechanism lets the model form multiple "hypotheses" about each word and then integrate them into a coherent understanding.
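### Code Sketch: One Head's Attention

To make the per-head computation concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head. The embeddings and weight matrices are random toy stand-ins (in a trained model, `W_q`, `W_k`, and `W_v` are learned), so the printed attention pattern is illustrative of the mechanics, not of the "cat attends to milk" behavior described above.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["A", "cat", "wanders", "into", "a", "room", "and",
          "kneels", "down", "to", "drink", "from", "a", "bowl",
          "of", "milk"]
d_model, d_head = 16, 8

X = rng.normal(size=(len(tokens), d_model))   # toy token embeddings
W_q = rng.normal(size=(d_model, d_head))      # learned in a real model
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scores: how strongly each query position attends to each key position.
scores = Q @ K.T / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax

out = weights @ V                               # weighted sum of values

# Inspect which tokens the query "drink" attends to most in this head.
i = tokens.index("drink")
for tok, w in sorted(zip(tokens, weights[i]), key=lambda p: -p[1])[:3]:
    print(f"{tok:>8s}  {w:.3f}")
```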
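### Code Sketch: Aggregating the Heads

And here, under the same toy assumptions, is the aggregation step: each of the 3 heads produces its own weighted sum, the results are concatenated, and a final linear layer `W_o` mixes the heads' "observations" back into a single representation per token.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

n_tokens, d_model, n_heads = 16, 24, 3
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))        # toy embeddings

heads = []
for _ in range(n_heads):   # e.g. syntax / semantics / action-sequence heads
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X, W_q, W_k, W_v))

concat = np.concatenate(heads, axis=-1)   # (n_tokens, n_heads * d_head)
W_o = rng.normal(size=(n_heads * d_head, d_model))
output = concat @ W_o                     # integrated per-token representation
print(output.shape)                       # (16, 24)
```

The concatenate-then-project design is why the heads can safely specialize: each works in its own small subspace, and `W_o` decides how much of each detective's report makes it into the final account.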