Attention is all you need: Self Attention Clearly explained!

An illustrated guide! ✍️ Author: @akshay_pachaar

In 2017, a groundbreaking paper titled "Attention is All You Need" introduced the transformer architecture, which led to the Large Language Model (LLM) revolution that we witness today.

At the heart of this architecture lies the attention mechanism.

In this post, I'll clearly explain self-attention & how it can be thought of as a directed graph.

Before we start, a quick primer on tokenization!

Raw text → Tokenization → Embedding → Model

An embedding is a meaningful representation of each token (roughly a word) as a vector of numbers.

This embedding is what we provide as an input to our language models.

Check this👇
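To make this concrete, here's a minimal sketch of the pipeline in PyTorch, assuming a toy whitespace "tokenizer" and a tiny made-up vocabulary (real LLMs use subword tokenizers such as BPE, but the flow is the same):

```python
import torch
import torch.nn as nn

# Toy example: a hypothetical whitespace "tokenizer" and a tiny vocabulary.
vocab = {"i": 0, "love": 1, "tennis": 2}
text = "I love tennis"

# Raw text -> token ids
token_ids = torch.tensor([vocab[w] for w in text.lower().split()])  # tensor([0, 1, 2])

# Token ids -> embeddings: each token becomes a vector of numbers
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
x = embedding(token_ids)

print(x.shape)  # torch.Size([3, 8]) -- 3 tokens, 8 numbers each
```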

The core idea of language modeling is to understand the structure and patterns within language.

By modeling the relationships between words (tokens) in a sentence, we can capture the context and meaning of the text.

Now, self-attention is a communication mechanism that helps establish these relationships, expressed as probability scores.

Each token assigns the highest score to itself and additional scores to other tokens based on their relevance.

You can think of it as a directed graph 👇
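To make the graph picture concrete, here's a small, made-up attention matrix for "I love tennis": each row is a token, each entry is the weight on the outgoing edge to another token, and every row sums to 1.

```python
import torch

# Hypothetical attention weights for the tokens ["I", "love", "tennis"].
# Row i holds the probability scores token i assigns to every token
# (including itself): the weights on the outgoing edges of node i.
attn = torch.tensor([
    [0.60, 0.30, 0.10],  # "I"      -> itself, then "love", then "tennis"
    [0.20, 0.50, 0.30],  # "love"
    [0.10, 0.30, 0.60],  # "tennis"
])

print(attn.sum(dim=-1))  # every row sums to 1
```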

To understand how these probability/attention scores are obtained, we must first understand 3 key terms:

  • Query Vector

  • Key Vector

  • Value Vector

Crafting Keys, Queries, and Values:

For each token, we derive three elements: a key (K), a query (Q), and a value (V).

Let’s relate this to a library analogy:

🔹 Keys (K): Imagine each book in a library has a unique identifier. This identifier, or key, helps us locate the book on the shelf.

🔸 Query (Q): Suppose you’re looking for a specific book. The information you have about this book acts as the query, which is used to search through the library.

🔹 Value (V): Once you find your book, the content inside it is the value. It holds the actual information you’re looking for.

So, as we derive K, Q & V for each token, we know how a token identifies itself, what it is looking for, and what it will provide to other tokens.
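Here's a minimal sketch of how K, Q & V are typically derived, assuming a toy embedding size and three learned linear projections (the sizes are illustrative):

```python
import torch
import torch.nn as nn

d_model = 8   # embedding size (illustrative)
d_head = 8    # size of each query/key/value vector (illustrative)

x = torch.randn(3, d_model)  # 3 token embeddings, e.g. for "I love tennis"

# Three learned linear projections turn each embedding into Q, K and V.
W_q = nn.Linear(d_model, d_head, bias=False)
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)

Q = W_q(x)  # what each token is looking for          -> shape (3, d_head)
K = W_k(x)  # how each token identifies itself        -> shape (3, d_head)
V = W_v(x)  # what each token offers to other tokens  -> shape (3, d_head)
```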

Building a context-aware embedding: A 3-step process!

With our keys, queries, and values in place, it’s time for the tokens to start interacting!

(refer to the image below as you read ahead)

🔹 Step 1: Attention score calculation:

Each token’s query (Q) interacts with all the keys (K) in the sentence, including its own.

This interaction is a dot product, followed by scaling (dividing by the square root of the key dimension).

The result? An attention score, representing the relationship between our token and every other token in the sentence.

Example:

Consider the sentence "I love tennis".

Let's say our token is "I". The query for "I" interacts with the key for every word, revealing how closely "I" should attend to "love" & "tennis".
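A tiny sketch of this step, using made-up query/key tensors for our three tokens (the values are random; only the shapes and the operation matter):

```python
import math
import torch

d_head = 8
Q = torch.randn(3, d_head)  # queries for "I", "love", "tennis" (random toy values)
K = torch.randn(3, d_head)  # keys for the same three tokens

# Every query is dotted with every key, then scaled by sqrt(d_head).
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_head)  # shape (3, 3)
# scores[0] = how strongly "I" matches "I", "love" and "tennis"
```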

🔸 Step 2: Softmax and Weighting

With attention scores in hand, the next step is normalization.

We apply the softmax function to these scores, ensuring that they sum up to 1.

This gives us the weights – indicating the level of attention each token should pay to every other token.

For example, "I" might pay most of its attention to "love" and less to "tennis".

🔹 Step 3: Constructing the Context

Now that we have the weights, we use them to compute a weighted sum of the values (V).

This sum is the context-aware representation of the token, incorporating information from other tokens according to the calculated attention weights.

So, the context of "I" would be a combination of the information from "love" and "tennis", with more emphasis on "love" as per our example.
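A sketch of this final step, with made-up weights and value vectors:

```python
import torch

# Made-up attention weights for "I" over ["I", "love", "tennis"] (sum to 1)
weights = torch.tensor([0.56, 0.34, 0.10])

# Value vectors for the three tokens (random toy values)
V = torch.randn(3, 8)

# Context-aware representation of "I": a weighted sum of all value vectors
context_I = weights @ V  # shape (8,)
```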

Here's an illustration of what we discussed so far👇

A big shout-out to AbacusAI for supporting my work.

The world's first end-to-end ML and LLM Ops platform where AI, not humans, build end-to-end Applied AI agents and systems at scale.

Check this out: https://abacus.ai/

Implementing self-attention in PyTorch doesn't get easier! 🚀

It's very intuitive! 💡

Check this out 👇
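As a rough sketch, a minimal single-head self-attention module (illustrative dimensions, no masking or multi-head split) could look like this:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal single-head self-attention (no masking, no multi-head split)."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)

        # Step 1: scaled dot-product scores, (batch, seq_len, seq_len)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))

        # Step 2: softmax so each row of weights sums to 1
        weights = F.softmax(scores, dim=-1)

        # Step 3: weighted sum of values -> context-aware token representations
        return weights @ V


# Usage: one sentence of 3 tokens ("I love tennis"), 8-dim embeddings
x = torch.randn(1, 3, 8)
attn = SelfAttention(d_model=8, d_head=8)
out = attn(x)
print(out.shape)  # torch.Size([1, 3, 8])
```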

I would also encourage you to check out LightningAI's ⚡️ LLM Learning Lab.

A curated collection of blogs, tutorials, and how-to videos on:

  • Training

  • Fine-tuning

  • And deploying LLMs 🚀 

Hope you enjoyed reading, share it across your socials to support my work!

Until next time!

Cheers! 🥂 
