Intuition behind self-attention

In this post I want to share the intuition around a core mechanism in modern AI models. This post doesn't require technical knowledge, but if you have technical knowledge you might still find the intuition helpful!

This core mechanism ("algorithm") is called self-attention. The algorithm is only a tiny part of any model, but it is striking that most state-of-the-art AI uses it (as of 2024).

Architecture of transformers and attention blocks, from the "Attention is all you need" paper

The intuition behind it:

You can think of self-attention as an algorithm that creates a summary of the information passed to a model (the "input data", which can be anything: images, text, sounds, numbers). The clever bit is that the summary is created simply by assigning weights to bits of the original content being summarised.
As an example, I could summarise the paragraph just above simply by reusing a handful of its own key words, rather than creating a summary that uses different words.
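For the technically curious, here is a minimal sketch of that weighting idea in code. The word vectors and numbers are entirely made up for illustration (a real model learns them, and also adds scaling and learned projections); the only point is that the output is a weighted mix of the original input.

import numpy as np

words = ["cats", "and", "dogs"]          # toy input
vectors = np.array([[1.0, 0.2],          # made-up 2-dimensional representations
                    [0.1, 0.9],
                    [0.9, 0.3]])

# Scores: how much each word "attends" to every other word,
# here simply the dot product between their vectors.
scores = vectors @ vectors.T

# Turn the scores into weights that sum to 1 for each word (softmax).
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each word's new representation is a weighted sum of the ORIGINAL vectors -
# a "summary" built only out of content that was already there.
summary = weights @ vectors
print(np.round(weights, 2))
print(np.round(summary, 2))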

The clever trick of using the original content to create a summary allows a model to figure things out autonomously, without external supervision. This makes it extremely helpful for learning, as it essentially highlights key pieces of information and removes some of the noise.

The logic used by self-attention is also quite simple: it just selects bits of data that stand out from the rest.


Example: when applying self-attention to text, we would highlight the words that are most "representative" within a sentence and remove repetition. The most representative words can simply be the ones most similar to the others, as they are the best suited to distilling the core meaning of the sentence.

In the sentence:

"Cats and dogs, like all pets, like cuddles"

The word "pets" represents both cats and dogs well, which makes them redundant. We could then summarise the sentence above as "pets like cuddles".

This is clearly a very rough approximation of what a real summary would look like - but it works like a charm!
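If you're curious what "most similar to the others" could look like concretely, here is a rough sketch with hand-made numbers: the word vectors below are invented so that "pets" sits close to both "cats" and "dogs", whereas a real model would learn such representations from data.

import numpy as np

words = ["cats", "dogs", "pets", "like", "cuddles"]
vectors = np.array([[0.9, 0.1, 0.0],   # cats  - close to pets
                    [0.8, 0.2, 0.0],   # dogs  - close to pets
                    [0.9, 0.2, 0.1],   # pets
                    [0.0, 0.9, 0.1],   # like
                    [0.1, 0.1, 0.9]])  # cuddles

# Similarity of every word to every other word (dot products).
similarity = vectors @ vectors.T

# A word's "representativeness" is how similar it is to the rest of the sentence.
representativeness = similarity.sum(axis=1)

# Ranking the words: "pets" comes out ahead of "cats" and "dogs",
# which is roughly the summary "pets like cuddles".
for word, score in sorted(zip(words, representativeness), key=lambda x: -x[1]):
    print(f"{word}: {score:.2f}")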


Interestingly, this is the same philosophy behind search algorithms and the PageRank algorithm.

There's plenty of information, videos and material on the web for those who want to dig deeper; here I will just point you to the key original paper on attention: "Attention is all you need".

If you have any thoughts on this, I would be keen to hear them!

Best,
Andrea