Intuition behind self-attention
Self-attention is a key part of the logic at the core of modern AI models, as of 2024 at least. While understanding exactly how it works requires technical knowledge, I find that the intuition behind it is quite simple and actually helps when doing technical work.
It's important to note that self-attention is only a tiny part of a bigger architecture, but it is striking that most state-of-the-art AI models use it.
The intuition behind it:
You can see self-attention as an algorithm that creates a summary of the input data, be it text or images, that is passed to a model. The clever bit is that the summary is not produced from scratch; rather, it's created simply by assigning weights to bits of the original content being summarized.
As an example, I could summarize the paragraph just above by picking out a few of its key words, rather than creating a summary that uses different words.
When it comes to applying self-attention to text, the weight is given to the most "representative" words within a sentence. When it comes to pictures, you can think of it as a heat map. The clever trick of using the original content to create the summary allows a model to figure out autonomously which bits of the input data are most representative, using a simple piece of logic.
For example, with text as input data, a model typically assesses which words are most similar to the others (e.g. cat and dog are more similar to animal than they are to window), and it gives more weight to the words that are most similar across the board. The idea is that, out of all the words in a sentence, those most similar words are the best at distilling the core meaning of the sentence.
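To make this concrete, here is a minimal sketch in Python with NumPy. The word vectors are invented toy values chosen by hand, and a real self-attention layer adds learned query/key/value projections and many more dimensions, but the core weighting logic is the same: similarity scores become weights, and the "summary" is just a weighted mix of the original inputs.

```python
import numpy as np

# Toy 3-dimensional vectors for four words (hand-picked values,
# purely for illustration; a real model learns these).
words = ["cat", "dog", "animal", "window"]
vectors = np.array([
    [0.9, 0.1, 0.0],   # "cat"
    [0.8, 0.2, 0.0],   # "dog"
    [0.7, 0.3, 0.1],   # "animal"
    [0.0, 0.1, 0.9],   # "window"
])

# 1. Similarity of every word with every other word (dot products).
scores = vectors @ vectors.T

# 2. Turn each row of scores into weights that sum to 1 (softmax).
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# 3. Each word's new representation is a weighted mix of the original
#    vectors: nothing new is invented, the inputs are just re-weighted.
summary = weights @ vectors

for word, w in zip(words, weights):
    print(word, np.round(w, 2))
# Words that are similar to many others ("cat", "dog", "animal") give and
# receive larger weights than the outlier ("window").
```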
This is clearly a very rough approximation of what a real summary would look like - but it works like a charm!
Interestingly, this is the same philosophy behind search algorithms and the PageRank algorithm.
There's plenty of information, videos and material on the web for those who want to dig deeper; here I will just point you to the key original paper on attention: Attention Is All You Need.