Dr. Tiziana Ligorio x Deep Learning, Hunter College of The City University of New York
🙏 Credits: These notes borrow from the following sources:
Transformers
- A transformer is a deep learning model that adopts the mechanism of self-attention (defined next), differentially weighting the significance of each part of the input data.
- It is not defined by a specific architecture, but rather by its reliance on self-attention as its fundamental operation
- In practice, general approaches have emerged on how to combine self-attention layers into a larger network
Issues tackled (sequence processing bottlenecks)
- RNNs, LSTMs and GRUs
- struggle to retain information and capture dependencies across long-range tokens due to vanishing or exploding gradients
- process sequences step by step; this inherently sequential computation limits parallelisation during training and inference and slows computation, especially for long sequences
- rely on a fixed-size hidden state: at any time step, the hidden state must summarise all past information up to that point
Self-Attention (the basic idea)
- To understand self-attention, we will first describe the basic idea, then we will add more detail to illustrate current common implementations.
- A sequence-to-sequence operation that enables models to focus on the most relevant parts of the input when making predictions. It calculates context-dependent importance scores for each token in a sequence in parallel, allowing the model to prioritise key information.
- Input: token embeddings + positional encodings (token-aware and position-aware embeddings) of size $d_{model}$, commonly referred to as the model hidden size (📎 e.g. GPT-3 has $d_{model} = 12288$); a minimal sketch of this setup and the basic operation follows below
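
A minimal sketch of the basic idea, assuming PyTorch: inputs are token embeddings plus positional encodings of size $d_{model}$, and each output token is a weighted average of all input tokens, with weights computed from raw dot products between the inputs themselves (no learned query/key/value projections yet; those come with the fuller implementations described later). The toy sizes (`vocab_size = 100`, `seq_len = 6`, `d_model = 16`), the learned positional embedding, and the function name `basic_self_attention` are illustrative assumptions, not from the notes.

```python
import torch
import torch.nn.functional as F

vocab_size = 100   # toy vocabulary size (assumption)
seq_len = 6        # number of tokens in the sequence (assumption)
d_model = 16       # model hidden size (GPT-3 uses 12288)

# Token-aware embeddings: one learned vector per token id.
token_embedding = torch.nn.Embedding(vocab_size, d_model)

# Position-aware embeddings: one learned vector per position
# (one common choice; fixed sinusoidal encodings are another).
position_embedding = torch.nn.Embedding(seq_len, d_model)

# Example input: a sequence of token ids.
token_ids = torch.randint(0, vocab_size, (seq_len,))
positions = torch.arange(seq_len)

# Input to self-attention: token embedding + positional encoding,
# shape (seq_len, d_model).
x = token_embedding(token_ids) + position_embedding(positions)

def basic_self_attention(x):
    """Basic self-attention: each output is a weighted average of all
    inputs, with weights derived from dot products between the inputs."""
    # Raw pairwise importance scores between tokens: (seq_len, seq_len).
    scores = x @ x.T
    # Softmax turns each row into weights that sum to 1.
    weights = F.softmax(scores, dim=-1)
    # Each output token is a context-dependent mixture of all input tokens.
    return weights @ x   # (seq_len, d_model)

y = basic_self_attention(x)
print(y.shape)  # torch.Size([6, 16])
```

Note that every row of scores is computed at once with a single matrix product, which is what makes the operation parallel across tokens, in contrast to the step-by-step processing of RNNs above.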