Dr. Tiziana Ligorio x Deep Learning, Hunter College of The City University of New York
🙏 Credits: These notes borrow from the following sources:
Transformers
- A transformer is a deep learning model that adopts the mechanism of self-attention (defined next), differentially weighting the significance of each part of the input data.
- It is not defined by a specific architecture, but rather by its reliance on self-attention as its fundamental operation
- In practice, general approaches have emerged on how to combine self-attention layers into a larger network
Issues tackled (sequence processing bottlenecks)
- RNNs, LSTMs and GRUs
- struggle to retain information and capture dependencies across long-range tokens due to vanishing or exploding gradients
- process sequences step by step; this inherently sequential computation limits parallelisation during training and inference and slows computation, especially for long sequences
- rely on a fixed-size hidden state: at any time step, the hidden state must summarise all past information up to that point
Self-Attention (the basic idea)
- To understand self-attention, we will first describe the basic idea, then we will add more detail to illustrate current common implementations.
- A sequence-to-sequence operation that enables models to focus on the most relevant parts of the input when making predictions. It calculates context-dependent importance scores for each token in a sequence in parallel, allowing the model to prioritise key information.
- Input: token embeddings + positional encodings (token-aware and position-aware embeddings) of size $d_{model}$, commonly referred to as the model hidden size (📎 e.g. GPT-3 has $d_{model} = 12288$); a minimal sketch of this setup and the basic operation follows below
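
A minimal sketch of the basic idea, assuming PyTorch: inputs are token embeddings plus positional encodings of size $d_{model}$, and each output token is a weighted average of all input tokens, with weights computed from raw dot products between the inputs themselves (no learned query/key/value projections yet; those come with the fuller implementations described later). The toy sizes (`vocab_size = 100`, `seq_len = 6`, `d_model = 16`), the learned positional embedding, and the function name `basic_self_attention` are illustrative assumptions, not from the notes.

```python
import torch
import torch.nn.functional as F

vocab_size = 100   # toy vocabulary size (assumption)
seq_len = 6        # number of tokens in the sequence (assumption)
d_model = 16       # model hidden size (GPT-3 uses 12288)

# Token-aware embeddings: one learned vector per token id.
token_embedding = torch.nn.Embedding(vocab_size, d_model)

# Position-aware embeddings: one learned vector per position
# (one common choice; fixed sinusoidal encodings are another).
position_embedding = torch.nn.Embedding(seq_len, d_model)

# Example input: a sequence of token ids.
token_ids = torch.randint(0, vocab_size, (seq_len,))
positions = torch.arange(seq_len)

# Input to self-attention: token embedding + positional encoding,
# shape (seq_len, d_model).
x = token_embedding(token_ids) + position_embedding(positions)

def basic_self_attention(x):
    """Basic self-attention: each output is a weighted average of all
    inputs, with weights derived from dot products between the inputs."""
    # Raw pairwise importance scores between tokens: (seq_len, seq_len).
    scores = x @ x.T
    # Softmax turns each row into weights that sum to 1.
    weights = F.softmax(scores, dim=-1)
    # Each output token is a context-dependent mixture of all input tokens.
    return weights @ x   # (seq_len, d_model)

y = basic_self_attention(x)
print(y.shape)  # torch.Size([6, 16])
```

Note that every row of scores is computed at once with a single matrix product, which is what makes the operation parallel across tokens, in contrast to the step-by-step processing of RNNs above.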