Transformer LMs
David Strohmaier, ALTA Institute, University of Cambridge
"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." ~ Shazeer (2020)
Goals of this Presentation
- Provide an introduction to transformer language models (LMs)
- Understand basic architecture and its variations
- Understand sources of failure
- Highlight details of interest for philosophers and cognitive scientists
Generative AI
- Models that generate text, images, videos, etc., which can be consumed directly.
- In the case of text: typically sequence-to-sequence (seq2seq)
- Model architectures that can be used for generation:
- Recurrent Neural Networks
- Diffusion models (images)
- Transformers ← today
- ...
Language Models
- A language model (LM) provides a probability distribution over tokens given a context
- $p(\text{word} \mid \text{context}) \approx \Phi(\text{context})$
- Old approach: Counting occurrences
- Neural language models: Around since the early 2000s
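To make the contrast concrete, here is a minimal sketch of the counting approach: a bigram model estimated by relative frequency (the toy corpus is invented for illustration).

```python
from collections import Counter, defaultdict

# Toy corpus; in practice the counts come from a large text collection.
corpus = "the world is everything that is the case".split()

# Count bigram occurrences: how often does `word` follow `context`?
bigram_counts = defaultdict(Counter)
for context, word in zip(corpus, corpus[1:]):
    bigram_counts[context][word] += 1

def p(word, context):
    """Estimate p(word | context) by relative frequency."""
    counts = bigram_counts[context]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

print(p("world", "the"))  # 0.5: in the toy corpus "the" is followed by "world" and "case"
```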
Transformer Models
- Introduced by Vaswani et al. (2017)
- A type of neural architecture based on the attention mechanism
- Efficient for processing and producing sequences
(image taken from Vaswani et al. (2017) and modified)


Simplifying! (cf. Geva et al. 2022 for feedforward layers)
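As a rough illustration of this simplified picture, here is a sketch of a single transformer block in PyTorch. It uses the pre-norm variant; the dimensions, the GELU activation and the 4× feed-forward expansion are common choices assumed for illustration, not the only options.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A single, simplified transformer block: self-attention plus feed-forward,
    each wrapped in a residual connection with layer normalisation."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Self-attention: every position attends to every other position.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Position-wise feed-forward layer (cf. Geva et al. 2022).
        x = x + self.ff(self.norm2(x))
        return x

x = torch.randn(1, 10, 512)          # (batch, sequence length, model dimension)
print(TransformerBlock()(x).shape)   # torch.Size([1, 10, 512])
```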
Attention Mechanism
- A way to contextualise representations
- Representations are combined based on their similarity (dot product); see the sketch at the end of this section
- "Attention Is all You Need" (Vaswani et al. 2017)
- "Attention Is not all You Need" (Dong et al. 2021)
- "Attention is Turing Complete" (Pérez et al. 2021) ← under certain assumptions
- https://www.isattentionallyouneed.com/
Image source: Fig 3 in Bahdanau et al., 2015 https://arxiv.org/pdf/1409.0473
- Attention weights are often used to interpret models.
- Some attention heads specialise in specific syntactic phenomena (e.g. Clark et al. 2019; see also Rogers et al. 2020).
- But attention weights are not simply a measure of importance.
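Here is the sketch referred to above: scaled dot-product attention in plain NumPy. For brevity, the queries, keys and values reuse the input representations directly rather than going through learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al. 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise similarities (dot products)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted combination of the value vectors

# Toy example: 3 token representations of dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
# In self-attention, Q, K and V are linear projections of X;
# here X is reused directly to keep the sketch short.
print(attention(X, X, X).shape)  # (3, 4): one contextualised vector per token
```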
Semantic Space
- Embeddings/activations: the vector of numbers a layer outputs
- Embeddings/activations live in a Cartesian (vector) space
- Geometric interpretation of embeddings
- Space exhibits interesting regularities
- Even when reduced to 2D
- Why would that be?
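One way to look at such regularities is to project the embeddings down to 2D, for example with PCA. The sketch below uses random numbers as a stand-in for real embeddings; with vectors taken from a trained model, related words tend to land near each other even after this drastic reduction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical embedding matrix: one row per word, e.g. extracted from the
# input-embedding layer of a trained model (random numbers stand in here).
words = ["king", "queen", "man", "woman", "paris", "france"]
embeddings = np.random.default_rng(0).normal(size=(len(words), 768))

# Project the high-dimensional vectors down to 2D for plotting.
coords = PCA(n_components=2).fit_transform(embeddings)
for word, (x, y) in zip(words, coords):
    print(f"{word:>8}: ({x:+.2f}, {y:+.2f})")
```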
Generation
- How to get from the model's hidden state to a probability distribution over the vocabulary
- Different decoding strategies
- Sampling from the probability distribution:
- Temperature: adjusts the shape of the probability distribution (see the sketch below)
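A minimal sketch of temperature sampling, assuming we already have the logits the model assigns to the vocabulary:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Sample a token id from the model's output logits.

    temperature < 1 sharpens the distribution (more deterministic),
    temperature > 1 flattens it (more diverse, more errors)."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()               # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

# Toy logits over a 5-token vocabulary.
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_next_token(logits, temperature=0.7))
```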
Next-word vs. masked language modelling
- Next-word (autoregressive): predict one (sub-)token at a time, left to right, taking the model's own output as input.
- Masked: predict a (sub-)token anywhere in the sequence, usually a single token or whole word.
- Surprising differences!
The world is everything that is the ___.
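Both ways of filling the blank can be tried with the Hugging Face transformers library. The sketch below assumes bert-base-uncased as the masked LM and gpt2 as the next-word LM (both are downloaded on first use; any comparable pair of models would do).

```python
from transformers import pipeline

# Masked LM: predict a token anywhere in the sequence.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The world is everything that is the [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))

# Next-word (autoregressive) LM: continue the sequence left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The world is everything that is the", max_new_tokens=1)[0]["generated_text"])
```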
Training
- Large amounts of data
- In batches: predict target word (masked or next) + backpropagation
- Models might be undertrained
Backpropagation
- Pass a batch of data through the model (forward pass)
- Calculate the loss: the divergence between output and target
- Use the gradient of the loss to adjust the weights
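A minimal sketch of one such training step in PyTorch; the tiny two-layer `model` below is only a stand-in for a real transformer LM, and the token ids are random.

```python
import torch
import torch.nn as nn

# Stand-in "model" mapping token ids to logits over the vocabulary;
# a real LM would be a stack of transformer blocks.
vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A toy batch: each target token is simply the next token in the sequence.
tokens = torch.randint(0, vocab_size, (8, 33))         # (batch, length + 1)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                                  # forward pass
loss = loss_fn(logits.reshape(-1, vocab_size),          # divergence between
               targets.reshape(-1))                     # output and target
loss.backward()                                         # backpropagation
optimizer.step()                                        # adjust the weights
optimizer.zero_grad()
print(loss.item())
```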
Some bold claims
- The neural network is not a blank slate
- Architectural bias
- Random initialisation, but from which distribution? (see the sketch after this list)
- Lottery ticket hypothesis (Frankle & Carbin 2018)
- The training is not just gradual
- Especially when it comes to compositional skills
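On the choice of initialisation distribution, here is a quick sketch of two common options in PyTorch (the layer size is arbitrary):

```python
import torch
import torch.nn as nn

# The same layer initialised from different distributions; which one is used
# is an architectural choice made before any data is seen.
layer = nn.Linear(512, 512)

nn.init.xavier_uniform_(layer.weight)   # Glorot/Xavier: uniform, variance scaled by fan-in + fan-out
print(layer.weight.std().item())

nn.init.kaiming_normal_(layer.weight)   # He/Kaiming: normal, variance scaled by fan-in
print(layer.weight.std().item())
```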
What I have not covered
- Prompting
- Hyperparameter search
- Finetuning
- RLHF
- Multi-modality
- Lots of architectural details: activation functions, positional encodings, ...
- ...
References
- Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look at? An Analysis of BERT’s Attention. In T. Linzen, G. Chrupała, Y. Belinkov, & D. Hupkes (Eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (pp. 276–286). Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-4828
- Dong, Y., Cordonnier, J.-B., & Loukas, A. (2021). Attention is not all you need: Pure attention loses rank doubly exponentially with depth. Proceedings of the 38th International Conference on Machine Learning, 2793–2803. https://proceedings.mlr.press/v139/dong21a.html
- Frankle, J., & Carbin, M. (2018). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. International Conference on Learning Representations. https://openreview.net/forum?id=rJl-b3RcF7
- Geva, M., Caciularu, A., Wang, K., & Goldberg, Y. (2022). Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 30–45. https://aclanthology.org/2022.emnlp-main.3
- Pérez, J., Barceló, P., & Marinkovic, J. (2021). Attention is Turing-Complete. Journal of Machine Learning Research, 22(75), 1–35.
- Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8, 842–866. https://doi.org/10.1162/tacl_a_00349
- Shazeer, N. (2020). GLU Variants Improve Transformer (arXiv:2002.05202). arXiv. https://doi.org/10.48550/arXiv.2002.05202
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information Processing Systems, 30. https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html