Transformer LMs¶

David Strohmaier, ALTA Institute, University of Cambridge

[email protected]

"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." ~ Shazeer (2020)

Goals of this Presentation¶

  • Provide an introduction to transformer language models (LMs)
  • Understand basic architecture and its variations
  • Understand sources of failure
  • Highlight details of interest for philosophers and cognitive scientists

Generative AI¶

  • Models that generate text, images, videos etc. that can be directly consumed.
  • In the case of text: typically sequence-to-sequence (seq2seq)
  • Model architectures that can be used for generation:
    • Recurrent Neural Networks
    • Diffusion models (images)
    • Transformers ←today
    • ...

Language Models¶

  • A language model (LM) provides a probability distribution over tokens given a context
  • $p(\text{word} \mid \text{context}) \approx \Phi(\text{context})$
  • Old approach: Counting occurrences (toy sketch below)
  • Neural language models: Around since the early 2000s
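
Not in the original slides, but as a toy illustration of the counting approach: a minimal bigram model that estimates $p(\text{word} \mid \text{context})$ by relative frequency over a tiny made-up corpus (all names and the corpus are illustrative).

```python
from collections import Counter, defaultdict

# Toy corpus; in practice this would be billions of tokens.
corpus = "the world is everything that is the case".split()

# Count bigram occurrences: how often does `word` follow `context`?
counts = defaultdict(Counter)
for context, word in zip(corpus, corpus[1:]):
    counts[context][word] += 1

def p(word, context):
    """Estimate p(word | context) as a relative frequency."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

print(p("case", "the"))   # 0.5 -- "the" is followed once by "world", once by "case"
print(p("world", "the"))  # 0.5
```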

Transformer Models¶

  • Introduced by Vaswani et al. (2017)
  • A type of neural architecture based on the attention mechanism
  • Efficient for processing and producing sequences

original transformer architecture

(image taken from Vaswani et al. (2017) and modified)

simplified transformer architecture

transformer block expanded

Simplifying! (cf. Geva et al. 2022 for feedforward layers)
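
To make the simplified picture concrete, here is a minimal sketch (NumPy, a single attention head, no layer norm, dropout, or masking; all weight names are illustrative) of what one transformer block computes: self-attention to contextualise the token representations, then a feed-forward layer, each followed by a residual connection.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    """One simplified block: self-attention + feed-forward, each with a residual.
    X has shape (sequence_length, d_model); layer norm and multiple heads are omitted."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V    # contextualise the tokens
    X = X + attn                                          # residual connection
    ff = np.maximum(0, X @ W1) @ W2                       # feed-forward layer (ReLU)
    return X + ff                                         # residual connection

d = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(5, d))                               # 5 token representations
Wq, Wk, Wv, W1, W2 = (rng.normal(size=(d, d)) * 0.1 for _ in range(5))
print(transformer_block(X, Wq, Wk, Wv, W1, W2).shape)     # (5, 8)
```

A full model stacks many such blocks, so each layer's output becomes the next layer's input.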

Attention Mechanism¶

  • A way to contextualise representations
  • Representations are combined based on their similarity (dot product); a minimal sketch follows below

  • "Attention Is all You Need" (Vaswani et al. 2017)
  • "Attention Is not all You Need" (Dong et al. 2021)
  • "Attention is Turing Complete" (Pérez et al. 2021) ← under certain assumptions
  • https://www.isattentionallyouneed.com/
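
A minimal sketch of the core computation, assuming single-head scaled dot-product attention: every position's query is compared with every key, the similarities are turned into weights with a softmax, and the values are mixed according to those weights.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (single head, no masking)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V, weights                               # weighted mix of values

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))            # 4 token representations, self-attention
out, w = attention(X, X, X)
print(w.round(2))                      # each row is a distribution over positions
```

The weight matrix is what interpretability work typically inspects: each row records how much every other position contributed to a token's new representation.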

Image source: Fig 3 in Bahdanau et al., 2015 https://arxiv.org/pdf/1409.0473

  • Attention weights are used to interpret models.
  • Some attention heads specialise in specific syntactic phenomena (e.g. Clark et al. 2019; see also Rogers et al. 2020).
  • But attention weights are not simply a measure of importance.

Semantic Space¶

  • Embeddings/activations: The vector of numbers a layer outputs
  • Embeddings/activations live in a Cartesian space
  • Geometric interpretation of embeddings
  • Space exhibits interesting regularities (toy sketch below)
    • Even when reduced to 2D
    • Why would that be?
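
As a toy illustration of the geometric view, with made-up 3-dimensional vectors standing in for real embeddings (which have hundreds or thousands of dimensions): similarity is measured by the angle between vectors, and one well-known regularity is that analogies show up as roughly parallel offsets.

```python
import numpy as np

# Toy embeddings (illustrative values only, not from any trained model)
emb = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.1, 0.7]),
    "man":   np.array([0.3, 0.9, 0.1]),
    "woman": np.array([0.3, 0.1, 0.9]),
}

def cosine(a, b):
    """Similarity as the cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The offset king - man is close to queen - woman: a regularity in the space.
target = emb["king"] - emb["man"] + emb["woman"]
for word, vec in emb.items():
    print(word, round(cosine(target, vec), 2))   # "queen" scores highest
```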

Generation¶

  • How to get from the model's output state to a distribution over the vocabulary
  • Different decoding strategies
  • Sampling from the probability distribution (sketch below):
    • Temperature: sharpens or flattens the probability distribution
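
A minimal sketch of temperature sampling, assuming the model has already produced a vector of logits (one score per vocabulary item); the vocabulary and scores here are made up. Dividing the logits by a temperature below 1 sharpens the distribution, above 1 flattens it.

```python
import numpy as np

vocab = ["case", "world", "same", "best"]          # illustrative vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.1])            # illustrative model scores
rng = np.random.default_rng(0)

def sample(logits, temperature=1.0):
    scaled = logits / temperature                  # T < 1 sharpens, T > 1 flattens
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()                    # softmax -> probability distribution
    return vocab[rng.choice(len(probs), p=probs)], probs

for T in (0.5, 1.0, 2.0):
    word, probs = sample(logits, temperature=T)
    print(T, word, probs.round(2))
```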

Next-word vs. masked language model¶

  • Predict one (sub-)token at a time, left to right.
  • Autoregressive language modelling (i.e. taking its own output as input)

  • Predict (sub-)token anywhere in the sequence.
  • Usually a single token or whole word.
  • Surprising differences! (both objectives sketched below)
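
A sketch of how the two objectives carve up the same sentence, in plain Python over word tokens (real models operate on sub-word tokens and use a special mask symbol):

```python
sentence = "the world is everything that is the case".split()

# Next-word (autoregressive): predict each token from everything to its left.
causal_examples = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
print(causal_examples[2])   # (['the', 'world', 'is'], 'everything')

# Masked language modelling: hide a token anywhere, predict it from both sides.
i = 3
masked = sentence[:i] + ["[MASK]"] + sentence[i + 1:]
print(masked, "->", sentence[i])   # target: 'everything'
```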

The world is everything that is the ___.¶


Training¶

  • Large amounts of data
  • In batches: predict target word (masked or next) + backpropagation
  • Models might be undertrained (too little data for their parameter count)

Backpropagation¶

  1. Pass a batch of data through the model (forward pass)
  2. Calculate the loss: the divergence between output and target
  3. Use the gradient of the loss to adjust the weights (all three steps sketched below)
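
A minimal sketch of these three steps on a toy linear model with a squared-error loss, with the gradient written out by hand in NumPy; real language models use a cross-entropy loss over the vocabulary and automatic differentiation, but the loop has the same shape. Everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2)) * 0.1        # model weights
X = rng.normal(size=(8, 4))              # one batch of inputs
Y = rng.normal(size=(8, 2))              # targets

lr = 0.1
for step in range(100):
    pred = X @ W                          # 1. forward pass over the batch
    loss = ((pred - Y) ** 2).mean()       # 2. loss: divergence from target
    grad = 2 * X.T @ (pred - Y) / Y.size  # 3. gradient of the loss w.r.t. weights
    W -= lr * grad                        #    adjust weights down the gradient
print(float(loss))                        # loss shrinks over the steps
```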

Some bold claims¶

  • The neural network is not a blank slate
    • Architectural bias
    • Random initialisation, but which distribution?
    • Lottery ticket hypothesis (Frankle & Carbin 2018)
  • The training is not just gradual
    • Especially when it comes to compositional skills

What I have not covered¶

  • Prompting
  • Hyperparameter search
  • Finetuning
  • RLHF
  • Multi-modality
  • Lots of architectural details: activation functions, positional encodings, ...
  • ...

References¶

  • Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look at? An Analysis of BERT’s Attention. In T. Linzen, G. Chrupała, Y. Belinkov, & D. Hupkes (Eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (pp. 276–286). Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-4828
  • Dong, Y., Cordonnier, J.-B., & Loukas, A. (2021). Attention is not all you need: Pure attention loses rank doubly exponentially with depth. Proceedings of the 38th International Conference on Machine Learning, 2793–2803. https://proceedings.mlr.press/v139/dong21a.html
  • Frankle, J., & Carbin, M. (2018, September 27). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. International Conference on Learning Representations. https://openreview.net/forum?id=rJl-b3RcF7
  • Geva, M., Caciularu, A., Wang, K., & Goldberg, Y. (2022). Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 30–45. https://aclanthology.org/2022.emnlp-main.3
  • Pérez, J., Barceló, P., & Marinkovic, J. (2021). Attention is Turing-Complete. Journal of Machine Learning Research, 22(75), 1–35.
  • Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8, 842–866. https://doi.org/10.1162/tacl_a_00349
  • Shazeer, N. (2020). GLU Variants Improve Transformer (arXiv:2002.05202). arXiv. https://doi.org/10.48550/arXiv.2002.05202
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information Processing Systems, 30. https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html