David Strohmaier

Speculations about Transformers and Compositionality

Warning: Speculative Content. Expect that parts of it will be proven wrong.
  1. The meaning of natural language sentences is compositional.
    • The meaning of an expression \( \mathbf{E} \) syntactically derived from the sub-expressions \( \mathbf{E}_1, \mathbf{E}_2, \dots \) is a function of the semantic value of the sub-expressions. Writing \( |\mathbf{E}| \) for the semantic value of the expression \( \mathbf{E} \), the compositional thesis is that \( |\mathbf{E}| = f( |\mathbf{E}_1, \mathbf{E}_2, \dots | ) \).1
  2. Transformers (Vaswani et al. 2017) do not correctly implement the compositional semantics of natural language cognition as found in human agents.
    • Human cognition includes a dedicated mechanism to derive the meaning of the expression \( \mathbf{E} \) from its sub-expressions compositionally.2
    • There is no dedicated mechanism in transformers to derive the meaning of the overall expression \( \mathbf{E} \).
  3. Transformers have to compensate for their lack of directly compositional language processing and partially succeed in this.
    • Attention allows transformers to partially compensate for lacking a compositional mechanism.
    • The compensatory role of attention is part of the explanation for why some attention heads reflect syntactic connections (see section 4.2.1 of Rogers et al. 2020).
  4. The compensation mechanisms of transformers lead to over-contextualisation of later-level token embeddings.
    • The over-contextualisation is a partial explanation of why transformer embeddings from earlier levels perform better on lexical semantic tasks (cf. Vulić et al. 2020).
  5. Some limitations of transformers will be overcome by using a mechanism that reflects composition more directly.3
    • A mechanism other than attention will be used.
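
To make the contrast in theses 2 and 3 concrete, here is a toy sketch and nothing more. It assumes the atomic variant of the compositional formula from footnote 1, a hypothetical two-dimensional lexicon with made-up values, and a stripped-down single-head self-attention with identity projections and no positional encodings. None of this is how a real transformer or a human parser is built; the names `compose`, `tanh_sum` and `self_attention` are illustrative only.

```python
import math


def compose(tree, lexicon, f):
    """Dedicated compositional mechanism (toy): recurse over a syntax tree.

    A leaf is a word (str); an inner node is a tuple of sub-trees.
    The value of a node is f applied to the values of its children.
    """
    if isinstance(tree, str):
        return lexicon[tree]
    return f([compose(child, lexicon, f) for child in tree])


def tanh_sum(vectors):
    """A hypothetical, non-associative combination function f."""
    return [math.tanh(sum(components)) for components in zip(*vectors)]


def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Every output row is a weighted mixture of all input rows.
    No syntax tree is consulted at any point.
    """
    d = len(X[0])
    outputs, all_weights = [], []
    for query in X:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in X]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, X)) for i in range(d)])
        all_weights.append(weights)
    return outputs, all_weights


# Made-up embeddings, for illustration only.
lexicon = {"the": [0.1, 0.3], "dog": [0.9, -0.2], "barks": [-0.4, 0.7]}

# Two bracketings of the same word sequence.
left = compose((("the", "dog"), "barks"), lexicon, tanh_sum)
right = compose(("the", ("dog", "barks")), lexicon, tanh_sum)

tokens = [lexicon[w] for w in ("the", "dog", "barks")]
mixed, weights = self_attention(tokens)
```

In this sketch `left` and `right` differ because `compose` follows the bracketing, whereas `self_attention` returns one contextualised vector per token and never sees a tree. Any sensitivity to syntactic structure would have to be learned into the attention weights, which is the sense of "compensation" speculated about above.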

For feedback, comments, and complaints, email me at [email protected]. Links to relevant research are appreciated.



  1. Here I am using the more general formula of Kit Fine’s (2007) Semantic Relationism. More commonly, the sub-expressions are taken to contribute their semantic values in an atomic fashion, i.e. \( \vert \mathbf{E} \vert = f( \vert \mathbf{E}_1 \vert, \vert \mathbf{E}_2 \vert, \dots ) \).

  2. This has, to my knowledge, not been established yet. It has been argued for, in some form, by cognitive scientists such as Fodor & Lepore (2002), but my more optimistic assessment of neural methods is hard to square with these arguments. According to my understanding, empirical evidence from the level of cognitive neuroscience is missing or weak (e.g. Pylkkänen 2020). 

  3. This claim does not concern whether dealing with compositionality in the absence of a new mechanism is possible, but instead concerns which development path will be taken due to differences in feasibility. 
