End of 2024: Transformers and Transformation

2024 was my year of training Transformer Language Models (LMs). The hours I spent building, training, and debugging my models!

I’ve been building NLP models on top of LMs for a while, but this year I spent much more time training them from scratch, trying many variations and learning the hard way how brittle these systems are and how difficult they are to comprehend.

A well-known Feynman quote goes:

What I cannot create, I do not understand.

Wisely, this sentence is a one-sided conditional. It does not guarantee that you understand what you create. My understanding of the models I train is certainly imperfect. Even when they work, when I’ve created them successfully, they remain opaque.

[Image: a printing press]

One can also question to what extent I created my language models. The data, the PyTorch framework, and the CUDA library all have different authors. Writing the first version of the program, choosing the architectural details, selecting the data sources: it is easy to believe one is in control. But, at least for me, this feeling was rapidly punctured by the failure of the first version. And then the failure of the second version, and the third, and the fourth…

Debugging a transformer model is hard work, and it is often exceedingly difficult to tell whether one has committed a coding error or merely selected the wrong hyperparameters (learning rate, batch size, etc.). I have certainly done both at the same time, making it nearly impossible to tell why the model underperforms. I went through the code again and again, and re-read the relevant passages in two papers just as often.
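One diagnostic that can separate the two failure modes is to try to overfit a single small batch: if the model cannot drive the loss close to zero on data it sees over and over, the fault is almost certainly in the code rather than in the hyperparameters. The sketch below illustrates the idea; the TinyLM module and all its numbers are illustrative stand-ins, not the models or settings from my own experiments.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    class TinyLM(nn.Module):
        """A deliberately tiny causal LM, for illustration only."""

        def __init__(self, vocab_size=64, d_model=32, n_heads=4, max_len=32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pos = nn.Embedding(max_len, d_model)
            layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=64,
                dropout=0.0, batch_first=True,  # no dropout: we want to overfit
            )
            self.block = nn.TransformerEncoder(layer, num_layers=1)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, x):
            seq_len = x.size(1)
            # Causal mask: position i may not attend to positions > i.
            mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
            h = self.embed(x) + self.pos(torch.arange(seq_len))
            return self.head(self.block(h, mask=mask))

    model = TinyLM()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # One fixed batch of random token ids; next-token prediction targets.
    tokens = torch.randint(0, 64, (4, 16))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]

    # A healthy implementation memorizes this single batch easily. If the
    # loss plateaus well above zero, suspect a bug (a missing causal mask,
    # mis-aligned targets) before blaming the learning rate.
    for step in range(500):
        opt.zero_grad()
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        opt.step()
    print(f"loss after overfitting one batch: {loss.item():.4f}")

A check like this does not tell you which error you made, but it narrows the search space considerably, which was exactly what I lacked.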

There have been hours, days, even weeks in which I wanted to throw it all out, never touch a piece of Python code again, and read philosophy instead. In those moments, I reminded myself of C.S. Peirce’s claim that some hours in the laboratory would improve philosophers (Peirce 2001, p. 29):

In my opinion, the present infantile condition of philosophy, — for as long as earnest and industrious students of it are able to come to agreement upon scarce a single principle, I do not see how it can be considered as otherwise than in its infancy,—is due to the fact that during this century it has chiefly been pursued by men who have not been nurtured in dissecting-rooms and other laboratories, and who consequently have not been animated by the true scientific Eros, but who have on the contrary come from theological seminaries, and have consequently been inflamed with a desire to amend the lives of themselves and others, a spirit no doubt more important than the love of science, for men in average situations, but radically unfitting them for the task of scientific investigation.

It was not the scientific Eros that animated me when my models failed to perform. My experience was rather one of frustration. But this frustration, too, might establish habits of scientific work and thinking: neither code nor reality bends easily to your wishes or intuitions. It is all too easy to convince yourself of what you want to believe, especially if you have the argumentation skills of an academic philosopher. Making a computer model perform as you wish is another matter.

[Figure: a rising loss curve. The loss should go down, alas.]

When the models finally started to learn, it was less a joyful surprise than a relief. It should have worked much earlier, but the settings did not, and for a long time I failed to consider one architectural detail, so an implementation error produced sub-par performance over and over again.

I imposed some of this pain on myself. Using a slightly higher-level framework would have helped; Fabric, for example, which I can recommend based on experience in other projects. But I relied primarily on plain PyTorch. The hours spent on multi-processing issues as a result! This choice cost me valuable time, but it meant that I had more control. The hours of debugging were the price of giving myself the opportunity to commit the errors. Taking control confronted me forcefully with my own limitations. I am in control, and I am to blame.
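To give a sense of what such a framework buys you, here is a minimal sketch of a Fabric training step. It is illustrative only, not code from my project: the model and data are trivial stand-ins. Fabric takes over device placement, process launching, and the backward call, which is precisely the plumbing that cost me hours in plain PyTorch.

    import torch
    from lightning.fabric import Fabric

    # Illustrative stand-ins, not a real model or dataset.
    model = torch.nn.Linear(32, 32)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # Fabric picks the accelerator and spawns worker processes as needed,
    # replacing hand-written torch.multiprocessing / DDP setup.
    fabric = Fabric(accelerator="auto", devices=1)  # e.g. devices=4, strategy="ddp" to scale out
    fabric.launch()
    model, optimizer = fabric.setup(model, optimizer)

    batch = fabric.to_device(torch.randn(8, 32))
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch), batch)
    fabric.backward(loss)  # replaces loss.backward()
    optimizer.step()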

It is tempting to blame technology for an experienced lack of agency, especially when one receives the technology packaged and sealed off as a consumer product. By building one’s own model, one reclaims some agency and might learn why one traded that agency away in the first place. Once I ventured down this road (and one could go much further than I did, by not relying on frameworks like PyTorch at all and implementing backpropagation from scratch), I was changed by it. I got to know and overcome some of my limitations.

I’ve described here the personal experience of my Transformer research, mixed with speculations about its effects on character and habits. But solitary time spent in front of the screen is far from the entirety of the scientific effort. The social exchange of beliefs is a key aspect of Peirce’s motivation for scientific investigation (see its role in The Fixation of Belief, in Peirce 1992).

My intention for 2025 is to share more of my work and its fruits. I’ve already started, for example by teaching a masterclass on LLMs for philosophers at the University of Barcelona. In 2025, I’ll give a few talks and, reviewers willing, I might publish one or more papers. I want to refine the insights gained and the habits built up during 2024, and put them to use.

References

  • Peirce, C. S. (1992). The Essential Peirce, Volume 1: Selected Philosophical Writings (1867–1893). Indiana University Press.
  • Peirce, C. S. (2001). The Essential Peirce, Volume 2: Selected Philosophical Writings (1893–1913). Indiana University Press.