David Strohmaier

ELM3: Contrafactives and Interdisciplinary Work

Fri, 21 Jun 2024 08:57:13 +0100

Last week, I had the pleasure of presenting a poster at the third conference on Experiments in Linguistic Meaning (ELM3). My poster presented the latest collaborative work with Simon Wimmer on the topic of contrafactives. Click here to get the poster as a PDF.¹ A paper will follow.

Walking interested conference participants through the poster, I liked to start as follows:

We are investigating contrafactives, a type of verb that does not exist.

I used the paradoxical statement as a hook, but of course it twists the matter a little: Simon and I are interested in the non-existence of contrafactives. Why are contrafactives (verbs that attribute a propositional attitude and presuppose the falsehood of the attitude’s content) not lexicalised in any language?² Why are factives such as “know” (near?) universal, while contrafactives are entirely absent from the lexicon? Why is there no word that has the meaning of “falsely believe”?

Simon and I had proposed that contrafactives might be harder to learn and found some limited evidence for it in two previous papers considering comprehension. This time we looked at generation and didn’t find any evidence to that effect at all. What does that tell us about the previous two papers? I don’t know. The previous results might have been flukes, despite the statistical significance of the results. Or there could be an asymmetry between comprehension and production. We hope that future research can solve our puzzlement.

Observations on ELM3 and Interdisciplinarity

The conference was interdisciplinary, combining linguistics and cognitive science, with a pinch of philosophy added for flavour. Many of the methods were computational: Computer-based corpus linguistics, Bayesian inference, etc. Nonetheless, I stood out as a computer scientist because these days working in NLP is almost a proper subset of working on neural network models. I was a little surprised to find that my poster was one of very few pieces of research using transformer models.³

With all the publicity for LLMs these days, it is almost encouraging to see that other work continues. Given all the resources being committed to LLMs and the like, I’m worried about an excessive concentration of research capital. ELM3 assuaged these fears somewhat, although chatting with PhDs and postdocs about the lack of job openings in formal semantics reminded me of the limitations of a field without direct industry application — philosophy PhDs are familiar with the situation.

I don’t mind having been the only researcher at the conference to train or fine-tune a transformer for their work. That being said, I’m a bit puzzled by how little impact the relative success of neural models had on the cognitive science aspect of ELM3. I heard more about LoT than connectionism. Does the success of neural methods in NLP not support a connectionist approach to cognition? The approach might be incorrect, but some more discussion would have been in order. What do we have to rethink about semantics if transformer models capture at least some aspects of it?

Footnotes

I corrected a typo after the conference. ↩
At least so far no one has been able to produce an example. ↩
I recall seeing one poster by Karl Mulligan and Kyle Rawlins using BERTScores. ↩

A Debate about Words

Thu, 02 May 2024 11:57:13 +0100

Introduction

Due to their lack of success in resolving problems, philosophers like to think that they are failing productively. For example, a philosopher might suggest that while the problem hasn’t gone away, one sees it more clearly after the debate. Such insight is a consolation prize awarded to the readers for the failure of the authors.

This blog post discusses one unsuccessful debate, a debate about the nature of words. I will summarise a series of papers and how the debate ends up being less than successful. Don’t hope for a final answer to conclude this post. My goal is merely to document one debate on the issue on the way towards the answer — and to warn against being side–tracked by metaphysics.

To those not engaged in philosophy of language and metaphysics, some terms in my summary and discussion might be unfamiliar. I’ve tried to include links in those cases, but the larger points I make at the end of the post should be accessible even if one never makes sense of those terms.

Kaplan’s Original Paper

In 1990, David Kaplan’s paper “Words” appeared, discussing the philosophy of words: What are they? What is the nature of words?

Kaplan is moved by issues relating to direct reference theory. Consider the statement “Hesperus = Phosphorus”. As it happens, both names refer to Venus, hence the identity statement is true. At one point in history the identification was a genuine discovery. But if the names refer directly, that is without mediation by e.g. descriptions (“Hesperus is the brightest star in the evening sky”), then how can the identity statement be informative?

To address such puzzles, Kaplan considers giving words a cognitive role. Hence, he is looking for a theory of words that aligns well with using a difference in names as a cognitive difference that can explain substitution effects in identity statements.

The target of Kaplan’s criticism is a form–based type–token conception of words, according to which words would be utterance tokens individuated based on their form that instantiate types of words. In its place, Kaplan proposes what he called back then a stage–continuant theory of words, according to which words were objects in this world with an initial creation event followed by repetitions and storage.

As part of his proposal, Kaplan faces the question of how to individuate words: What makes two utterances a repetition of the same word? Kaplan stresses intent. I might mispronounce a name, but as long as I intend to speak a word I have previously acquired, my utterance will be a stage of it.

Because Kaplan is primarily interested in semantic puzzles typically phrased using names and their role in identity statements, proper names play an outsized role (Kaplan 1990: 110):

I have spoken of words, though my examples have often involved names. And truth to tell, it is names at which I aim. It is names that have been thought to challenge direct reference theory.

Due to this focus, Kaplan also discusses how people can be said to share a name. Hume, Kaplan, and yours truly all happen to be called “David”, but the reference seems to differ. In response to such worries, Kaplan distinguishes common currency names from generic names. Only common currency names are used as words, while generic names are cultural artefacts on which we draw for giving proper i.e. common currency names. Hume, Kaplan, and I share a name in the sense of having been given different common currency names using (?) one generic name.

Hawthorne and Lepore’s Response

In 2011, the debate continued with a response by John Hawthorne and Ernie Lepore, entitled “On Words”. The almost 40-page-long paper responds thoroughly to Kaplan, although not always homing in on what bothered Kaplan himself. For example, Hawthorne and Lepore objected at length to Kaplan’s stage–continuant proposal, interpreting it as a form of four–dimensionalism.

More on point, Hawthorne and Lepore worry about the role of intent in Kaplan’s account. Surely, at some point intent is not enough to make utterances instances of a word. If you utterly fail by community standards to speak the word you intended to speak, you have not uttered the word?

Based on the length of discussion, Hawthorne and Lepore’s main target is Kaplan’s distinction between common currency and generic names. They painstakingly go through different possible motivations for the distinction and reject one after the other. Again, this is an issue that is rather specific to proper names and I struggle to see why much would hang on it. Consider the questions: Should we say that Hume, Kaplan, and I have one name? How literally should we take the claim that we share a name? When it comes to the nature of words, these questions are a sideshow. At best, they are getting at something more important: What role does reference play in the individuation of words? To answer that question, we should consider words other than names.

Interestingly, Hawthorne and Lepore end on a sceptical note, doubting whether metaphysics can provide individuation criteria for words, either because the facts accessible to us are insufficient to establish them, or because words have no proper place in our final ontology. They don’t see much hope even if their lexeme-like conception of words were “supplemented with the tools of theoretical linguistics” (Hawthorne and Lepore 2011: 485).

Kaplan’s Response to the Response

Kaplan responded with a further paper: “Words on Words”. A few misunderstandings are cleared up, and reading the paper it becomes apparent that Kaplan is much less wedded to any metaphysics than his interlocutors presumed, and definitely less than to a good joke. Kaplan does not want to commit to four–dimensionalism, i.e. the metaphysical interpretation of his continuant–stage proposal by Hawthorne and Lepore. Kaplan is also willing to accept the type–token distinction as long as one jettisons the form–based criteria for individuating words.

When it comes to the individuation of names, however, Kaplan sticks to his guns, defending both the role of intent and the common currency vs. generic names distinction. It becomes apparent again that the real subject of interest is not words in general, but names as they figure in arguments lobbed against direct reference theory. The question that troubles him is specifically whether he and Hume share a name (conceived of as a word) or have two different ones. Kaplan cares about the individuation of words insofar as it might bear on the puzzles threatening direct reference theory.

As an aside, the response to the response is also the funniest paper in the debate. Kaplan cracks jokes on nearly every page. If you don’t solve the problem, you might at least be funny.

Bromberger’s Contribution

But there is another paper in the debate, one with a less catchy title: “What Are Words? Comments on Kaplan (1990), on Hawthorne and Lepore, and on the Issue”. In this brief paper, Sylvain Bromberger confronts the original contribution by Kaplan as well as the response by Hawthorne and Lepore with the linguistic reality.¹

The key to the paper’s title is at the end: “on the issue”. My impression is that Bromberger is the only one to primarily care about words as units of language. Bromberger is invested in the topic, and in his paper argues that much is missing from the debate (Bromberger 2011: 489–490). For a start, he stresses that words function as constituents of phrases and sentences (Bromberger 2011: 490), something that is oddly absent from the debate (except for identity statements). Names do not exhaust the set of words and words generally have their roles in sentences.

The paper brings in the linguistics that was previously consigned to footnotes, but in Brombergerian fashion, it ends on a sceptical note about our current epistemic status regarding the nature of words. As a result of his scepticism, Bromberger does not even claim to answer the question “what are words”, but instead hands out one of the philosophical consolation prizes (2011: 503):²

But at least we are at a point where we can appreciate with some precision what we know we do not know.

Conclusion

The question of the nature of words is broad and sprawling, and to answer it we have to integrate large amounts of disparate information. We should not expect to do much better than Kaplan, Hawthorne, Lepore, and Bromberger (all serious philosophers and researchers in their own right!) unless we build upon their work and that of others. Although I did not expect an answer, I looked into the debate because I want to keep my eyes on the actual prize, the answer to the question of what words are.

The debate serves as a warning about taking the question too lightly. Kaplan was troubled by a challenge to one of the most influential semantic theories, and thought a quick discussion of the nature of words could help resolve it. But the nature of words, what they are and how words can be individuated, is far too subtle a topic to allow a quick discussion to then be used for other purposes. The danger of ending up with a partial picture (in Kaplan’s case, a picture limited primarily to proper names seen through the lens of semantics) is too great.

While Hawthorne and Lepore are motivated by broader concerns, they set up their response in a rather limiting way (2011: 448):

Our aim in this paper is to further advance an understanding of the nature of words, both by remedying the problems with Kaplan’s account, and also by achieving a suitable perspective on what the metaphysical investigation of word identity can hope to achieve.

Hawthorne and Lepore target the shortcomings of Kaplan’s account and otherwise discuss specifically the metaphysical investigation of the individuation criteria of words. Those are issues that can be debated, but they cover just a tiny fraction of the topic circumscribed by “the nature of words” and largely neglect linguistic or cognitive considerations.

Metaphysical issues are an excellent way to get side–tracked prior to proper engagement with a subject matter. This is not to say that metaphysical questions are nonsensical or that their answers are epistemically inaccessible,³ but it is a warning about the proper place of metaphysics. The greatest hope for addressing the metaphysical issues surrounding the nature of words lies in accumulating sufficient empirical knowledge about their linguistic nature to then bring it to bear on the metaphysical questions. The Kaplan–Hawthorne–Lepore exchange would have been improved by avoiding any of the issues surrounding four–dimensionalism and discussing e.g. compound nouns at greater length. The role of intent is closer to the subject matter, since cognition has some bearing on the nature of words, but the exchange on this issue does not reach very far.

The lesson is that empirical complexities cannot simply be sidestepped by focusing on those areas least accessible to empirical investigation. While it might appear more philosophical to debate four–dimensionalism and the role of intent, that does not make it the right approach to uncover the nature of words. That lesson is in line with Kripke’s insight: The nature of water needed to be uncovered using the relevant sciences, in this case chemistry and physics. Why would words be so different?

References

Bromberger, S. (2011). What Are Words? Comments on Kaplan (1990), on Hawthorne and Lepore, and on the Issue. The Journal of Philosophy, 108(9), 486–503.
Hawthorne, J., & Lepore, E. (2011). On Words. The Journal of Philosophy, 108(9), 447–485.
Kaplan, D. (1990). Words. Proceedings of the Aristotelian Society, Supplementary Volumes, 64, 93–119.
Kaplan, D. (2011). Words on Words. The Journal of Philosophy, 108(9), 504–529.

Footnotes

The initial footnote of Bromberger’s paper makes clear that he never got proper access to Kaplan’s response to Hawthorne and Lepore. ↩
I gather that this might be the “golly value” described by Bromberger in his paper “Rational Ignorance”. ↩
Although that too might be the case. ↩

Daniel Dennett (1942-2024)

Sat, 20 Apr 2024 09:57:13 +0100

Daniel Dennett has passed away.

While my own connection to Dennett was limited, I want to share a few memories, because these moments spent with and around Dennett impressed me greatly.

I met Dennett during my visit at Tufts University in 2017, when I was in the middle of my PhD in philosophy. I don’t believe I had been aware of Dennett’s presence at Tufts when I initially planned the trip, but once I caught wind of it, I had to sit in on one of his courses. Dennett was willing to let me attend his course, on the condition that it wasn’t too overbooked and I didn’t take away a place from a registered student.

It was an undergraduate course on his then just out book “From Bacteria to Bach and Back: The Evolution of Minds”. The crowd thinned out a little as the term progressed, but I stuck around. In addition to Dennett’s own book, I also read Peter Godfrey-Smith’s “Other Minds” for discussions in the class. For a few classes, Dennett was away, engaged in some professional manner, perhaps giving a talk or presenting his book. The weeks he was there were always a highlight. Hopefully, my contributions as a PhD student amongst Bachelor students were not too obnoxious.

Dennett expressed dissatisfaction with the turns academic philosophy had taken: The all–too–common disciplinary navel–gazing lacking any serious engagement with science. The inability of philosophers to imagine possibilities and their insistence that their lack of imagination was a proof of something. Dennett was impatient with some questions of metaphysics, or other intellectual puzzles that philosophers entertain themselves with — chmess! — because he was aware that actual progress can be made in science.

By sticking around, I got to know Dennett a little better. I remember having pizza with him, or rather sitting next to him having pizza with the others, since there was no vegan option. It didn’t matter; I was listening to this great mind and his anecdotes, many of which I was happy to read again in his recent autobiography. His life was truly worth a book and more.

One could just hang out with Dan and the crowd forming around him, and end up having a debate about cognition, or evolution, or AI, or anything else that caught one’s intellectual fancy. Dennett was accessible and made everything around him accessible. Not just philosophy, but science and art as well, the entirety of the intellectual world. One just had to not let oneself be intimidated, be imaginative, and make one’s case. I’m deeply grateful that Dennett was open to such easy engagement and let me be part of it for a few months.

For more memories of Dennett see:

Suggestions for Better AI Criticism

Sun, 18 Feb 2024 14:25:00 +0000

Although I am acutely aware of the shortcomings of the current generation of AI models (transformer models in particular), most of the criticisms of AI I stumble upon on the internet have become repetitive and lack insight. I am not interested in picking on anyone here; I’m interested in reading more interesting criticisms. Therefore, I will provide a list of proposals for better AI criticism.¹

My suggestions pertain to criticism of the current abilities of AI, rather than criticism of their broader social consequences (supposed harm, replacement of human activities etc.). The criticisms I have in mind are of the blog-post length and formality, not academic. The list is neither complete, nor beyond dispute. Hopefully, it serves as a source of sharpening ideas. Here are my suggestions:

In your criticism, be specific about the architecture, or the family of architectures. Not all neural network technologies are transformer models, or even more specifically GPT models. Seek to tie your criticism back to the specifics of the architecture. For example, “as a transformer, the model lacks a bias about the order in which to process a sequence and therefore…”. The more general the target of the criticism, the stronger the argument needs to be. Unless you have an excellent argument, I advise against dismissing neural networks in general.
Seek to distinguish whether the shortcomings are due to architecture, data, training time, or another factor. If you cannot be sure about the source, avoid claims relying on what the source is. Fewer claims are possible when the details of the model are secret, as in the case of most OpenAI models. While this is frustrating, it limits what diagnoses are warranted.
Even when there are good reasons to be critical of the hype surrounding AI technologies, avoid policing emotional reactions. It is fairly unproductive to scorn people for being impressed by what current models can do, not least because by the expectations of two decades ago, the models perform impressively. Instead, provide insight into the models and their limitations, so that people can adjust their reactions according to the reasons provided.
Avoid reductive claims as a standalone form of criticism, i.e. claims that models are “just x”. For example, the claim that language models are just statistical models for predicting the next/masked word is on its own rather uninteresting, unless it is embedded in a larger argument. If you go down the route of the larger argument, consider whether the reductive claim is true due to the model architecture and therefore general, or only due to a specific usage of the architecture. For example, in transformers the tokens can also easily be made to stand for other elements in a sequence than words. They do not have to be just a statistical model for predicting the next word.
When criticising examples created by models, think in terms of distributions. From where in the distribution of model output are the examples taken? Are the samples cherry-picked, that is, are they examples of especially good performance? Then, stricter criteria for their evaluation are warranted. Are they representative of the entire distribution of model output? Then, it is more appropriate to give a sense of the range of the examples. “Out of 5 examples, all showed behaviour x” is a very different statement for a cherry-picked sample of output and a more representative one. Be open about the sampling.
When comparing to human cognitive capacities, provide evidence to support your comparison. Unchecked by evidence, we are dubious judges of our abilities. Asserting that people never make a certain sort of error — “a real person would never fail to see that…” — requires empirical data to support the claim.
Be aware of human tendencies in processing input, in this case the output of AI models, and adjust your criticism to it. We tend to be very generous in our interpretation of text, doing our best to make sense of it. We might be less forgiving with other forms of input (e.g. video). The targeting of your AI criticism should reflect more than our human processing biases.
Moving goal posts can be acceptable, but provide a justification for why the goal posts have to be moved. Often we put the goal posts where we believed that they would capture something deeper: Reasoning and understanding. While we might have been mistaken in that judgement, it needs to be argued why we were mistaken and why the new place for the goal post will do any better. That a model beat a goal post is, on its own, not a reason to move the post.
Stay curious.
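
The suggestion about thinking in terms of distributions can be made concrete with a small simulation. This is only an illustrative sketch: the quality scores are randomly generated stand-ins rather than real model output, and the function names, the sample size of 5, and the 0.9 threshold are assumptions chosen for the example.

```python
import random


def sample_reports(scores, k=5, seed=0):
    """Return (cherry_picked, representative) samples of size k."""
    rng = random.Random(seed)
    cherry_picked = sorted(scores, reverse=True)[:k]  # the k best outputs
    representative = rng.sample(scores, k)            # a uniform random draw
    return cherry_picked, representative


def pass_rate(sample, threshold):
    """Fraction of a sample whose score reaches the threshold."""
    return sum(score >= threshold for score in sample) / len(sample)


# Hypothetical quality scores for 1,000 model outputs (made up, not real data).
score_rng = random.Random(42)
scores = [score_rng.random() for _ in range(1000)]

cherry, rep = sample_reports(scores, k=5)
print("cherry-picked pass rate: ", pass_rate(cherry, 0.9))
print("representative pass rate:", pass_rate(rep, 0.9))
print("population pass rate:    ", pass_rate(scores, 0.9))
```

The same "k out of 5" report means something quite different depending on which of the two samples it describes, which is why being open about the sampling matters.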

Footnotes

Better in terms of being intellectually enlightening and pushing forward science. ↩

Compositionality and Word Meaning

Tue, 06 Feb 2024 12:25:00 +0000

Transformer models do not learn compositionality. That is, they do not acquire the ability to construct hierarchical structures from smaller units by repeatedly applying the same rules.¹

I speculated about this a while ago, in this post. More importantly, research has shown that while transformer models perform better on compositionality tasks than previous model types, they still cannot consistently solve them (see also this post). A more recent paper investigating the OpenAI GPT models, which are larger and more sophisticated than most transformer models, has again found that these models fail to learn to act in accordance with the principle of compositionality.

What does this limitation mean for whether transformers can capture word meaning?

The meaning of a word is closely tied to what it can contribute to the meaning of the compositional whole. The degree of dependence might vary. For example, the meaning of a name such as “Tom” might depend little on compositionality. For other words such as privatives, e.g. “fake”, it is hard to see how to understand their meaning in other ways than as their contribution to the compositional whole. It is important that FAKE in “They paid with fake money, taking the painting with them.” has MONEY within its scope. Thus, the argument goes that transformers must in principle be deficient in lexical semantics due to their current inability to learn compositionality.

Another line of argument, however, goes as follows: Transformers compensate for their lack of compositional abilities by excessively attending to the nuances of lexical meaning. For example, a transformer might pick up that whatever money refers to, it is the kind of thing that is often faked. While we are able to resort to compositional rules to make out the meaning of a sentence, the model has to resort to what information it can glean about the statistics of the words making up their sentences. Of course, transformer models do not operate on mere bags of words: in all standard versions they have access to positional information and appear able to infer grammatical relations. But the relations between words might serve as the crucial crutch.

Transformers are deficient in lexical semantics and transformers are hyper-attentive to lexical semantics. This statement is no contradiction, because word meaning has multiple aspects and can be processed in multiple ways. Investigating lexical semantics in transformers requires awareness of their shortcomings with regard to compositionality as well as the compensation mechanisms they might employ. Given all the challenges of exploring how transformers treat word meaning (the problem of sub-word tokenization, accounting for contextualisation, etc.), this is no trifle.

Footnotes

The definitions of compositionality differ in detail. Within linguistics, one version goes as follows: The meaning of the whole is a function of the meaning of its parts (as structured by syntax). This is a good guiding gloss for linguistics; many papers investigating transformers, however, use something close to what I suggest above. ↩

Book Publication: Preference Change

Thu, 18 Jan 2024 07:25:00 +0000

Now available, an open-access introduction to preference change!

When I got into the topic of preference change, a few years back by now, such an introduction was sorely lacking. I hope many readers find it of value. Much research on preference change remains to be done and I hope the readers can help with that!

Michael Messerli and I have co-authored the book, which has been published by Cambridge University Press in the Elements Series for Decision Theory and Philosophy. I greatly enjoyed working with Michael and everyone else involved in the process of getting this book together. Thank you!

Generative Senses: A Prolog Exercise

Wed, 20 Dec 2023 09:25:00 +0000

Prolog is not the tool of choice for most of NLP nowadays, but this was not always the case. The unreasonable effectiveness of neural networks for most practical NLP tasks has led to this shift, since implementing neural networks in Prolog is rather awkward. But some ideas and theories from previous decades are still interesting, for theoretical exploration if not practical application, and they are often well-implemented using Prolog. Along these lines, I have implemented some core operations from Bradley Franks’ 1995 paper “Sense Generation”. You can find the code on GitHub.

This is an intriguing paper, and while it is outdated in some respects, it also prefigures some more recent theories. It suggests a decompositional, quasi-classical approach to concepts. That is, it proposes that the meaning of words can be split into symbolic components resembling definitions. A full defense of this position would require more than a single paper, so Franks considers only one particularly challenging case: Privatives, words such as “fake” or “false” that can radically change what a word means. A fake gun is no gun at all. Some general adjectives can also have a privative effect under the right circumstances: a rubber duck is no duck and a stone lion is no lion. Franks’ paper accounts for such effects by representing concepts using attribute-value structures (AVS). Unusually, the main AVS for a lexical entry is split into two sub-AVS:

The central AVS which includes features of the conceptual core.
The diagnostic AVS which includes features used to identify objects falling under the concept.

The operations in privative cases change these AVS. For example, in the case of “fake gun” the conceptual core features of the concept of GUN become negated. In the case of “stone lion”, the operations are more complex because the features for STONE need to be appropriately combined with those of LION. A stone lion has four legs (a diagnostic feature according to Franks), but it is not an organic being or a lion at all (conceptual core features for LION). The operations needed for such combinations are implemented and tested in my Prolog code for three of Franks’ examples: “fake gun”, “stone lion”, and “wild lion”.¹ The paper has much more content, but these operations are at the heart of it.
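
For readers who do not use Prolog, the simplest of these operations can be sketched in Python. This is a loose illustration of the idea only, not Franks’ formal definitions and not a translation of the Prolog code: the `Concept` class, the `negate_central` function, and the feature names (`shoots`, `has_barrel`, `has_trigger`) are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    """A lexical entry with two attribute-value structures (AVS)."""
    name: str
    central: dict     # conceptual-core features mapped to truth values
    diagnostic: dict  # features used to identify instances


def negate_central(concept, modifier):
    """Privative combination: negate the core features, keep the diagnostic ones."""
    return Concept(
        name=f"{modifier} {concept.name}",
        central={feature: not value for feature, value in concept.central.items()},
        diagnostic=dict(concept.diagnostic),
    )


# Hypothetical entry for GUN; the features are invented for illustration.
GUN = Concept(
    name="gun",
    central={"shoots": True},
    diagnostic={"has_barrel": True, "has_trigger": True},
)

fake_gun = negate_central(GUN, "fake")
print(fake_gun)
```

The sketch covers only the “fake gun” pattern, where the core is negated and the diagnostic features survive. A case like “stone lion” needs a genuine combination of two entries, which this simplification does not attempt.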

While the theory is from 1995, the Prolog code is more modern. I used the opportunity to try out a number of recent Prolog innovations to make the code simpler and logically purer. Cuts, a Prolog feature that would most certainly have been used in a 1995 implementation, are completely avoided. Franks’ theory invited the use of reification, that is the explicit representation of truth-values, which together with the reif library made the reasoning much easier. It was a great opportunity to showcase some of modern Prolog’s potential.

My code was tested on Scryer-Prolog, but should work with minor changes on other implementations as long as they have versions of the libraries I used. If you want to try it out and have any problems, feel free to message me about it.

Footnotes

The last one, of course, is not a privative case, since wild lions are lions, but it can serve as a test-case anyway. ↩

Compositionality and Transformers: A Paper

Tue, 29 Aug 2023 12:25:00 +0100

Compositionality is one of the long-standing challenges to neural NLP. I am myself a bit sceptical that transformers really offer the kind of compositional processing found in human language processing. But even formulating the challenge can be a challenge.

In its formulation by Partee (1995), the principle of compositionality states:

The meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined.

NLP researchers have investigated for a while whether neural network models respect the presumed compositionality of language, but even with Partee’s principle of compositionality in hand, it is not clear what “respecting compositionality” would mean. Is it sufficient if a neural model can process sentences the meaning of which is compositional or are there further restrictions on how to process them?
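
To see what the principle demands in the simplest case, consider a toy interpreter in which the meaning of a phrase is computed only from the meanings of its parts and the rule that combines them. This is a minimal sketch under strong assumptions: the domain of entities, the `LEXICON`, and the single `modify` rule (intersective modification) are invented for illustration and are not taken from Partee.

```python
# Hypothetical lexicon: each word denotes a set of entities in a toy domain.
LEXICON = {
    "cat": {"tom", "felix"},
    "bird": {"tweety"},
    "black": {"felix", "tweety"},
    "red": {"tom"},
}


def meaning(tree):
    """Compute the meaning of a syntax tree from the meanings of its parts."""
    if isinstance(tree, str):   # leaf: look the word up in the lexicon
        return LEXICON[tree]
    rule, left, right = tree    # branch: (rule, left subtree, right subtree)
    if rule == "modify":        # intersective adjective + noun
        return meaning(left) & meaning(right)
    raise ValueError(f"unknown rule: {rule}")


black_cat = meaning(("modify", "black", "cat"))
red_cat = meaning(("modify", "red", "cat"))
print(black_cat, red_cat)
```

A system like this is compositional by construction. The open question for neural models is whether their behaviour has to mirror such a procedure or merely has to agree with its outputs.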

A paper that explicitly addresses such questions has been put forward by Hupkes, Dankers, Mul, and Bruni (2020, e.g. page 759). Their paper, entitled “Compositionality Decomposed: How do Neural Networks Generalise?”, split compositionality into five task descriptions¹ and tested sequence-to-sequence models on them.

The five task descriptions are:

Systematicity: The ability to process one sentence guarantees the ability to process a compositionally related sentence. Anyone who understands “The black cat hunts the red bird” will also understand “The red cat hunts the black bird”.
Productivity: From finite semantic components, arbitrarily long semantic wholes can be formed. We can form and understand the phrase: “The cat hunts the bird, which ate the worm, which crawled through the earth, which…”.
Substitutivity: Replacing a semantic component with a synonymous phrase should not affect the overall meaning. For example, it should make no difference to replace “the black cat” with “the cat, which was black”.
Localism: The semantic function only depends on the syntactically local constituents. The meaning of “the red bird” does not differ depending on whether it occurs in the sentence “The black cat hunts the red bird” or “The red bird caught the worm”.
Overgeneralisation: Faced with a compositional function with some exceptions, at first the function will be wrongly applied even in cases where an exception occurs. For instance, a child learning English might use the standard derivation of the past tense for the verb “run”, arriving at “runned” instead of “ran”.

All of these task descriptions can be questioned when it comes to natural language, as the authors of the paper are aware. Using one of their examples, localism appears to be violated when global context is required for disambiguating words. To understand what it means for the bat to fly right into my face, it is important to know whether the event occurred in a cave or on a baseball field. Context beyond the sentence might disambiguate the meaning. But if you disagree with the inclusion of a specific task description, that is no problem for the paper, since you can just ignore those results. If you disagree with all of them, I struggle to see what is left of compositionality.

This is an excellent paper that avoided many of the pitfalls of previous research. It is certainly hard to look at the results (transformers performing reasonably well on substitutivity and systematicity) and think that neural networks are hopeless when it comes to compositionality, although they are obviously not perfect either. The transformer models seem specifically to struggle with productivity and localism, looking at the numbers in Table 1 on page 774. The limits of productivity especially suggest that the model has been unable to properly derive a rule it can arbitrarily apply. The transformer models only reach 50% accuracy on sequences longer than they have encountered before. That is in line with my scepticism about the compositional abilities of transformers.

But I am also worried about the external validity of the positive results, that is, whether we can infer from the strong showing of transformers in some of the experiments, e.g. the one investigating substitutivity, that they also have the tested-for skills in the case of natural language. My worries are due to the artificial language employed by Hupkes et al. The language is purely instructional. It describes operations on strings of characters that return strings of characters. For example, an operation might be to reverse a string or to repeat it. The neural models are supposed to apply such functions, the end result always being new strings.
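To make the flavour of such an instructional language concrete, here is a minimal sketch in Python. The operation names and the prefix syntax are my own assumptions for illustration; they are not the grammar used by Hupkes et al. Each instruction consists of operation names followed by the input string, and the interpreter returns a new string.

```python
# Toy string-edit operations. The names are illustrative assumptions,
# not the actual grammar of Hupkes et al.
OPERATIONS = {
    "reverse": lambda seq: seq[::-1],
    "repeat": lambda seq: seq + seq,
    "swap_first_two": lambda seq: seq[1::-1] + seq[2:] if len(seq) >= 2 else seq,
}


def interpret(instruction):
    """Apply the leading operations of an instruction to its input string."""
    tokens = instruction.split()
    # Leading tokens that name operations; the remaining tokens are the input.
    n_ops = 0
    while n_ops < len(tokens) and tokens[n_ops] in OPERATIONS:
        n_ops += 1
    operations, sequence = tokens[:n_ops], tokens[n_ops:]
    # The innermost (rightmost) operation is applied first.
    for name in reversed(operations):
        sequence = OPERATIONS[name](sequence)
    return " ".join(sequence)


print(interpret("repeat reverse A B C"))
print(interpret("swap_first_two swap_first_two A B C"))
```

The meaning of every instruction in this toy language is just an output string. Nothing in it can state a generalisation about all strings, which is the limitation I turn to next.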

The language lacks variables, quantifiers, and negation. As a consequence, the semantic functions are quite different from most of natural language. One cannot even express a thought like “If you switch the first two elements of a string, and then switch the first two elements of the resulting string, you arrive at the original string”. The analysis of this proposition requires quantification and variables, which are not available in the language used by Hupkes et al.

But do these shortcomings matter? As formulated above, the principle of compositionality requires that the meaning of the whole is a function of the meaning of the parts. It does not specify the functions. But surely the function matters! Otherwise the function which takes any part and returns the value True would be acceptable. Surely, every neural network can learn such a function, but we wouldn’t call the networks compositional on that basis.

The functions chosen by Hupkes et al. are more interesting. They are not trivial, since different sentences and components map to different string outputs. But they are less complex than the functions required for analysing natural language. Composition turns into a different beast once quantification comes into the picture. Variables need to be dealt with. In natural language, variables of components might be free and therefore have no fixed meaning in the absence of an assignment. Consider a sentence as simple as “A cat hunts a bird”. The analysis of the component “hunts a bird” would include a free variable for the missing agent. Only when the noun phrase “A cat” is added would this variable be bound.² There is nothing equivalent to such free variables in the Hupkes et al. paper. The incompleteness that pervades the composition of meaning in natural language is absent. Sceptics of neural models have, as a result, an easy time discounting the positive results the paper suggests.

Such problems would not arise if one trained the network on first-order logic with a model-theoretic semantics.³ Why not ask the neural network about the truth of a sentence such as:⁴

\[ \forall s, t \in \text{Strings} (\text{switch_1st_and_2nd_character}(s, t) \leftrightarrow \text{switch_1st_and_2nd_character}(t, s) ) \]

Such formal languages are well-understood and training examples can be automatically generated. They can also be decomposed and the components would then include free variables.
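As a rough illustration of how such a sentence could be evaluated automatically, the following Python sketch checks the formula above by brute force over a small finite domain. The alphabet, the length bound, and the helper names are assumptions made for this example. The sketch also assumes that strings shorter than two characters never stand in the relation, which is only one of several ways to handle that case.

```python
from itertools import product

ALPHABET = "ab"  # assumed toy alphabet


def switched(s, t):
    """True if t results from switching the first two characters of s."""
    return len(s) >= 2 and t == s[1] + s[0] + s[2:]


def all_strings(max_length):
    """Yield every string over ALPHABET up to the given length."""
    for length in range(max_length + 1):
        for chars in product(ALPHABET, repeat=length):
            yield "".join(chars)


def formula_holds(max_length=3):
    """Check the biconditional for every pair (s, t) in the finite domain."""
    domain = list(all_strings(max_length))
    return all(switched(s, t) == switched(t, s) for s in domain for t in domain)


print(formula_holds())
```

The same enumeration could, in principle, be used to produce labelled training pairs of formulas and truth values, though a real experiment would need a far richer set of formulas than this single case.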

I assume that the challenge of interpretation-dependence led Hupkes et al. to avoid such a solution. In model-theoretic semantics the semantic value of a sentence is relative to an interpretation, that is a description of the states of the world that make the sentence either true or false. Accordingly, if the output is supposed to be the meaning of a sentence, this would require the specification of an interpretation, either explicitly or implicitly through training examples. On the explicit approach, one has to feed an entire model as input into the neural network. On the implicit approach, the network has to learn the interpretation from the inputs it has received. Both approaches are more challenging than the one pursued by Hupkes et al.⁵

This challenge, however, can be met. In fact, research I have conducted with Simon Wimmer has provided models with artificial sentences and simple serialisations of situations that make the sentence either true or false (see Strohmaier & Wimmer forthcoming). Our focus wasn’t on compositional semantics, as we were interested only in the semantic function for attitude ascriptions, and therefore did not include free variables either, but we were able to train a transformer-encoder on something closer to model-theoretic semantics.

There have been some experiments investigating entailment using logical formulas (Evans et al. 2018), but for different logical systems. The COGS dataset (Kim & Linzen 2020), which aims to evaluate compositional abilities using the task of mapping natural language to logical form, is also noteworthy in this context. But as far as I can tell, no one has tested for the task descriptions proposed by Hupkes et al. using first-order logic formulas.

Assuming I haven’t missed something — if I have please email me! — there is room for a further empirical test of the compositional abilities and shortcomings of neural networks. Since I am busy with other research projects at the moment, I haven’t yet further explored this space, but I might come around to it (and if you are interested and able to cooperate on this, send me an email).

References

Evans, R., Saxton, D., Amos, D., Kohli, P., & Grefenstette, E. (2018). Can Neural Networks Understand Logical Entailment? (arXiv:1802.08535). arXiv.
Hupkes, D., Dankers, V., Mul, M., & Bruni, E. (2020). Compositionality Decomposed: How do Neural Networks Generalise? Journal of Artificial Intelligence Research, 67, 757–795.
Kim, N., & Linzen, T. (2020). COGS: A Compositional Generalization Challenge Based on Semantic Interpretation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9087–9105.
Partee, B. H. (1995). Lexical semantics and compositionality. In Language: An invitation to cognitive science, Vol. 1, 2nd ed (pp. 311–360). The MIT Press.
Strohmaier, D., & Wimmer, S. (forthcoming). Contrafactives and Learnability: An Experiment with Propositional Constants. Post-Proceedings of Logic and Engineering of Natural Language Semantics 19.

Footnotes

I chose the name “task description” over “tasks”, since some of these descriptions concern how a task is fulfilled rather than the nature of the task itself. For example, overgeneralisation describes the trajectory of learning a task. ↩
Linguists typically use the lambda-operator to enable this semantic composition and have the variables bound. ↩
Even if neural networks can learn the semantics of first-order logic, further challenges would remain. Natural language exceeds first-order logic, for example because it involves quantification over predicates. But these further challenges might be less specific to the compositional abilities of neural networks. ↩
How would this formula evaluate when the string consists of only a single character? One could return a presupposition-failure in this case. ↩
A model would not be sufficient to determine the semantic value of phrases including free variables, which would occur in an experiment following exactly the lines of Hupkes et al. One could deal with these free variables by providing assignments, or by allowing the model to return an indeterminate value. ↩

LLMs and Human Cognition: Shifting Arguments, Same Assumptions

Sat, 12 Aug 2023 08:25:00 +0100

Large transformer-based language models (LLMs) are performing well on a variety of tasks. This is a reason to reconsider our understanding of language and how humans process it. Those sceptical of neural network approaches, such as Noam Chomsky and his followers, have come under particular pressure. They need to justify their scepticism about the abilities of neural networks in the light of apparent counter-evidence. Many justifications are available, a classic one being that LLMs need much more data than humans, but in this post I’ll discuss an argument by Chomsky I hadn’t heard until recently. I will follow my own progress of thought, which led from an initial reaction of surprise at Chomsky’s argument to the dawning realisation that it showed less of a change than I had at first suspected.

A few weeks ago I listened to a Tyler Cowen interview with Chomsky, in which the latter made the following argument (see the transcript):

[An LLM] does exactly as well with impossible systems as with languages. Therefore, in principle, it’s telling you nothing about language.

The argument seems to be that LLMs are not only able to learn actual languages,¹ but also other systems of symbols that no human could learn.² Therefore, LLMs diverge so much from human cognition as to be uninformative.

I found that argument surprising, and anyone who was arguing on Chomsky’s side in the 80s and 90s would have found it surprising back then as well. In that heyday of the battle between classical cognitive science and connectionism, philosophers/cognitive scientists like Fodor and Pylyshyn (1988) argued that connectionist models cannot, in principle, learn language (because their representations lack structure). Back then, that was the position broadly aligned with Chomsky. Today, after the rise of LLMs, Chomsky’s argument claims that neural models are not informative because they can learn roughly every pattern of symbols for which we have sufficient data, not just human language.

It’s easy to look at this argumentative shift and think it shows that the critics of neural networks are grasping at straws. At first glance, they appear forced to completely reverse their original position in response to the rise of LLMs. First they told you that neural networks couldn’t learn enough, and now they are telling you that they learn too much. But that interpretation is making it a little too easy. There is a consistent set of ideas underlying these superficially opposed arguments. These ideas include:

Language acquisition provides core insights about language cognition: The processing of language is tied to how one can acquire language
Innate skills are a core part of language acquisition: Humans do not learn language from scratch, but starting with innate skills that make human-like language cognition possible in the first place

Neither of these points is trivial. Consider the first point, that language learning tells you something about language cognition after the learning has (largely) stopped. A person other than Chomsky might easily take the position that these processes can be treated separately. They might concede that LLMs learn language in a very different way compared to humans and therefore throw little light on language acquisition. At the same time, they might assert that, once the LLMs are trained, they process language in a rather similar way to humans. In other words, the model might capture language processing without capturing language learning/acquisition. For someone with this view, it would be of little interest that LLMs can also learn languages humans cannot learn. As long as they process actual languages the same, why worry that LLMs are also able to process other sequence patterns, if trained on those patterns?

Chomsky’s position rules such a stance out, because the core skill of language cognition is one of hypothesis-driven explanation. In their New York Times opinion piece Chomsky et al. write:

Human-style thought is based on possible explanations and error correction, a process that gradually limits what possibilities can be rationally considered.

According to Chomsky et al., we are seeking to explain and testing conjectured explanations against limited input. Both in acquisition and processing human thought is supposedly based on this core skill.³

Hypothesis-driven rationalist inference is contrasted with mere association resulting from statistical inference. Statistical methods might help in the evaluation of a hypothesis, but the inference process is primarily turning around the symbolic hypotheses themselves, not the statistics. This matters, in Chomsky’s view, because symbolic hypotheses can rule out options, rather than make them merely unlikely, as statistical inference does.⁴

On this picture, language acquisition and processing rely on hypothesis-driven cognition. That is a view Chomsky has long held, probably for the majority of a century by now. This position is clearly continuous with the critiques of the 80s & 90s. In their influential paper from this period, Fodor and Pylyshyn (1988) ended on a closely related note:

There is an alternative to the Empiricist idea that all learning consists of a kind of statistical inference, realized by adjusting parameters; it’s the Rationalist idea that some learning is a kind of theory construction, effected by framing hypotheses and evaluating them against evidence. We seem to remember having been through this argument before. We find ourselves with a gnawing sense of deja vu.

A deja vu indeed! 35 years later Chomsky and his collaborators make the same point again: Symbolic hypotheses are conjectured and then tested against limited evidence. Apparently, the point never gets too old to bear repeating. On its own, however, the point has not had its intended force because the defenders of neural network approaches keep wondering

whether symbolic hypothesis formation and testing really is at the core of both language acquisition and processing, and
whether neural networks cannot implement a form of this process after all?

In light of the new evidence of LLM performance, it might be sensible to review these two underlying hypotheses. Chomsky does not touch upon them in the interview with Cowen; he presumes them. His reaction to LLMs does not go so far as to question these underlying assumptions.

To put my cards on the table, the performance of LLMs (and their correlation with cognitive measures, see this post and that post) has led me to believe that less of language processing relies on the kind of processes that Chomsky assumes to be the core and more on processes implemented by LLMs. Hard-coded rules or hypothesis-testing-derived rules drive fewer cognitive sub-processes and statistical matching drives more. I have come to doubt the scope of hypothesis 1. This update does not force me to accept that LLMs have sparks of AGI or implement much of human reasoning skills. It nonetheless has LLMs partially converging with human language processing.

Listening to the Cowen interview, I was at first struck by how different Chomsky’s rationalist argument had become, only to realise I had been mistaken. If there is a problem with Chomsky’s argument, it is not so much that he has changed his tune. The problem is that the argument continues to rest on the same core assumptions and he hasn’t conceded an inch. Pressed to discuss LLMs, Chomsky does not even discuss these assumptions.

References

Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1), 3–71.

Footnotes

If you want to be specific and use the vocabulary of Chomsky, the models would learn the statistics of external or E-language, not anything about internal or I-language. ↩
LLMs probably cannot learn to predict just any language. In fact, we know that standard Transformers have theoretical limits on learning certain languages. It is not relevant for the rest of the post, however, so I’ll gloss over it. ↩
The New York Times opinion runs together whether something is impossible to learn or whether something can be learned to be impossible. It might be better to keep them distinct. If there are reasons to run them together, they are not obvious from the opinion piece. ↩
This might be granting too much to Chomsky. Why can statistical processes not rule anything out? Why can they not represent with sufficient modal force? It’s not as if you cannot force a neural network to give you an output of 0 or 1 for a label that indicates rule-conformance. This counterexample presumably misses the point, but I have trouble understanding the point without making a lot of controversial assumptions that I see no reason to grant. ↩

Transformers Converging with Cognition: More Papers

Fri, 30 Jun 2023 09:05:00 +0100

A while ago, I wrote up a number of papers (see this post), all of which suggested that transformer models have partially converged with human language cognition. Using various correlational measures and predictions the literature leads towards the conclusion that transformers and human language processing resemble each other.

The rate of publishing in this field being what it is, new papers have come out or have come to my attention:

Goldstein et al.: Thinking ahead: spontaneous prediction in context as a keystone of language in humans and machines
Heilbron et al.: A hierarchy of linguistic predictions during natural language comprehension
Lyu et al.: Finding structure during incremental speech comprehension
Kumar et al.: Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model
Tuckute et al.: Driving and suppressing the human language network using large language models
Arana et al.: Deep learning models to study sentence comprehension in the human brain

So far, the overall picture I derive from these papers is unchanged: Transformer-based language models exhibit a considerable convergence with measurements of human language cognition. Many of the papers underline this result. For example, Kumar et al. find convergence not just for the contextualised embeddings of transformer models, but also the weights of the attention heads. The convergence stretches to another aspect of transformers. While it remains a partial convergence, the finding is increasingly robust.

The interpretation of the convergence is still a matter of discussion. Both Goldstein et al. and Heilbron et al. provide further evidence for the importance of prediction, a theme that was also strong in the paper by Schrimpf et al., which I discussed in my previous post. It seems increasingly clear that the human brain engages in a predictive process when processing language. Language modelling, although not necessarily in the exact forms (MLM, CLM etc.) used to pre-train transformers, has been vindicated as a cognitive task.

That both the brain and transformers predict upcoming words and/or linguistic features cannot be the whole story, however. After all, language models based on LSTMs or other RNN models also engage in such predictions, but have been found to show less (though some) convergence with cognitive measurements. What is it specifically about transformers that leads to the convergence? And to repeat an insight gleaned from papers in the previous post: It cannot be just the number of parameters.¹

What, then, is it about transformer models that explains the convergence with human language processing? The best answer I found in this new set of papers is that contextualisation matters. The Goldstein et al. paper provides evidence in that direction, comparing standard contextualised GPT-2 embeddings with de-contextualised GPT-2 embeddings and GloVe embeddings.² The standard GPT-2 embeddings perform best. But this does not answer all our questions: Why does contextualisation help? Is it because it addresses issues such as polysemy and homonymy? Or do transformers even partially address such issues as compositionality? (On the latter see this post by me.)

So far, the convergence finding has held up in the literature. When it comes to interpreting the convergence, however, research is only inching forward with many questions left open. Both sides of the convergence are opaque, hence finding the convergence itself can only be an initial finding, albeit an extremely exciting one!

Footnotes

One paper showing this is the one by Merkx and Frank. ↩
The comparison sadly does not include RNN models, which also provide a form of contextualisation. ↩

Groningen Cognitive Modeling Spring School

Sun, 23 Apr 2023 08:45:00 +0100

I recently had the pleasure to attend the Groningen Cognitive Modeling Spring School. This spring school is an annual event, but I’ve only recently heard of it and applied soon after. I’ve been interested in the interpretation of neural network models as cognitive models for a while, and so it was time for deeper engagement with the dedicated cognitive modelling research.

The spring school had different tracks for three cognitive modelling frameworks: ACT-R, PRIMs, and Nengo.

ACT-R and PRIMs are both classic cognitive frameworks. Writing in ACT-R or PRIMs is similar to writing in a programming language,¹ but the basic constructs are based on a theory about the human cognitive architecture. Symbolic processing is a key part of these frameworks.

I chose the third track and learned a decent chunk of Nengo, taught by Terrence Stewart. In contrast to the two other frameworks, Nengo is a neural network framework. One specifies a neural network in Python, provides it with some input, and lets the neurons compute, but it is not just another competitor to PyTorch or TensorFlow. Nengo’s focus is on neuro-biologically plausible models of neural networks. The neurons are spiking neurons, backpropagation is discouraged, and the time dimension matters. This is not your standard Deep Learning framework.
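To give a sense of what “spiking” and “the time dimension matters” mean here, the following is a minimal sketch of a single leaky integrate-and-fire neuron in plain Python. It is deliberately not Nengo code and does not use Nengo’s API. The function name, the parameter values, and the simple Euler update are assumptions chosen for illustration.

```python
def simulate_lif(current, duration=1.0, dt=0.001, tau_rc=0.02, tau_ref=0.002):
    """Return the spike times (in seconds) of one leaky integrate-and-fire neuron.

    The neuron receives a constant input current, its voltage leaks towards
    that current, and it emits a spike whenever the voltage reaches 1.0.
    """
    voltage = 0.0
    refractory_left = 0.0
    spike_times = []
    for step in range(int(round(duration / dt))):
        if refractory_left > 0.0:
            # The neuron is silent for a short period after each spike.
            refractory_left -= dt
            continue
        # Euler step of dv/dt = (current - v) / tau_rc
        voltage += dt * (current - voltage) / tau_rc
        if voltage >= 1.0:
            spike_times.append(step * dt)
            voltage = 0.0
            refractory_left = tau_ref
    return spike_times


for current in (0.5, 1.5, 3.0):
    print(current, len(simulate_lif(current)))
```

The output of such a neuron is a train of spikes unfolding over simulated time, and the information is carried by when and how often it fires. A standard deep learning unit, by contrast, returns a single activation value per input with no notion of elapsed time.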

Learning Nengo perfectly suited my goal to better understand how far the distance between the standard NLP neural models and cognitive models is. A look at the literature suggests that Deep Learning models have started to partially converge with human cognitive processes (see my post on this matter), but they remain biologically and cognitively implausible in many respects. For example, a standard Transformer model does not take relevantly longer to process a complex sentence than a simple one,² while humans certainly do. Time matters in Nengo in ways that I never had to bother with in PyTorch.

The different nature of Nengo was particularly striking to me due to my background. While I was not the only computer scientist at the spring school, as far as I could tell I was the only one coming from NLP and heavily working with Transformer models. The disciplines of NLP and cognitive modelling have grown apart, despite their shared roots and some valiant research efforts to the contrary (especially the Workshop on Cognitive Modeling and Computational Linguistics). The research I am working on aims to bridge this gap. Working with Nengo made the gap more apparent, and hopefully Nengo will be one tool to overcome it.

Nengo Resources

If you want to learn more about Nengo, you can follow these links:

The core algorithm of the Neural Engineering Framework underlying Nengo
Technical overview of Nengo

I’ve also been pointed towards the book “How to Build a Brain” by Chris Eliasmith, who is behind much of the Neural Engineering Framework. So far I haven’t had the time to look at it myself.

References

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, L. (2023, January 23). Universal Transformers. International Conference on Learning Representations.

Footnotes

In fact, ACT-R is heavily LISP-based, showing in what era of cognitive science it emerged. ↩
Of course, there are non-standard Transformer models for which this claim is not true, e.g. when one introduces dynamic halting as in the model of Dehghani et al. 2023. ↩

Transformer Models Do Not Just Learn Surface Statistics

Mon, 13 Mar 2023 15:08:00 +0000

A common criticism of Transformer models, such as ChatGPT, BERT, and Bard, is that they only learn surface statistics. According to this criticism, the predictions by transformers are superficial, because they do not represent the underlying state. In the case of language, the models would only capture general co-occurrence, on which transformer LLMs are typically trained, but neither the underlying hierarchical nature of language nor anything about the states of the world.

Evidence by now strongly suggests that this absolute criticism is wrong. In the following, I list the papers providing the evidence:

Board games

Transformer models learn states of board games (Chess, Othello) when modelling sequences. This evidence is very convincing in showing that Transformer models are in principle able to recover more than surface statistics.

Toshniwal et al. 2022
Li et al. 2023

The paper by Li et al. is especially convincing, since they test the role of the state representation using interventions.

Hierarchical Syntax

The states of Transformer language models reflect syntax, including a hierarchical structure which is not obvious from the surface of language strings:

Lin et al. 2019
Tenney et al. 2019
Rogers et al. 2020: 843-844

Layer-Wise Operations

Some layer-wise operations in Transformer models appear to reflect human-interpretable concepts. That these operations at least appear associated with meaningful concepts suggests that they do not just recover meaningless surface statistics:

Geva et al. 2022

(This piece of evidence is perhaps more preliminary than the others.)

Correlations with Psychometric Data

Transformer language models appear to have some correlation with psychometric data, including human brain states. Presumably human cognition reflects an underlying world state when processing language:

Wilcox et al. 2020
Merkx & Frank 2021
Michaelov et al. 2021
Oh et al. 2021
Schrimpf et al. 2021
Caucheteux et al. 2022
Caucheteux & King 2022

Conclusion

The evidence presented in these papers supports a role for representation of an underlying state. At this point, I consider the statement “Transformer models only learn surface statistics” to be probably wrong. (My subjective credence that they learn something about the underlying states is around 90%.)

I have not presented here evidence concerning the shortcomings of Transformer models. Such shortcomings exist. Specifically, the evidence I have pointed towards does not rule out that Transformer models are overreliant on surface statistics (for such a suggestion, see also Rogers et al. 2020: 843-844) and fail to model some aspects of the underlying state. The presented evidence also does not show that Transformer models fully capture compositionality, which I personally doubt they do, or that they can fully grasp meaning in the absence of non-textual data.

References

Caucheteux, C., Gramfort, A., & King, J.-R. (2022). Deep language algorithms predict semantic comprehension from brain activity. Scientific Reports, 12(1), Article 1.
Caucheteux, C., & King, J.-R. (2022). Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1), Article 1.
Geva, M., Caciularu, A., Wang, K., & Goldberg, Y. (2022). Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 30–45.
Merkx, D., & Frank, S. L. (2021). Human Sentence Processing: Recurrence or Attention? Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, 12–22.
Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (arXiv:2210.13382). arXiv.
Lin, Y., Tan, Y. C., & Frank, R. (2019). Open Sesame: Getting inside BERT’s Linguistic Knowledge. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 241–253. https://doi.org/10.18653/v1/W19-4825
Michaelov, J. A., Bardolph, M. D., Coulson, S., & Bergen, B. K. (2021). Different kinds of cognitive plausibility: Why are transformers better than RNNs at predicting N400 amplitude? ArXiv:2107.09648 [Cs].
Oh, B.-D., Clark, C., & Schuler, W. (2021). Surprisal Estimators for Human Reading Times Need Character Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 3746–3757.
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8, 842–866.
Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., Tenenbaum, J. B., & Fedorenko, E. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45), e2105646118.
Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R. T., Kim, N., Durme, B. V., Bowman, S. R., Das, D., & Pavlick, E. (2019). What do you learn from context? Probing for sentence structure in contextualized word representations. International Conference on Learning Representations. https://openreview.net/forum?id=SJzSgnRcKX
Toshniwal, S., Wiseman, S., Livescu, K., & Gimpel, K. (2022). Chess as a Testbed for Language Model State Tracking (arXiv:2102.13249). arXiv.
Wilcox, E. G., Gauthier, J., Hu, J., Qian, P., & Levy, R. (2020). On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior (arXiv:2006.01912). arXiv.

Are Transformer LLMs Minds?

Tue, 28 Feb 2023 12:38:00 +0000

Transformer LLMs, such as ChatGPT, BERT, and Bard, have sufficiently impressed the public that some have described them as AI minds. But is this ascription of a mind justifiable?¹

Betteridge’s law of headlines states:

Any headline that ends in a question mark can be answered by the word no.

I believe this law fails for the present blog post. The correct answer to whether Transformer LLMs are minds is: It is complicated, but it is typically not useful to describe them as minds.

I will make my case by asking and addressing more specific questions:

What are Transformer LLMs?
Do we have a widely accepted set of necessary and sufficient conditions for ascribing a mind?
Is there a somewhat plausible cognitive science theory of mental states (e.g. beliefs) that would describe transformer LLMs?
What kind of concept is the concept MIND?

I will address each of these questions in turn.

What are Transformer LLMs?

The answer to this question is: Transformer LLMs are large language models using the Transformer deep learning architecture.

A language model is a model predicting the occurrence of words (or other linguistic units, such as characters) given a context. Generally, such models excel at predicting what the next element in a linguistic sequence might be. A large language model is simply a language model that has a large number of parameters and has been trained on large amounts of data. For example, GPT-3 had about 175 billion parameters.
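To make the idea of predicting the next element concrete, here is a minimal sketch of a count-based bigram language model. It is not how Transformer LLMs work internally; it only illustrates the prediction task. The toy corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each word follows each one-word context."""
    counts = defaultdict(Counter)
    for context, next_word in zip(tokens, tokens[1:]):
        counts[context][next_word] += 1
    return counts


def next_word_distribution(counts, context):
    """Estimate P(next word | context) by relative frequency."""
    total = sum(counts[context].values())
    return {word: n / total for word, n in counts[context].items()}


# Toy corpus, invented for illustration only.
corpus = "the dog barks . the dog sleeps . the cat sleeps .".split()
model = train_bigram_model(corpus)
dist = next_word_distribution(model, "the")
print(dist)  # "dog" gets probability 2/3, "cat" gets 1/3
```

A large language model replaces the count table with a neural network holding billions of parameters and conditions on far longer contexts, but the output has the same shape: a probability distribution over possible next elements.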

The Transformer architecture (Vaswani et al. 2017) is currently the most prominent version of neural network deep learning. Without going into details of the Transformer architecture, it can be said that it has enabled the networks to make better sense of general contextual information, although often limited to relatively small context windows. For more details, I recommend this post on the inner workings of the original transformer, this post on the family of Transformer models, and for an academic survey of what we know about Transformers see Rogers et al. 2020.

Predicting the next element in a sequence is a very general task. It occurs when creating dialogue responses, as in the case of ChatGPT, but it can also serve as a training task to find generalising model parameters, which can then be used by integrating the Transformers in larger computational systems. For example, it is common to use states from the BERT model (Devlin et al. 2019) by putting additional neural network heads on top. Some Transformer LLMs have also been trained on tasks other than predicting words in context. In the case of ChatGPT, the model has also been trained using reinforcement learning, in addition to the more standard deep learning methods. I will not consider the impact of these strategies in detail.

Do we have a widely accepted set of necessary and jointly sufficient conditions for ascribing a mind?

The answer to this question is: no.

There is no such thing as a scientific consensus on the nature of minds. I personally endorse the computational theory of mind, according to which minds are computational systems. This condition is fulfilled by Transformers, but it is likely insufficient. After all, the chip in a microwave is a computational system and we usually do not consider it a mind.

There are some standard proposals for additional conditions that might be added to arrive at a jointly sufficient set:

Ability to solve problems
Presence of certain mental states such as beliefs (access consciousness)
Phenomenal consciousness
Presence of a non-physical substance (substance dualism)

This list is by no means exhaustive, but it provides a sense of the range of possible conditions for having a mind. Depending on which conditions one endorses, the case for declaring Transformer LLMs to be minds looks quite different.

The first condition is the least restrictive, so it is not surprising that a transformer LLM is likely to fulfil it. LLMs seem to be able to solve some problems, most obviously predicting the next word in a sentence. But with little modification they can also address other problems, such as certain parentheses matching tasks (Weiss et al. 2021) and disambiguating word senses at a (near-)human level (Conia & Navigli 2021). But then, one might worry: Does not a thermostat solve the problem of keeping the temperature at a certain level? I specify a temperature and it figures out when to turn the heating on and for how long. We will come back to that case, but it suggests that the ability to solve problems, broadly understood, might be necessary but insufficient for being a mind.

The second condition requires the computational system to have “access consciousness”, which I equate with having mental states, such as beliefs and wants. Colloquially, we say that someone has a mind of their own to underline that they have beliefs and wants of their own. They are not just tools solving problems for us, but they represent the world as being a certain way and try to intervene in the world to make it align more with how they want it to be. Mental states, such as beliefs and wants, characterise the notion of access consciousness.² It is a restrictive, but quite appealing requirement for ascribing a mind. My microwave falls short of it, but humans, most of them anyway, satisfy it.

The third condition is the presence of phenomenal consciousness. To have phenomenal consciousness is to experience the world in a qualitative way (Nagel 1974). Common examples are the quality of red and pain. Phenomenal consciousness has been heavily debated and few if any conclusions have been reached. While I strongly doubt that LLMs have phenomenal consciousness,³ it is less clear that we want to make this a necessary condition for having a mind. Especially in the presence of beliefs and desires, phenomenal qualities appear more of an additional feature. Assume that you meet a future AGI-robot and it truthfully confesses that its experience of the world lacks any quality. It would still be sensible to ask this robot what it believes and wants and on some occasion to ask it what’s on its mind.

Requiring the presence of a non-physical substance is the fourth and most restrictive condition. It is strongly associated with the kind of dualism proposed by Descartes. I don’t believe in the existence of a non-physical mind substance, so as far as I am concerned humans fail to meet it as well. That makes it, presumably, too restrictive a condition. Other formulations of dualism avoiding a non-physical substance have been put forward, but usually they mainly attempt to capture phenomenal consciousness, which I have already covered in the previous paragraph.

There is no uncontroversial criterion for mindhood, but I suggest that having mental states is a decent starting point. If we were happy to describe Transformer LLMs as having beliefs and wants, then they would seem excellent candidates for being minds. Conversely, if LLMs lacked these states, we would be more sceptical about their chances. Either way, we could still continue to quibble, but with a much improved understanding of the matter at hand.

For the sake of this post, I will accept access consciousness as a starting point. Accordingly, I will consider a computational system to be minded if it has states sufficiently similar to mental states such as beliefs. This suggestion would be more helpful, of course, if we had a universally accepted theory of such mental states. Unsurprisingly, we lack such a theory and so I will in the next section resort to providing two samples from the realm of plausible theories.

Is there a somewhat plausible cognitive science theory of mental states that would describe transformer LLMs?

The answer to this question is: yes.

Daniel Dennett has long been a proponent of the so-called “intentional stance” theory of minds. He proposes that mental states such as beliefs can only be discerned by taking a certain predictive strategy, which he calls the “intentional stance”. To take this stance, an observer considers the system, assigns some beliefs and wants to it, and then makes predictions on that basis. For example, you might take the intentional stance when trying to figure out why the neighbour’s dog is barking. You might postulate that it believes that there is someone entering the place and wants to scare them away.

Having introduced the intentional stance, Dennett goes on to argue that any system

whose behavior is well predicted [from this stance] is in the fullest sense of the word a believer. (Dennett 1989, p. 15)

According to Dennett, there is nothing more to having a mind with beliefs than an observer taking the intentional stance towards the system, ascribing beliefs, and having reasonable success with this approach. If you can start to predict the barking of the dog based on your belief and desire ascriptions, your use of the intentional stance was successful. Next time someone enters the place, the dog barks again. Success!

The degree of success required to justify the attribution of mental states is debatable, because the strategy also has some success with thermostats. (I promised we would get back to the example.) After all, I can predict that the thermostat will turn up the heater by attributing to it the belief that it is 18° and the desire to have the room be at a temperature of 21°. But this is a questionable use of the intentional stance, because I can predict what will happen relatively simply by describing the thermostat as a mechanism.
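How little the mechanical description needs can be shown in a few lines. The following is a minimal sketch, assuming a simple on/off thermostat; the function name and the `tolerance` parameter are illustrative assumptions, not a description of any real device.

```python
def heater_should_run(current_temp, target_temp, tolerance=0.5):
    """Mechanical rule: heat while the room is colder than target minus tolerance.

    The tolerance band is an assumed detail, added so the heater does not
    switch on and off for tiny fluctuations around the target.
    """
    return current_temp < target_temp - tolerance


# The example from the text: the room is at 18 degrees, the target is 21 degrees.
print(heater_should_run(18.0, 21.0))  # True
print(heater_should_run(21.0, 21.0))  # False
```

The whole behaviour is predicted by one comparison, without any mention of beliefs or desires, which is why the intentional stance adds little in this case.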

Mechanical descriptions of dogs and Transformer LLMs, by contrast, quickly run into difficulties. They might not be impossible and important research has been produced (e.g. Voita et al. 2019, Dai et al. 2022, Geva et al. 2022), but the challenges in formulating them and the ease of prediction provided by the intentional stance justify the ascription of beliefs. At least, that is what the intentional stance theory purports.⁴ According to Dennett’s intentional stance theory of mind, Transformer LLMs have minds.

To be clear, Dennett’s intentional stance theory is not widely accepted within cognitive science. It receives attention and might very well be presented in a standard philosophy of mind or cognitive science course, but is treated as a rather extreme view within the field. Dennett serves as a positive example showing that there is a somewhat plausible cognitive science theory ascribing a mind to Transformer LLMs, not more.

Like many others, I am more tempted by what is known in the philosophy of mind as “functionalism”. Functionalism is a family of theories which judge whether something is a mental state by checking whether it has the required functional profile. For example, something is a pause button if it has the functional profile of stopping a relevant process when you push it. The button itself can be made of wood, metal, or even only exist graphically on a touchscreen. What matters is that it plays the right role in a larger system. Being a belief would then be similar to being a pause button. The states in the LLM or in the human brain have to play the right role in the overall cognitive system to be beliefs.

Functionalism typically requires more than the kind of predictive success that Dennett declared the criterion for mind ascription. Whatever realises the mental states, be it wetware or hardware, has to fit a certain functional profile, not just lead to predictive success. The question, therefore, becomes what the relevant functional profiles are for mental states. What is their characteristic role in computational systems? These functional profiles would have to be very abstract if we seek to attribute beliefs both to humans and dogs, which after all have quite different cognitive capacities. In the case of belief, the profile would probably require some interaction with wants, so that the system acts on the basis of its beliefs towards achieving its wants. Whether Transformer LLMs fulfil this aspect of the functional profile is debatable. But one might propose a stricter functional profile and require

the candidate state to induce physical engagement with the world under suitable conditions, or
to enable forms of reasoning which LLMs still do not reliably exhibit.

Once again, there is nothing close to a universally accepted answer on what the functional profile of beliefs is. But even if we could agree on the functional profile of belief states, then we would have to wonder whether having states with similar but not quite the same functional profile is sufficient for access consciousness and therefore being a mind. Perhaps Transformer LLMs do not have beliefs but almost-something-like-beliefs. If that were so, would that justify declaring them to be minds? How far off can the functional profile be before we draw the line? In light of such issues, I suggest we should ask whether the search for a sufficient condition for being a mind might be based on a misunderstanding of the semantics of MIND.

What kind of concept is the concept MIND?

My partial answer to this question is: A graded concept to be used with care.

At least so far, no one has found a convincing analysis of what it is to have the mental states that would characterise a mind. Certainly, no one has been able to settle the debates with a set of necessary and jointly sufficient conditions. That might also be because MIND does not allow for an analysis in terms of such conditions.

The concept of MIND resembles that of BIRD more closely than that of HYDROGEN. An atom is a hydrogen atom if and only if it has exactly one proton at its core. No such biconditional exists for BIRD. Instead we have core examples of what falls under the concept (a robin), further removed examples (a penguin), and as we go further down the spectrum to the archaeopteryx, we do not know at which point to draw the line. The concept of BIRD has something like a prototypical structure, where we have features that are typical for birds, but we have no clear cutoff for being a bird.⁵

Being a mind might be like being a bird. A computational system might be closer or further away from being a prototypical mind. Being arrogant enough to take the human mind as the prototype of a mind, a Transformer LLM is certainly far removed – and I will stress this below again – but this is not to say that they do not fall under the concept at all. Penguins do not fly; they do not live in trees; they eat fish instead of seeds. They are pretty far removed from the typical bird. Penguins are, nonetheless, birds.

By now I have justified my assertion that part of the correct answer to the original question is: It is complicated. It is complicated because we don’t know enough about minds and Transformer LLMs and it is complicated, because the semantics of the concepts allow some stretching. But I will end on a more definitive note. At least for now, I believe in most contexts it is not particularly useful to call LLMs minds.

When we think of minds, we think of humans and other relatively complex animals that

engage with non-textual objects in the world and manipulate them according to their wants,⁶
self-reproduce and have been subjected to long-term evolution,
are heavily multi-modal (e.g. have smell, touch, proprioception etc.),
are probably not trained by backpropagation,
spend a significant amount of their cognitive capacity on securing their energy source,
either don’t use language at all or use it as humans,
have phenomenal consciousness.

A standard Transformer LLM lacks these features, and it is therefore usually not helpful to call LLMs minds. The point is not that these features are necessary conditions for having a mind – I have dismissed this for the case of phenomenal consciousness – but that they are characteristic of having a mind. This might change over time: the concept MIND might come to be recentered closer to Transformer LLMs as AI systems become widespread. For now, however, you and I are the more typical minds in the world.

Were I to discover the fossil of an archaeopteryx in my garden, I should not tell the museum that I have found some bird bones behind the house. There might be a way to construe the statement as true, but it is not particularly helpful. The same usually applies to calling Transformers minds.

I propose a broadly contextualist approach. If you sit in a philosophy seminar and try to list all existing kinds of minds, then it might be sensible to mention LLMs as an edge case. For making sense of LLMs as a tool, it is unhelpful, because human minds cannot be used as tools in the same way as Transformer LLMs. For analysing their inner workings, it is unhelpful to describe LLMs as minds, because we need to look at more specific mechanisms for that purpose. For trying to predict their social impact, it is unhelpful to describe LLMs as minds, because adding Transformer LLMs to the world is not like adding human minds to the world. The similarities are not as relevant as the differences in most contexts.

My advice is to avoid calling an LLM a mind, unless you find yourself in one of the rare situations in which doing so helps to move forward the discussion. Personally, I’ll try to set this topic aside after this post and focus on more fruitful questions, such as how transformers deal with compositional meaning.

References

Block, N. (1995). On a Confusion About a Function of Consciousness. Behavioral and Brain Sciences, 18(2), 227–247.
Conia, S., & Navigli, R. (2021). Framing Word Sense Disambiguation as a Multi-Label Problem for Model-Agnostic Knowledge Integration. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 3269–3275.
Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., & Wei, F. (2022). Knowledge Neurons in Pretrained Transformers. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8493–8502.
Dennett, D. C. (1989). The Intentional Stance. MIT Press.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [CS].
Geva, M., Caciularu, A., Wang, K., & Goldberg, Y. (2022). Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 30–45.
Nagel, T. (1974). What is It Like to Be a Bat? Philosophical Review, 83(October), 435–450.
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-maron, G., Giménez, M., Sulsky, Y., Kay, J., Springenberg, J. T., Eccles, T., Bruce, J., Razavi, A., Edwards, A., Heess, N., Chen, Y., Hadsell, R., Vinyals, O., Bordbar, M., & Freitas, N. de. (2023). A Generalist Agent. Transactions on Machine Learning Research.
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8, 842–866.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information Processing Systems, 30.
Voita, E., Sennrich, R., & Titov, I. (2019). The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4396–4406.
Weiss, G., Goldberg, Y., & Yahav, E. (2021). Thinking Like Transformers. Proceedings of the 38th International Conference on Machine Learning, 11080–11090.

Footnotes

I will neglect in this post the difference between Transformer LLMs being a mind and having a mind. As long as a mind is realised, I will consider it a positive answer to the present question. ↩
According to Ned Block, “a state is access-conscious (A-conscious) if, in virtue of one’s having the state, a representation of its content is (1) […] poised for use as a premise in reasoning, (2) poised for rational control of action, and (3) poised for rational control of speech” (Block 1995, p. 231). ↩
As far as I can tell, the best way we have to argue for this conclusion is to point towards physical correlates of phenomenal consciousness and suggest neural networks lack them. It is far from clear that this is a decisive strategy, partially because there might be multiple ways of bringing consciousness about, i.e. our human correlates for phenomenal consciousness might not be necessary. ↩
The commonly known problems of LLMs with factuality do not affect this result in the slightest, since the question is whether the system has beliefs, not whether these beliefs are correct. ↩
I say “something like”, because I want to avoid commitment to the entire theory of prototypes. I suggest that there are graded areas of falling under a concept, not that they are well-described by the distance to a single point in a semantic space, the prototype. ↩
I specifically deal with Transformer LLMs in this post. More generalist models, such as GATO (Reed et al. 2023), would require further discussion. I still doubt, however, that they are accurately described as having wants. ↩

Speculations about Transformers and Compositionality

Sun, 19 Feb 2023 10:08:13 +0000

Warning: Speculative Content. Expect that parts of it will be proven wrong.

The meaning of natural language sentences is compositional.
- The meaning of an expression \( \mathbf{E} \) syntactically derived from the sub-expressions \( \mathbf{E}_1, \mathbf{E}_2, \dots \) is a function of the semantic value of the sub-expressions. Writing \( |\mathbf{E}| \) for the semantic value of the expression \( \mathbf{E} \), the compositional thesis is that: \( |\mathbf{E}| = f( |\mathbf{E}_1, \mathbf{E}_2, \dots | ) \) ¹
Transformers (Vaswani et al. 2017) do not correctly implement the compositional semantics of natural language cognition as found in human agents.
- Human cognition includes a dedicated mechanism to derive the meaning of the expression \( \mathbf{E} \) from its sub-expressions compositionally.²
- There is no dedicated mechanism in transformers to derive the meaning of the overall expression \( \mathbf{E} \).
Transformers have to compensate for their lack of directly compositional language processing and partially succeed in this.
- Attention allows the transformers to partially compensate for lacking a compositional mechanism.
- The compensatory role of attention is part of the explanation why some attention heads reflect syntactic connections (see section 4.2.1 of Rogers et al. 2020).
The compensation mechanisms of transformers lead to over-contextualisation of later level token embeddings.
- The over-contextualisation is a partial explanation why transformer embeddings from earlier levels perform better on lexical semantic tasks (cf. Vulić et al. 2020).
Some limitations of transformers will be overcome by using a mechanism that reflects composition more directly.³
- A mechanism other than attention will be used.
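As a point of comparison for the theses above, here is a minimal sketch of what a dedicated compositional mechanism looks like in a toy setting. It implements the more common atomic formulation, in which each sub-expression contributes its own semantic value, not the more general relational one. The lexicon, the set-based semantic values, and the choice of intersection as the composition function are all illustrative assumptions.

```python
# Toy lexicon: the semantic value of a word is the set of things it applies to.
# All entries are invented for illustration.
LEXICON = {
    "red": {"red_boat", "red_car"},
    "blue": {"blue_boat"},
    "boat": {"red_boat", "blue_boat"},
    "car": {"red_car"},
}


def meaning(expression):
    """Compute |E| from the semantic values of its sub-expressions.

    An expression is either a word (str) or a pair (modifier, head).
    """
    if isinstance(expression, str):
        return LEXICON[expression]
    modifier, head = expression
    # The composition function f: intersective modification.
    return meaning(modifier) & meaning(head)


print(meaning(("red", "boat")))  # {'red_boat'}
```

The point of the sketch is that the value of the whole is computed by one explicit function applied along the syntactic structure. Nothing in the transformer architecture is dedicated to this step in the same way.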

For feedback, comments, and complaints, email me at david.strohmaier@cl.cam.ac.uk. Links to relevant research are appreciated.

References

Fine, K. (2007). Semantic Relationism. Blackwell.
Fodor, J. A., & Lepore, E. (2002). The Compositionality Papers. Oxford University Press, U.S.A.
Pylkkänen, L. (2020). Neural basis of basic composition: What we have learned from the red–boat studies and their extensions. Philosophical Transactions of the Royal Society B: Biological Sciences, 375(1791), 20190299.
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8, 842–866.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information Processing Systems, 30.
Vulić, I., Ponti, E. M., Litschko, R., Glavaš, G., & Korhonen, A. (2020). Probing Pretrained Language Models for Lexical Semantics. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7222–7240.

Footnotes

I am using here the more general formula of Kit Fine’s (2007) Semantic Relationism. More commonly the sub-expressions are taken to contribute their semantic values in an atomic fashion, i.e. \( \vert \mathbf{E} \vert = f( \vert \mathbf{E}_1 \vert, \vert \mathbf{E}_2 \vert, \dots ) \) ↩
This has, to my knowledge, not been established yet. It has been argued for, in some form, by cognitive scientists such as Fodor & Lepore (2002), but my more optimistic assessment of neural methods is hard to square with these arguments. According to my understanding, empirical evidence from the level of cognitive neuroscience is missing or weak (e.g. Pylkkänen 2020). ↩
This claim does not concern whether dealing with compositionality in the absence of a new mechanism is possible, but instead concerns which development path will be taken due to differences in feasibility. ↩

10 Years of word2vec: Motivations and Success

Fri, 10 Feb 2023 13:32:13 +0000

Once in a while, a publication resets the literature. There is a clear before and after, as researchers cite the new publication while neglecting the earlier literature upon which it built. The word2vec papers by Mikolov et al., which were published about a decade ago in 2013, are an instance of this.¹ As can happen with such papers, the original motivations became overshadowed by later applications.

In this post I will lay out what those motivations were, how the embedding literature built upon them, and what made word2vec such an outstanding success.

I will not go into the details of the word2vec algorithms, since numerous blog posts have been written on this topic already. If you need a refresher, the following might be worth a look:

Word2Vec Resources (by Chris McCormick)
king - man + woman is queen; but why? (by Piotr Migdał)
The Illustrated word2vec (by Jay Alammar)

Why word2vec

Mikolov et al. motivated the word2vec algorithms as fulfilling the following goals:

Motivation: Go beyond representing words as atomic units.
Motivation: Introduce a way to measure similarity.

The first motivation was strongly associated with the idea of representation learning (cf. Bengio et al. 2013). Back then, the representations were often used by non-neural systems. We used word2vec embeddings with an SVM for sentiment classification during my MPhil studies. But with the progress of neural NLP, Seq2Seq models such as Transformers have become the standard. The representations inside those models are primarily used to explain their behaviour, and rarely taken as the primary object of research.

The second motivation had its source in the linguistic notion of a semantic space, which had been explored by computational linguists and NLP researchers for years prior to word2vec (see Erk 2012). These models had been largely motivated by theories of lexical meaning and attempted to implement them.

Analogies and Lexical Semantic Properties

Famously, word2vec was able to solve analogy problems, at least some of them.

For example, starting with the embeddings for CAR and DRIVER, one can calculate that for PLANE the analogous concept to DRIVER is that of PILOT. All one had to do was to subtract the vector of CAR from the vector for DRIVER and add the result to the vector of PLANE. The next closest vector would, if it worked, be that for PILOT, i.e.

\[ \vec{v}_{plane} + ( \vec{v}_{driver} - \vec{v}_{car} ) \approx \vec{v}_{pilot} \]

We can think of these analogies as capturing relations, e.g. there is a relation that holds both between CAR and DRIVER and between PLANE and PILOT.
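The vector arithmetic can be sketched in plain Python. The three-dimensional vectors below are hand-set toy values (roughly: ground vehicle, aircraft, operator), not trained word2vec embeddings, and the function names are illustrative. Excluding the three query words from the candidates is a common convention in analogy evaluation and is assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word d with a : b :: c : d, using v_c + (v_b - v_a)."""
    target = [vc + (vb - va)
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves from the candidates.
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-set toy vectors, invented for illustration only.
TOY_VECTORS = {
    "car": [1.0, 0.0, 0.0],
    "driver": [1.0, 0.0, 1.0],
    "plane": [0.0, 1.0, 0.0],
    "pilot": [0.0, 1.0, 1.0],
    "wing": [0.0, 1.0, 0.1],
}

print(solve_analogy(TOY_VECTORS, "car", "driver", "plane"))  # pilot
```

With real embeddings the target vector rarely coincides with any word vector, which is why the nearest neighbour by cosine similarity is taken as the answer.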

The embeddings appeared to capture such relations in an intuitively interpretable way. In response, researchers sought to explain how these results came about (e.g. Levy & Goldberg 2014; Arora et al. 2019; Hashimoto et al. 2016), and then improve the linguistic quality of embeddings. Researchers engaged in that second endeavour argued that representing words as simple vectors failed to encode various lexical semantic properties, for example:

Polysemy: A single vector does not appear to exhibit the variety of senses a word might carry.
Vagueness: OLD is a vague concept, and OCTOGENARIAN is not, or at least to a much smaller extent. A vector does not carry a specification of the vagueness in an obvious manner.
Taxonomical hierarchies: All dogs are mammals, but the vectors are not exhibiting such inclusion relationships.

The claim here is not that embeddings can never be used to detect polysemy or vagueness, e.g. by feeding them into a neural classifier, but that the vectors do not reflect such semantic properties in a straightforward way, similar to the way in which they captured analogies. Highly sophisticated approaches have been proposed, but the resulting models are often hard to train (for a survey and discussion see Emerson 2020).

Two Perspectives on Embeddings

In its focus on encoding semantic properties, parts of the embedding literature have deviated from the priority ranking of Mikolov et al. While Mikolov et al. referred to “meaningful regularities”, what mattered in the first instance was the downstream application. Vectors were not expected to reflect all regularities of word meanings.

Reconsidering Mikolov et al.’s original motivation as well as the literature it spawned, both a perspective emphasising downstream application and one focused on encoding semantic properties can be discerned. I will give one argument for each of the two views:

My argument for the first perspective is that word meanings as cognitive objects in human language users do not exist independently of other practical purposes. Both the word processing in human brains and the vectors we are concerned with have their role within larger computational processes (cf. Gauthier & Ivanova 2018). Encoding linguistic properties, such as taxonomical hierarchy, for their own sake would therefore not reflect human language cognition.

To support the second view, one can argue that human word meanings are general purpose, and that they achieve this status because they exhibit the semantic properties in question, e.g. taxonomical hierarchy or logical entailment. Accordingly, working towards encoding such properties brings us closer to improvements on many downstream tasks.

A problem with the argument for the second view is that the embeddings created to encode semantic properties have found little use so far. For example, region embeddings, which should be able to capture taxonomical hierarchies, have not found a purpose in more application-oriented systems yet. By comparison, word2vec embeddings and contextualised embeddings based on ELMo (Peters et al. 2018) and BERT (Devlin et al. 2019) did so very quickly.
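To illustrate why regions lend themselves to taxonomical hierarchies, here is a minimal sketch that assumes axis-aligned boxes as the regions. This is only one possible kind of region representation, chosen for simplicity; the coordinates are hand-set for illustration and not learned from data.

```python
def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer`.

    A box is a list of (low, high) intervals, one per dimension.
    """
    return all(o_lo <= i_lo and i_hi <= o_hi
               for (o_lo, o_hi), (i_lo, i_hi) in zip(outer, inner))


# Hand-set toy regions, invented for illustration only.
MAMMAL = [(0.0, 10.0), (0.0, 10.0)]
DOG = [(2.0, 4.0), (3.0, 5.0)]

print(contains(MAMMAL, DOG))  # True: all dogs are mammals
print(contains(DOG, MAMMAL))  # False: not all mammals are dogs
```

Inclusion is asymmetric by construction, whereas similarity measures between point vectors, such as cosine, are symmetric and so cannot express on their own that DOG falls under MAMMAL but not the reverse.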

A key takeaway is that word2vec was able to reset the literature 10 years ago, because it made progress both in capturing linguistic regularities and in supporting downstream applications. With the exception of contextualised embeddings, such success has not been forthcoming since, despite many attempts. Whoever finds another way to make such combined progress has a good chance of changing NLP history.

References

Arora, S., Li, Y., Liang, Y., Ma, T., & Risteski, A. (2019). A Latent Variable Model Approach to PMI-based Word Embeddings (arXiv:1502.03520; Version 4).
Bengio, Y., Courville, A. C., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8), 1798–1828.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805.
Emerson, G. (2020). What are the Goals of Distributional Semantics? Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7436–7453.
Erk, K. (2012). Vector Space Models of Word Meaning and Phrase Meaning: A Survey. Language and Linguistics Compass, 6(10), 635–653.
Gauthier, J., & Ivanova, A. (2018). Does the brain represent words? An evaluation of brain decoding studies of language understanding (arXiv:1806.00591).
Hashimoto, T. B., Alvarez-Melis, D., & Jaakkola, T. S. (2016). Word Embeddings as Metric Recovery in Semantic Spaces. Transactions of the Association for Computational Linguistics, 4, 273–286.
Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, 2177–2185.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013b). Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, 3111–3119.
Mikolov, T., Yih, W., & Zweig, G. (2013c). Linguistic Regularities in Continuous Space Word Representations. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237.

Footnotes

On Google Scholar, one of the word2vec papers is cited by more than 37000 publications, while an important source of that paper has a mere 1216 citations to its name. Similar disruptions in citation patterns have been used as a measure of scientific progress.

Zotero BibLaTeX Style

Sun, 22 Jan 2023 12:00:00 +0000

I regularly use Zotero for managing my bibliographies. I then usually typeset my writings in LaTeX, citing references with BibLaTeX.

For exporting BibLaTeX .bib files, I recommend the “Better BibTeX” extension. But sometimes I prefer to copy citations to my clipboard rather than export an entire .bib file. For BibTeX, there exists a bibliography style that allows Zotero users to do so. I have now started to adapt this file to BibLaTeX. You can find the result here.

The file is still work-in-progress and you should not rely upon it. If you find any problems with the style, feel free to email me or make a pull request.

Personal Reflections on 2022

Fri, 16 Dec 2022 21:08:13 +0000

Another year has passed and I want to take the opportunity to reflect on my research pursuits at a higher level of abstraction. Hence, I will jot down notes on the lessons I cannot but take myself to have learned and sketch an intention of how to live as a researcher.

A Conclusion I’ve Drawn, Rightly or Wrongly

As a human being, it is hard to avoid drawing lessons from one’s life, even though one knows the data underlying this inference process to be limited and biased. On this occasion, I will indulge the impulse. One of the lessons I have drawn for myself is the following: Pursue a limited number of research projects doggedly. As my eclectic CV indicates, this is a lesson I learned the hard way.

I already hinted at this lesson in 2020. Back then, I pointed out a danger of this lesson: the danger of ending up in a blind alley. Some research paths lead nowhere, no matter the determination with which one follows them. In some cases, there is little to be done about that, because the universe does not signpost the paths to its secrets. Research is usually a gamble. Often, however, a clear-eyed look at where one’s research efforts have led so far allows one to infer that they will not lead to pastures bearing sweeter fruits. That does not necessarily stop those who have already committed themselves to the path. Untold numbers of academics have ended up spending their lives in such a manner, and many even enjoyed it.

Occasional reflection is supposed to fend this danger off. Pursue a limited number of research projects doggedly, but once in a while step back to reconsider them. One can consider a research project from various distances, and this post serves for reflection from the largest distance, that is, at the greatest abstraction: research as a choice of what to spend a life on. So let it be recorded that, from this distance, I am still optimistic with regard to the path I have chosen.

Research into lexical acquisition in human agents and NLP models remains promising. There is clear progress in NLP, widely advertised, but the progress is not well understood and clearly patchy. For example, how transformers cope with the compositionality of lexical meaning has, as far as I am aware, not yet been explained. I have no reasonable doubt that research in this area will push the epistemic frontier forward – and I want to contribute to it.

Why Writing Such Reflections Infrequently Is a Good Idea

As already mentioned, I have written another post like this back in 2020, but then I skipped 2021. Instead I started 2021 with a post on Prolog, which has probably been my most successful blog post so far, in terms of engagement but also in what I was able to learn from the results. Generalising from my limited blogging experience, writing about first-order interests, such as a neglected programming language, has proven more productive and more in line with my own goals than obscure ruminations about my research path.

Second- and higher-order thoughts, such as the reflections in the present post, serve to correct our first-order pursuits or increase their efficiency,[0] but they can become their own pursuit that distorts our behaviour. If I intend to live my life as a researcher, then not for the sake of writing about this life. I live it for the sake of the epistemic progress brought about by this research. Any post like the present one should be no more than the rare exception. Special events, such as the passing of a year, provide a limited occasion to engage in such exceptional behaviour. This post fills that role for this year, and now it is done.

Onward, for scientific progress!

Footnotes

[0] What is the justification for this assertion of purpose? Ah, there is the rub. That is a philosophical question I’ll leave for another day.

Transformers and the Brain: Literature Notes

Mon, 01 Aug 2022 17:00:13 +0100

Introduction

Neural networks with Transformer-architecture remain the state of the art in natural language processing (NLP). For many tasks the first approach is to throw some version of the BERT model (Devlin et al. 2019) at it – a practice I’ve participated in (Yuan et al. 2021a, 2021b). The success of the Transformer-architecture has raised the question of how such models compare to language processing in the human brain, and a literature is growing around this question. In this post, I collect notes on selected papers which try to map representations in Transformer models to brain data. First I’ll list a few conclusions from the literature and then move through the selected papers to substantiate the conclusions and make further points.

The main conclusions are:

The Transformer-architecture is better than previous RNN architectures. That is, the mapping of Transformer models to brain data allows one to predict more of it than if one uses an RNN architecture, typically LSTMs or GRU networks.
Word prediction performance matters, but is not everything. The capacity for predicting the next word given an incomplete sequence does not explain all that is special about Transformers.
We do not know why the Transformer-architecture performs so well, but semantics might play a role.
We need better brain data.

Human Sentence Processing: Recurrence or Attention?

This paper by Merkx and Frank (2021) explicitly compares GRU-RNN to Transformer models. They implement these models themselves and make them comparable, e.g. the total numbers of parameters are relatively close. The models are trained on the next-word prediction task. They are evaluated on

self-paced reading (SPR),
eye-tracking (ET),
and electroencephalography (EEG) data.

The top-line result is that even controlling for performance as a language model, i.e. being able to predict word tokens, Transformer models tend to do better, specifically on the SPR and EEG datasets.[0] Something about the architecture other than its ability to capture statistical information about word distributions appears to make it especially well-suited for predicting brain performance.

The authors show themselves surprised by the superior performance of the Transformer-architecture, because they “considered the Transformer’s unlimited memory and access to past inputs implausible given current theories on[sic] human language processing” (p. 18). While the authors are not giving up this view and therefore remain more sceptical than the authors of other papers I’ll mention, they consider the possibility that Transformers capture something about human language cognition. Specifically, they entertain the idea that the attention-mechanism resembles cue-based retrieval, but since they do not provide much detail on this hypothesis, I do not feel confident evaluating it.

Brains and Algorithms Partially Converge in Natural Language Processing

Caucheteux and King (2022) look at Transformer models and ask how the performance of such models on a word prediction task[1] and on predicting brain measurements relate. The key findings are:

Performance on predicting words strongly correlates with predicting brain scores.
The relationship breaks down at the upper end of next-word prediction performance, that is the best models the authors have trained for word prediction do somewhat worse predicting brain scores. This suggests that Transformer models start to overfit to the word-prediction task to the detriment of being able to predict brain measurements.

Different Kinds of Cognitive Plausibility: Why Are Transformers Better than RNNs at Predicting N400 Amplitude?

Similarly to Merkx and Frank, Michaelov et al. (2021) compare RNNs and Transformer models in how well they can predict brain data, in this case the N400 amplitude (EEG study). They used an already existing LSTM model and GPT-2. In contrast to the experiments by Merkx and Frank, the models differ in many ways other than the difference between RNN and Transformer, e.g. vocabulary size and number of parameters.

The paper also shows that the Transformer model does better at predicting the human brain data than its RNN competitor. Additional experiments suggest that part of the reason GPT-2 does better is that the cosine similarity feeds more into the surprisal of the model. Taking cosine similarity as a measure of semantic similarity, the authors hypothesize that ‘bag-of-words’ semantic activation may be part of the neurocognitive system that is measured by the N400 amplitude. But this claim is again to be considered speculative.
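To make the two quantities in this hypothesis concrete, here is a minimal Python sketch of surprisal and cosine similarity. It is not the code used by Michaelov et al.; the function names and the toy inputs are my own assumptions, and a real experiment would take the probabilities and vectors from a trained model.

```python
import math


def surprisal(probability):
    """Surprisal in bits of a word to which a model assigned `probability`."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(probability)


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy inputs (assumed for illustration only, not taken from any model).
likely_word = surprisal(0.5)     # a word the model finds fairly expected
unlikely_word = surprisal(0.01)  # a word the model finds surprising
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A word that the model considers likely has low surprisal, and two vectors pointing in the same direction have a cosine similarity of 1 regardless of their length.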

The Neural Architecture of Language: Integrative Modeling Converges on Predictive Processing

This paper by Schrimpf et al. (2021) offers one of the most encompassing comparisons across model architectures and datasets. Without going into all the details, GPT-2 stands out as the best model.

The authors replicate the finding that performance on next-word prediction predicts performance on predicting brain measurements. Importantly, the authors compare the next-word prediction task with tasks from the GLUE benchmark and find that these do not predict brain scores.

The paper also tests whether the model architecture matters by computing brain scores for models with random weights.[2] The authors show that even under such conditions some models achieved noteworthy correlation. The Transformer architecture alone seems to do some of the work.
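The general idea of a brain score computed with a linear model on top of model activations can be sketched as follows. This is a deliberately simplified illustration, not the pipeline of Schrimpf et al.: it assumes a single activation feature per stimulus, an ordinary least-squares fit, and Pearson correlation on held-out stimuli as the score. All function names and numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        raise ValueError("cannot fit a line to a constant feature")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def brain_score(train_act, train_brain, test_act, test_brain):
    """Fit the linear map on training stimuli, score it on held-out ones."""
    slope, intercept = fit_line(train_act, train_brain)
    predictions = [slope * a + intercept for a in test_act]
    return pearson(predictions, test_brain)


# Made-up activations and measurements, for illustration only.
score = brain_score(
    train_act=[0.0, 1.0, 2.0, 3.0],
    train_brain=[1.0, 3.0, 5.0, 7.0],
    test_act=[4.0, 5.0, 6.0],
    test_brain=[9.0, 11.0, 13.0],
)
```

In this sketch, a randomly initialised model would differ only in where the activations come from. The linear map is fitted in the same way, which is why such a model can still obtain a non-trivial score.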

The paper is perhaps the most optimistic one when it comes to the ability of Transformers to predict brain data. On some datasets, the authors come to the conclusion that Transformer models reach the noise ceiling, i.e. that the model does as well as possible. One dataset, however, remains very challenging: The Blank 2014 dataset consists of fMRI measurements where the stimuli are auditorily presented stories. Both the larger narrative context of stories and the auditory transmission stand out.[3]

The authors of this paper suggest a convergence between neural models in NLP and cognitive science, since (next-)word prediction is a key task in NLP and predictive processing holds increasing sway in cognitive science. While the authors’ comparison with the GLUE tasks is suggestive in this regard, I am not yet sold that we see a proper convergence. The tasks humans did might be biased towards next-word prediction (with perhaps the exception of Blank (2014), where the models did worst). Furthermore, I would not be surprised if the data from the GLUE benchmark are not as reliable as those for next-word prediction since they rely on challenging annotation by experts, hence the network might start to model noise to a great extent.

Be that as it may, a convergence on prediction would not explain why the Transformer-architecture performs so well on both standard NLP tasks and predicting cognitive measures. LSTMs have also been trained on next-word prediction but do not perform as well. To explain the role of the Transformer-architecture, the authors point (amongst other things) towards the role of smoothed multi-scale processing and propose that this might capture something about language structure, but this discussion is merely suggestive.

Coming from NLP rather than neuroscience, I also learned from this paper that we need better brain data. The noise ceilings estimated by the authors, that is, their estimates for how well brain measurements can be predicted in general, are rather low. Accordingly, much of the brain measurement data is treated as individualised noise. The authors suggest that such a low ceiling might be due to language processing occurring at a high level of cognition where the brain processing might not be stimulus-driven but top-down. As a result, there might just not be one pattern across individuals to predict. That seems speculative to me and better measurement might help raise the ceilings and thereby

Conclusion

I’ve already listed above the conclusions I’ve drawn from this emerging literature. The papers indicate a clear direction: Transformer models do well at predicting brain measurements, usually better than RNNs, and the architecture plays a role. Why they are doing better remains unclear, with multiple hypotheses being considered. It is intriguing that both the hypothesis by Merkx and Frank (cue-based retrieval) and the one by Michaelov et al. (‘bag-of-words’ semantic activation) have a semantic tendency, i.e. Transformers are taken to do better because they capture something about semantic processing in the brain. But these discussions remain mostly suggestive, with the experiment by Michaelov et al. concerning the predictive power of cosine distance being the strongest piece of evidence, as far as I can tell, and that is not particularly strong evidence since the cosine distance doesn’t necessarily just concern semantics. Without a better understanding of language processing in the brain, it might prove difficult to reconstruct why the Transformer-architecture performs so well. Even worse, without better understanding of the human brain, it will become increasingly difficult to compare neural architectures in this way.

Footnotes

[0] I don’t understand why this literature is so averse to publishing tables. Graphs are good, but being able to check against a table of data provides a way to test whether one has truly understood what is going on.

[1] From the paper, it is not entirely clear to me whether the next word or a randomly masked word has to be predicted.

[2] There is still a linear model trained on top of the randomly initialised models.

[3] The Futrell 2018 dataset used by the authors is also story-based and the Transformer-model does better at predicting it, but it consists of self-paced reading data instead of brain measurements.

References

Blank, I., Kanwisher, N., & Fedorenko, E. (2014). A functional dissociation between language and multiple-demand systems revealed in patterns of BOLD signal fluctuations. Journal of Neurophysiology, 112(5), 1105–1118.
Caucheteux, C., & King, J.-R. (2022). Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1), 1–10.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [Cs].
Futrell, R., Gibson, E., Tily, H. J., Blank, I., Vishnevetsky, A., Piantadosi, S., & Fedorenko, E. (2018, May). The Natural Stories Corpus. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Merkx, D., & Frank, S. L. (2021). Human Sentence Processing: Recurrence or Attention? Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, 12–22.
Michaelov, J. A., Bardolph, M. D., Coulson, S., & Bergen, B. K. (2021). Different kinds of cognitive plausibility: Why are transformers better than RNNs at predicting N400 amplitude? ArXiv:2107.09648 [Cs].
Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., Tenenbaum, J. B., & Fedorenko, E. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45), e2105646118.
Yuan, Z., Tyen, G., & Strohmaier, D. (2021a). Cambridge at SemEval-2021 Task 1: An Ensemble of Feature-Based and Neural Models for Lexical Complexity Prediction. Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), 590–597.
Yuan, Z., & Strohmaier, D. (2021b). Cambridge at SemEval-2021 Task 2: Neural WiC-Model with Data Augmentation and Exploration of Representation. Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), 730–737.

The Unmasking of Dictionaries by Strong Opinion

Tue, 31 May 2022 17:00:13 +0100

Disclaimer

I am currently funded by money from Cambridge University Press and Assessment, which also stewards various English dictionaries. All opinions in this post are distinctly mine.

The Unmasking of English Dictionaries (CUP, 2018) is on a mission to change lexicography forever and on the way it tries to insult as many lexicographers as possible. Its author, the linguist R. M. W. Dixon, has a chip on his shoulder. Dictionaries of the English language are all wrong. Their creators misunderstand what a dictionary is for, and are, in general, lazy plagiarists. Surprisingly, the book is not just entertaining, but also makes intriguing suggestions, even though some of the arguments for them have serious gaps.

Dixon repeats again and again that the purpose of a dictionary is to “tell you when to use one word rather than another” (p. ix). That is the premise, and on its basis Dixon discusses the shortcomings of dictionaries:

They treat words in isolation, rather than contrastively within their semantic field
They rely excessively on definitions
They neglect to provide grammatical information that is required for correct word use

The examples Dixon gives suggest that there is at least something to his diagnoses. He proposes that the problem be solved by the construction of a new dictionary organised by semantic fields. The entries for these fields would compare the usage of the words contained in them, including the grammatical constraints on this usage. For illustration, the book contains a few sketches of such comparative discussions of lexical semantics, e.g. the field including “want”, “wish”, “desire”.

One might, however, wonder about the correctness of Dixon’s premise that the main purpose of dictionaries is to enable the choice between words to use. I don’t think it’s true, at least descriptively. Dixon takes a distinctly productive task as the purpose for a dictionary, choosing one word to use over others. I expect, however, that much of the use of dictionaries is receptive. My expectation is that dictionaries are most frequently consulted when one is stumped by a previously unfamiliar word in a text one tries to comprehend. Of course, that is speculation on my part, but so is Dixon’s claim that the purpose relates to productive use of English. And that leads us to the heart of the problem: Dixon does not systematically engage with users of dictionaries – other than himself, that is – even though the whole point was to propose a new type of dictionary that is better suited for the needs of its users.

After another swipe against lexicographers as lazy copyists, Dixon proposes the following procedure for producing a dictionary (p. 25-26):

Select sets of related words
Consult corpora to compare and contrast those related words
Work out a conceptual template for the sets of words using “critical notions”
Only at the last step should one compare with other scholars and dictionaries.

Dixon’s proposed procedure does not include users at any point. There are no user studies, not even a step to incorporate informal feedback. The goal is to work “from first principles and with a fresh viewpoint” (p. 25). These first principles might please a linguistic expert such as Dixon, but surely the average dictionary user has different needs and these should be assessed in the process of constructing a dictionary.

The lack of considering actual users and their needs also shows up in another assumption by Dixon, namely that contrastive but relatively abstract outlining of different usage patterns is sufficient to help with the choice between words. Dixon would have dictionaries present everything that is required for choosing between words. That includes a lot of rather abstract linguistic information. The distinction of different types of clauses might quickly overwhelm a learner who just wanted to understand what “hanker” meant in a text they were reading, and while Dixon does envisage the usage of sentential examples, the theory-driven contrasting of words in a semantic set comes first as the organising principle (see p. 227).

I would also like to add that for someone who emphasises “first principles”, Dixon does not spend much time on actually laying out his theoretical framework for lexical semantics. From Dixon’s approach, I would assume that he endorses some sort of lexical relation/frame semantics, but it is not obvious that these approaches correctly reflect word senses as they are cognitively encoded. Surely the first principles for organising a dictionary would be principles that reflect the entries in our mental lexicon? The problem here might be that Dixon does not think highly of many efforts of investigating “the role of language in human cognition” (p. 192). Although my main criticism is that the focus on linguistic “first principles” is to the exclusion of empirically assessing dictionary user needs, it might be noted that even the claimed “first principles” are not exactly fast foundations (using the word “fast” here in the antiquated secondary sense discussed on p. 131-134).

Dixon’s book has great entertainment potential, especially for those of us who enjoy academic philippics. As is common for this text genre, the positive argument reveals holes upon closer inspection. Dixon’s assumptions should be considered expert guesses about dictionary use, but guesses they remain. That being said, investigating Dixon’s proposals in actual user studies might be of great interest. The results could show to what extent lexicography really needs to be reborn.

We Know So Little

Sun, 15 Aug 2021 22:30:13 +0100

Will machine learning (ML) solve natural language understanding (NLU)? A recent essay in The Gradient by Walid Saba argues that it won’t. I lack the confidence for either affirming or denying that ML will lead to NLU, especially without much further explanation of what we understand ML and NLU to be, but I am confident that Saba’s arguments are not of the knock-down kind.

A part of me would like Saba’s arguments to succeed. While I mostly work with neural nets and other ML methods, I have a soft spot for symbolic approaches to NLP. I have read and enjoyed Representation and Inference for Natural Language and I am an avowed admirer of Prolog. When I got into NLP, I began by reading Chomsky’s Syntactic Structures; only later did I read Neural Network Methods for Natural Language Processing. Certainly, I am the kind of person Saba’s argument should appeal to, and yet… and yet I can’t say it wins me over. At the end, I’m left unsure, not knowing whether ML will solve NLU.

In this post, I won’t try to cover all arguments from The Gradient essay. For example, I won’t cover what Saba says on intensions, other than to frankly admit being puzzled by his claim that ML is all about extension. I’ll leave those arguments to others. Instead, I’ll argue that we just don’t know enough about how language fundamentally works to adjudicate whether ML can solve NLU.[0] To make this argument, I pick out one of Saba’s claims about language and argue that the situation is more complicated. The claim I will take issue with is the following:

[…] language understanding does not admit any degrees of freedom. A full understanding of an utterance or a question requires understanding the one and only one thought that a speaker is trying to convey.

According to Saba, when we speak we are trying to convey one determinate thought, that is, a thought with a determinate content. The understanding of this content does not admit any degrees of freedom. As I interpret Saba, there is a matter of fact whether one correctly understands the other person or not and this fact either obtains or it does not. It doesn’t hold in degrees, only absolutely.

In response, one might be tempted to point to examples where people are misunderstanding in degrees. If a speaker utters the sentence “The train is late” and one listener misunderstands it as meaning that the train will not arrive today at all and another listener misunderstands it as meaning that bananas are straight, then both are misunderstanding the sentence but the second listener is doing worse. As the example shows, one can misunderstand someone else more or less badly. But Saba can accept that one can be wrong in degrees, because his point is only that full understanding does not admit any degrees of freedom. There might be many ways of doing it wrong, but there is only one way of doing it right. According to Saba, when we understand each other, there is one and only one thought with a determinate content to understand for each utterance. That is a much more plausible position; nonetheless, I will disagree with it.

Consider Saba’s own example of an utterance:

Do we have a retired BBC reporter that was based in an East European country during the Cold War?

For the sake of illustration, assume that I utter this question as a member of a network of experts and that I want to know whether we, the network, include such a person. Saba suggests that I am expressing one determinate thought, that there is one correct analysis of my utterance, which an NLU system should produce. According to him, there is no degree of freedom in this analysis. I disagree, or at least I see good reasons for disagreement.

As Saba states, understanding the exact thought of the question requires interpreting the phrase “retired BBC reporter”. This interpretation, however, turns out to be much harder than his gloss “the set of all reporters that worked at BBC and who are now retired” suggests. To see the problem, assume that in response to my question, someone asks me whether I intended to include freelance reporters who worked for the BBC or only its employees. The honest response to this question might very well be that I don’t know. I don’t know whether I meant to include freelancers in the extension of “BBC reporters” or not. Of course, I can make it up on the spot now, but I cannot decide the difference with regard to my prior intentions.

Contrary to Saba, the difference I cannot decide is semantically significant. It might be that a former BBC freelancer meeting the description belongs to the network, but no employee BBC reporter does. Whether my question is to be answered affirmatively depends on a difference in phrasal meaning that 1) I do not know how to resolve, and 2) I do not know whether I intended to resolve at all when I uttered the sentence.[1]

It seems that there is not one determinate content I sought to express.[2] There are at least two propositions that seem to fit my intention. But you might disagree and suggest that I intended to express one specific determinate proposition; I just no longer know, or never knew, which one. In other words, instead of denying the determinacy of intended thought, you deny the epistemic access to the determinate intended thought. This suggestion seeks to rescue Saba’s argument with an epistemic move.

I don’t know whether the epistemic move itself can be pulled off – do I really lack this introspection? – but I am confident that, in any case, it won’t achieve the argumentative goal. It cannot rescue Saba’s argument, because if I don’t have access to my determinate thoughts, you certainly don’t either. Even if one of the two interpretations is the truly correct one, you at best have approximately correct access to it. Yet, you have NLU, you understand natural language as well as any other human. You would have NLU without access to the one true thought, human-level NLU rather than super-human-level NLU. If Saba’s arguments only showed that ML can lead to no better NLU than human-level NLU, then those working on ML-based NLU won’t be all that worried.

My overall argument does not depend on whether I am right in the final analysis. Maybe I intended to utter one determinate thought and maybe it is accessible to humans. Even if this were so, we do not know it. What matters is that Saba’s assumption is not safely established. We do not know that

[…] language understanding does not admit any degrees of freedom. A full understanding of an utterance or a question requires understanding the one and only one thought that a speaker is trying to convey.

We know too little about the foundations of language.

Footnotes

[0] In my argument, I’m applying relatively high standards of knowledge. (Different standards for knowledge? See David Lewis’ paper Elusive Knowledge.) By denying that we have knowledge about how language fundamentally works, I am not denying that we have theories about it and I am not even ruling out that one of these theories is largely correct. I am, instead, suggesting that no theory of language and our understanding of it reaches the level of certainty Saba presumes.

[1] This state of affairs differs from the missing text phenomenon, the fact that we do not express the fullness of our thoughts in utterances, that Saba happily acknowledges and makes argumentative use of. In my example, I’m not just leaving part of my thought unsaid because the part can be derived from my fragmentary statement together with common knowledge. Otherwise, I would myself be able to recover the left out part.

[2] That claim resembles, of course, Quine’s indeterminacy of translation. That being said, I am not sure what to make of Quine’s position, because I am not sharing his behaviourist assumptions and I do not know whether his position can be defended without them.

On the State of Analytic Philosophy

Wed, 07 Jul 2021 17:30:13 +0100

A debate about the state of analytic philosophy has been developing in the philosophical blogosphere over the last few months, started by Liam Bright’s pessimistic assessment of the state. In this original post, Bright described analytic philosophy as a “degenerate research programme”. No longer was there a shared paradigm, and instead philosophers either took a politically applied turn or just bumbled along not knowing what else to do.

Bright summarises the situation by describing a threefold lack of confidence:

Lack of confidence that analytic philosophy can solve its own problems.
Lack of confidence that analytic philosophy can be modified so as to do better.
Lack of confidence that the problems are worth solving in the first place.

Overall, I found myself largely agreeing with the pessimistic sentiments expressed in Bright’s post; otherwise I presumably wouldn’t have switched fields. That being said, I am modestly more optimistic on 1 and 2, as should become clear later in the presentation of my perspective.

The Debate

Bright’s post has started a debate, which has underlined for me how different the various sub-groups of academic philosophy are. Representatives of different areas in philosophy give different responses and many disagree with Bright more than I do.[0] One example of that is the recent guest post by Preston Stovall on Bright’s blog.

Stovall criticises the “march of Kripke” narrative in Bright’s original post, i.e. the narrative that sees Kripke’s work as the high point of the analytic tradition on which the later work relied, and by changing the narrative Stovall suggests a vision for a unified analytic philosophy. While I am vaguely familiar with and attracted by the narrative that Stovall sketches, I cannot say that I recognise much from my own philosophical-academic experience and work in it. To tell the truth, I can recognise it as a distinctly Pittsburghian approach to analytic philosophy, rather than the one I am used to. Perhaps this Pittsburghian view can take over, and unification be achieved behind another tradition within analytic philosophy. For now, however, a diversity of viewpoints prevails.

To add to this diversity, I want to present my own perspective, that is the perspective of one particular person who has turned to computer science out of dissatisfaction with philosophy’s current state. Hence, my post will be unabashedly self-centred, focussing on three of my own qualms with academic analytic philosophy:

Dissatisfaction with the methods used.
Lack of interest in the questions of the applied turn.
Lack of opportunities for career advancement.

Dissatisfaction with Methods

My dissatisfaction with the methods of analytic philosophy is an instance of the one described by Bright. I share the lack of confidence that analytic philosophy as it stands can solve its own problems and it troubles me deeply. But not all find this prospect of unsolved problems so dismal. As Bright summarises, one common response to his diagnosis of the lack of confidence suggests that

while it is true that philosophers generally cannot plausibly believe they will achieve rational consensus, this is not such a bad thing. The mistake was ever hoping for that in the first place, and once we have gotten over that hangup we can enjoy the sort of pluralistic free play of ideas that comes with a taste for dissensus.

I find the lack of ambition in this response deeply unappealing. We should aspire to solve our problems, or at least to make substantial progress towards such solutions. The rationales for the lack of ambition do not convince me. They seem to turn on the nature of philosophy and suggest that it is an open-ended discipline that can reach no conclusions at all. As is to be expected, I disagree with this view and while settling the nature of philosophy is beyond the reach of a blog post, outlining the difference is not.

It might be true that philosophy constantly raises new questions. But there is a need to distinguish the question of whether philosophy as a discipline can always raise new unanswered questions from the question of whether we can answer current questions in philosophy.[1] To give an example, I have published on the nature of social groups and I want philosophers to reach a rational consensus on this issue. Are groups pluralities or not? The common response appears to suggest that we might clarify this question itself, but never quite answer it.

I am not as pessimistic as those who respond by accepting the problems as unresolvable. In contrast to them, I hold out a modicum of hope that one by one, we could reach widespread consensus in philosophy.[2] Undoubtedly it will be challenging and it might be a never-ending quest, since we are never running out of new questions. Still, that is a far cry from the pessimism of being unable to reach consensus on any of them. In fact, I believe it makes me even more optimistic than Bright, who apparently does not dare to hope for a methodological renewal, at least not one that leads to true problem-solving.

That being said, I am to be counted amongst the pessimists insofar as I believe in the need for a far-reaching methodological change, crossing disciplinary boundaries, and do not see such change happening at the moment. My pessimism is sustained by a folk-sociological assessment, not by one of philosophy’s nature.

Applied Turn

Analytic philosophy in the US and the UK has undergone a sharp turn towards socio-politically hot topics, such as racism and gender. Before I turned to computer science, I primarily worked in social ontology, a sub-field of philosophy in which the applied turn has been especially notable. That is not entirely surprising, since the applied turn has focused on social issues. But my experience of it has differed from that described by Bright:

Many of the projects that seem most exciting to junior philosophers concern injustice, oppression, propaganda, ideology – all things about which it is felt that philosophical analysis might be able to have a real world impact.

I am one of the junior philosophers for whom that was not the case. These projects did not excite me. Similar to Bright, I remain sceptical about the ability of philosophers to have “real world impact” in this way. But even if one were to grant that philosophy can have the intended impact, there are subtle and not-so-subtle differences between my political view and the one hegemonic in the applied turn. These differences give the applied turn a direction that does not suit me and what I want.

As I did not share the political sentiment of the applied turn, I mostly stayed away from it. Over the years, however, its influence in social ontology increased and crowded out the issues I was more interested in, such as the ontological foundations of the social sciences. While my interests still form part of social ontology and are recognised as such, the crowding-out effect matters especially on a difficult job market, where opportunities for career advancement are scarce.

Lack of Career Opportunities and Conclusion

I take a rather naive view on the issue of the philosophy job market. Of course, one can bemoan the state of funding for the humanities and on some days I have sympathies for such a take – usually on days when my attention is not on the relative scarcity of GPU time. But bracketing this issue, there are too many people for too few positions. A solution is for people to leave the academic discipline. From my perspective, this does not happen to a sufficient degree for two reasons:

Lack of confidence in one’s own skills
Exaggerated attachment to philosophy as an academic field

Many philosophy graduates believe they are only suited for philosophy, underestimating their skills and most importantly their capacity to acquire further skills. It is my experience that with few exceptions, notably students working in cognitive science, philosophy graduates tend not to think of themselves as people who can program. The large majority of them most certainly could and they could earn income with this skill. If they gave programming a try, they might also find it intellectually rewarding.

The second claim rests on my impression that philosophers identify deeply with philosophy as a field. You do not just study philosophy, you are a philosopher. Certainly, I have felt that way and to an extent still do. It is deeply appealing to identify with such a rich tradition of more than 2000 years, a tradition that has brought about many great insights. But that identification does not imply that one has to secure a position in academic philosophy. As is well-known, few of the most famous philosophers of history worked as academics. In addition, philosophy can also be pursued from positions in other fields.

Of course, I solved my problems by moving into another field, computer science. My actions and the view expressed in this post are coherent. Furthermore, my choice of computer science addresses all three issues I have raised. While analytic philosophy took a socio-politically applied turn, I instead chose an implementational turn. This implementational turn, so I hoped and still do, could help us overcome the methodological impasse in asking the questions of philosophy. Of course, computer science also offers more opportunities for career advancement. I can always sell my Python, NLP, and Deep Learning skills on the job market.

Clearly, I am advertising the solution of switching to computer science, but I don’t think this post will convince many. That one can earn more with a degree in CS is well-known and widely accepted, so my methodological claim and its argumentative justification would have to do the work of changing minds. But because I have not sufficiently argued for how the methodological impasse of philosophy could be overcome by the methods of computer science, my post lacks argumentative power. At this point, I doubt I have an argument that can do the work. Luckily, I only promised to offer my perspective in this post.

Footnotes

[0] For more of the debate, follow these links to various blog posts: 1, 2, 3.

[1] A connecting assumption is a deep holism about philosophical questions, as expressed in this blogpost:

Any philosophical problem is all philosophical problems. You will have known nothing if you have not known everything.

I don’t want to dismiss such assertions entirely, but notice that asserting it leads to an odd incoherence. As far as I am concerned, the question of holism about philosophy is a philosophical question. To know the positive answer to it, that is to know that any philosophical problem is all philosophical problems, would therefore require us to know everything.

[2] I am not ruling out that some problems in philosophy might not be solvable. There might be a lack of epistemic access of one sort or another. But I do not think that we have justification to act on this assumption with regards to all or most philosophical problems. The background of my position is Peircean, taking hope in success as an important factor for the progress of inquiry.

Using Prolog for Sudoku Variants

Fri, 02 Jul 2021 22:30:13 +0100

The Sudoku scene has undoubtedly been one of the pandemic winners. Thanks to the YouTube channel “Cracking the Cryptic”, its viral video on the “Miracle Sudoku”, and the many entertaining videos that followed, Sudoku puzzles with extended rule-sets have received widespread attention. That is a prime opportunity for Prolog aficionados like myself to show off the power of the language. Many Sudoku puzzles are easily solved with Prolog.

Existing Resources

A solver for standard Sudokus is a teaching example for the CLPFD library. The Power of Prolog has a dedicated page and video for solving standard Sudokus. Puzzles with extended rule-sets have not gone unnoticed either. In fact, the original “Miracle Sudoku” video has been discussed and solved with Prolog in a blog post by Benjamin Congdon. I want to add a little to these solvers.

The extended solver of Congdon adds three constraints to the classical solver:

King’s Move: Cells that are removed from each other by the equivalent of a move of a chess king cannot contain the same digit.
Knight’s Move: Cells that are removed from each other by the equivalent of a move of a chess knight cannot contain the same digit.
Orthogonal Adjacency: Orthogonally adjacent cells cannot contain consecutive digits.

If you want to see how to program these constraints, see Congdon’s post. But there are other constraints that often appear on Cracking the Cryptic and I thought I would fill the gap. For a start, I want to address one of the most common constraint types:

Thermo: Numbers on a line are monotonically increasing starting from a thermometer bulb.

Full Solution: Thermo

For the Thermo constraint, I’ve chosen the great “Spoons” puzzle by the well-known setter Phistomefel. To solve that puzzle yourself, follow this link. To solve it with Prolog, all we need beyond a standard solver are the following two lines and the inclusion of the specific constraints:

smaller(L,Sn,L) :- Sn #< L.
thermo([L|Ls]) :- foldl(smaller,Ls,L,_).

The thermo predicate defined in these lines checks whether a list of integers increases strictly from left to right; applied to unbound cells, it posts the corresponding constraints.[0]
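As a small sanity check, the predicate can be tried outside a Sudoku grid. The following is only a sketch, assuming SWI-Prolog with library(clpfd); the helper thermo3_solutions/3 is a hypothetical name introduced for illustration and is not part of the solver. The two clauses are repeated so that the snippet loads on its own.

```prolog
:- use_module(library(clpfd)).

% The two clauses from the post, repeated so the snippet is self-contained.
smaller(L,Sn,L) :- Sn #< L.
thermo([L|Ls]) :- foldl(smaller,Ls,L,_).

% Hypothetical helper (not in the original solver): collect every way to
% fill a three-cell thermometer with digits from Low..High.
thermo3_solutions(Low, High, Solutions) :-
	findall([X,Y,Z],
	        ( [X,Y,Z] ins Low..High,   % restrict the cells to the domain
	          thermo([X,Y,Z]),         % post X #< Y and Y #< Z
	          label([X,Y,Z]) ),        % enumerate concrete digits
	        Solutions).
```

With the domain 1..4, the query `?- thermo3_solutions(1, 4, S).` should give S = [[1,2,3],[1,2,4],[1,3,4],[2,3,4]], while `?- thermo([1,3,3]).` fails because the last two cells are equal.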

My complete solution, based on the previous solvers mentioned above, looks as follows:

:- use_module(library(clpfd)).

puzzle(Rows) :-
	Rows = [
		[A1,A2,A3,A4,A5,A6,A7,A8,A9],
		[B1,B2,B3,B4,B5,B6,B7,B8,B9],
		[C1,C2,C3,C4,C5,C6,C7,C8,C9],
		[D1,D2,D3,D4,D5,D6,D7,D8,D9],
		[E1,E2,E3,E4,E5,E6,E7,E8,E9],
		[F1,F2,F3,F4,F5,F6,F7,F8,F9],
		[G1,G2,G3,G4,G5,G6,G7,G8,G9],
		[H1,H2,H3,H4,H5,H6,H7,H8,H9],
		[I1,I2,I3,I4,I5,I6,I7,I8,I9]
		],
	sudoku(Rows),
	thermo([A3,A4,A5]),
	thermo([B2,C2,D2]),
	thermo([B3,C3,D3]),
	thermo([B4,C4,D4]),
	thermo([B5,C5,D5]),
	thermo([B7,C7,D7]),
	thermo([B8,C8,D8]),
	thermo([B9,C9,D9]),
	thermo([E3,E4,E5]),
	thermo([F1,G1,H1]),
	thermo([F3,G3,H3]),
	thermo([F4,G4,H4]),
	thermo([F6,G6,H6]),
	thermo([F7,G7,H7]),
	thermo([F8,G8,H8]),
	thermo([F9,G9,H9]),
	thermo([I3,I4,I5]),
	thermo([I8,I7,I6]).

sudoku(Rows) :-
	append(Rows, Vs), Vs ins 1..9,
	maplist(all_distinct, Rows),
	transpose(Rows, Columns),
	maplist(all_distinct, Columns),
	[As,Bs,Cs,Ds,Es,Fs,Gs,Hs,Is] = Rows,
	blocks(As, Bs, Cs),
	blocks(Ds, Es, Fs),
	blocks(Gs, Hs, Is).

blocks([], [], []).
blocks([N1,N2,N3|Ns1], [N4,N5,N6|Ns2], [N7,N8,N9|Ns3]) :-
    all_distinct([N1,N2,N3,N4,N5,N6,N7,N8,N9]),
    blocks(Ns1, Ns2, Ns3).

smaller(L,Sn,L) :- Sn #< L.
thermo([L|Ls]) :- foldl(smaller,Ls,L,_).

:- time((puzzle(Rows), maplist(labeling([ff]), Rows))),
	maplist(portray_clause, Rows).

The solve took 0.141 seconds on my laptop.

Other Constraints

To show off the power of Prolog a little more, I’ll finish with the implementation of two more constraints.

Summing constraints are equally straightforward to handle. There are in fact multiple variations of summing constraints, including summing along arrows and summing along diagonals (little killer clues). The code will usually be the same:

add(X,Y,S):- S #= X+Y.
sum(Xs,S):- foldl(add,Xs,0,S).

The predicate sum relates a list of integers – order does not matter – to its sum S. When we implement a Sudoku puzzle, the S will usually be another variable in the case of arrow clues and in the case of little killer clues, it will usually be a given digit.
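To make the two use cases concrete, here is a minimal sketch, again assuming SWI-Prolog with library(clpfd). The wrapper predicates arrow/2 and little_killer/2 are hypothetical names chosen for illustration; they simply call sum with the arguments in the appropriate roles.

```prolog
:- use_module(library(clpfd)).

% The two clauses from the post, repeated so the snippet is self-contained.
add(X,Y,S):- S #= X+Y.
sum(Xs,S):- foldl(add,Xs,0,S).

% Hypothetical wrapper for an arrow clue: the digit in the circle is
% itself a grid variable and must equal the sum of the cells on the shaft.
arrow(Circle, Shaft) :- sum(Shaft, Circle).

% Hypothetical wrapper for a little killer clue: the cells on the
% diagonal must add up to a total given outside the grid.
little_killer(Total, Diagonal) :- sum(Diagonal, Total).
```

For instance, `?- [A,B] ins 1..9, arrow(C, [A,B]), A = 2, B = 5.` binds C to 7, and `?- X in 1..9, little_killer(17, [8,X]).` lets propagation alone determine X = 9.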

The disjoint groups rule is a further fascinating constraint. It is defined as follows:

cells with the same position in 3x3 boxes contains number from 1 to 9 i.e no number can repeat in the same position in 3x3 boxes.

I wrote a working implementation for the disjoint group constraint and I post it here for completeness, but it is not very elegant.

disjoint(Rows) :-
	by3(Rows,First-[],Second-[],Third-[]),
	maplist(distinct_sets,[First,Second,Third]).

distinct_sets(Rows) :- row_sets(Rows,FSet,SSet,TSet),
                       maplist(all_distinct,[FSet,SSet,TSet]).

row_sets([],[],[],[]).
row_sets([H|Rows],L1,L2,L3) :- by3(H,L1-A,L2-B,L3-C),
                               row_sets(Rows,A,B,C).

by3([],A-A,B-B,C-C).
by3([N1,N2,N3|R],[N1|F]-A,[N2|S]-B,[N3|T]-C) :- by3(R,F-A,S-B,T-C).

In a nutshell, the predicate disjoint groups every third row (A, D, G), every third+1 row (B, E, H), and every third+2 row (C, F, I) together and then applies the same grouping within rows to create the disjoint sets. If you have a better implementation of the disjoint group constraint, then email me. And if you think you understand how it works and want to implement a solver yourself, give this puzzle a try. I would love to see a good Prolog solver for it.
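The grouping can be checked on a fully filled grid. The sketch below assumes SWI-Prolog with library(clpfd) and repeats the predicates so that it loads on its own. The grid in example_grid/1 is a hypothetical test case made up for this illustration (each row is a cyclic shift of 1..9); it is not the linked puzzle.

```prolog
:- use_module(library(clpfd)).

% The predicates from the post, repeated so the snippet is self-contained.
disjoint(Rows) :-
	by3(Rows,First-[],Second-[],Third-[]),
	maplist(distinct_sets,[First,Second,Third]).

distinct_sets(Rows) :- row_sets(Rows,FSet,SSet,TSet),
                       maplist(all_distinct,[FSet,SSet,TSet]).

row_sets([],[],[],[]).
row_sets([H|Rows],L1,L2,L3) :- by3(H,L1-A,L2-B,L3-C),
                               row_sets(Rows,A,B,C).

% by3 deals a list into three difference lists, round-robin.
by3([],A-A,B-B,C-C).
by3([N1,N2,N3|R],[N1|F]-A,[N2|S]-B,[N3|T]-C) :- by3(R,F-A,S-B,T-C).

% Hypothetical example grid: a valid Sudoku in which every box position
% (e.g. the top-left cell of each 3x3 box) holds nine different digits.
example_grid([[1,2,3,4,5,6,7,8,9],
              [4,5,6,7,8,9,1,2,3],
              [7,8,9,1,2,3,4,5,6],
              [2,3,4,5,6,7,8,9,1],
              [5,6,7,8,9,1,2,3,4],
              [8,9,1,2,3,4,5,6,7],
              [3,4,5,6,7,8,9,1,2],
              [6,7,8,9,1,2,3,4,5],
              [9,1,2,3,4,5,6,7,8]]).
```

The query `?- example_grid(G), disjoint(G).` succeeds. Swapping the second and fourth rows puts the digits 1, 4 and 7 twice into the top-left box positions, so `?- example_grid([R1,R2,R3,R4|Rest]), disjoint([R1,R4,R3,R2|Rest]).` fails.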

Update [22.07.2021]

I’ve been sent this clever implementation of the disjoint group constraint by Janne U. using a DCG:

disjoint_groups2(Rows) :-
	phrase(blockrows(Rows), Blocks),
	transpose(Blocks, BlocksT),
	maplist(all_distinct, BlocksT).

blockrows([]) --> [].
blockrows([[],[],[]|R]) --> blockrows(R).
blockrows([[N1,N2,N3|Ns1], [N4,N5,N6|Ns2], [N7,N8,N9|Ns3]|R]) -->
	[[N1,N2,N3,N4,N5,N6,N7,N8,N9]],
	blockrows([Ns1,Ns2,Ns3|R]).

Footnote

[0] I consistently use here the predicates from the CLPFD library, rather than the vanilla mathematical predicates available in Prolog.

Notes on Standing and Occasion Meaning

Wed, 03 Feb 2021 15:37:13 +0000

Lexical semantics investigates the meaning of words, but one might distinguish multiple levels of meaning for a single word. For example, the word “semantics” might have a very broad and a very narrow meaning at the same time, depending on how much contextual information we take into account. The relative dependence on contextual information creates lexical levels of meaning. In this post, I will share a few thoughts on the simplest distinction of lexical levels of meaning, that is the distinction between standing and occasion meaning of a word. The main insight will be that the distinction faces a problem with regard to homonyms and that current NLP approaches can be interpreted as implementing a specific solution to this problem.

Standing and Occasion Meaning

Standing meaning is the meaning of a word in general, while occasion meaning is word meaning for a specific occasion. By drawing this line, we split word meaning on the one side into a generic core without contextual dependence, and on the other into a specification that depends on the linguistic context. To give an example, “student” might broadly denote learners, while in a specific context the denotation of the word might be narrowed down to the registered participants of a class. The one is the standing and the other is the occasion meaning.

The distinction between standing and occasion meaning has a long history, going back at least to the late 19th century work of the linguist Hermann Paul. Paul distinguished between usueller and okkasioneller Bedeutung (usual/standing and occasional meaning) (see Geeraerts 2010: 14-16).[0] It is perhaps the most frequent way of distinguishing levels of meaning for a word.

Faced with this distinction, we might wonder what exactly instantiates the two types of meaning. So far, I have generically written about the meaning of a word, but we can be more specific. Either the word type or the word tokens could instantiate the two levels of meaning. A relatively intuitive response is to attribute the standing meaning to word types and the occasion meaning to word tokens, as Recanati (2012) does. After all, the standing meaning and the word type are both more abstract and generic, while the occasion meaning and the word token are more concrete and specific. But this response is not without problems, because it commits us to word types having one specific meaning in the absence of any context. In the case of homonyms, such as “bank”, the standing meaning becomes unclear.

The Problem of Homonyms

What is the standing meaning of “bank”? There is arguably not one standing meaning for this string which covers all uses. Prima facie, “bank” in the sense of financial institute does not share a meaning with “bank” in the sense of edge of a watercourse. In such cases, we have multiple options:

Assign the word type a single disjunctive meaning.
Distinguish two or more standing meanings for the word type.
Distinguish two or more homonymous word types each with their own standing meaning.
Deny the existence of a standing meaning and only postulate occasion meaning for the word type in question.

The first option sticks with word types having a single standing meaning and just renders it disjunctive. We would think of “bank” as meaning financial-institute-or-edge-of-watercourse. But when words develop many meanings, then this standing meaning will become unwieldy and increasingly empty. “Bank” can also refer to buildings housing financial institutes and to databanks, and to piggy banks, etc. The resulting disjunction will be incredibly broad.

On the second option we just accept that a word type, that is one abstract word string, can have multiple standing meanings and the context will then pick out one of them for further specification. In response, one might ask whether “bank” also has the standing meaning of the German “Bank” which means bench. If we individuate words purely as strings that instantiate meaning, it would seem to be the case, but that might be at odds with an understanding of words that makes them specific to languages.

Recanati chooses the third option and individuates word types in terms of their standing meaning, but I don’t find it an obvious choice. Individuating word types in terms of standing meaning creates the challenge of changes in word type meaning over time. We might want to say that a word type has undergone (standing) meaning change, but this assertion would become incorrect if word types are individuated in terms of their meaning. Their meaning would become an individuating characteristic, requiring us to postulate a new word whenever we detect a different standing meaning. While that does not rule out the third option, it makes it less appealing.

The fourth option abandons the distinction between standing and occasion meaning, at least for word types that are homonyms. This move throws into doubt the whole project of having general levels of word meaning. After all, the standing and occasion meaning distinction was supposed to be the simplest possible differentiation between levels.

A fifth option, not listed above, would be to diverge from Recanati even further and assign both standing and occasion meaning to word tokens rather than word types. It seems a bit odd, however, to claim that words have a meaning only relative to a specific use.

I’ll not argue at length for one of these options here – my favourite is the second option, but the differences can be quite subtle – instead I’ll end this post by considering the problem from the perspective of contemporary NLP.

A Few Remarks from NLP

Some neural architectures, prominently transformer architectures, represent lexical meaning at multiple points. Specifically, they represent meaning at an initial embedding layer and at later points of encoding. As a result, non-contextualised and multiple sets of contextualised word representations can be extracted from e.g. BERT models (Devlin et al. 2019).[1] We could then go on to identify the non-contextualised embeddings with the standing meaning and the final contextualised embeddings with the occasion meaning. This interpretation is sketched in Emerson (2020).

Of these models, we can then ask how they deal with the problem of homonyms. BERT and similar approaches effectively implement the first option and assign a disjunctive standing meaning. The initial embedding for “bank” does not differ for the two homonyms. That could be changed, of course. We could preprocess the data with a coarse-grained word sense disambiguation (WSD) system and create different initial embeddings based on the results (cf. Trask et al. 2015).

If you have a preference for either the second option, as I do, or Recanati’s option of individuating word types in terms of standing meanings, then you would not be satisfied with equating the initial embeddings with standing meanings. The introduction of coarse-grained WSD would fit these approaches much better.

Footnotes

[0] Another important source for this distinction is Quine (2013), who distinguished occasion from standing sentences. But Quine’s approach is much more behaviourist.

[1] I neglect here that BERT uses sub-tokenization.

References

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [Cs]. http://arxiv.org/abs/1810.04805

Emerson, G. (2020). What are the Goals of Distributional Semantics? Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7436–7453. https://doi.org/10.18653/v1/2020.acl-main.663

Geeraerts, D. (2010). Theories of Lexical Semantics. Oxford University Press.

Quine, W. V. O. (2013[1960]). Word and Object (new edition). MIT Press.

Recanati, F. (2012). Compositionality, Flexibility, And Context Dependence. In The Oxford Handbook of Compositionality (pp. 175–191). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199541072.013.0008

Trask, A., Michalak, P., & Liu, J. (2015). sense2vec—A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings. ArXiv:1511.06388 [Cs]. http://arxiv.org/abs/1511.06388

Follow Up: Why Learn Prolog in 2021?

Thu, 07 Jan 2021 15:37:13 +0000

My recent blog post arguing why one should learn Prolog in 2021 made its way to the front page of Hacker News (HN), where it started a discussion with more than 100 comments. I’m glad to see that some saw value in my post and I want to respond to a few comments from this discussion. Given the number of comments, I won’t write an exhaustive response but instead focus on a few themes I care about and try to fill some gaps I left in my original post. This post will be more eclectic and discuss:

How to evaluate the time investment into Prolog
Where the unfulfilled potential of Prolog might lie
Why I focussed on Prolog rather than another logic programming language
The aesthetic and epistemic reasons for learning Prolog

I consider this post a contribution to an ongoing debate, not its conclusion, and I hope it will be understood in such a way.

How to Evaluate the Time Investment

I framed my blog post as providing reasons to learn Prolog for university students. I sought to provide them with reasons to invest their time specifically in Prolog. Why should they spend the time on Prolog rather than on other opportunities?

I believe that the opportunity costs are especially worth considering for students of computer science, because these costs can be larger than in other disciplines. If a CS student learns a skill that is in high demand on the market – and that is not the case for Prolog at this point – they might increase their future salary.

A student would not be well advised to invest much time into Prolog if they primarily sought to increase their future salary while avoiding risk. There are better risk-averse time investment opportunities, such as learning more about neural network technologies. Of course, this assumes a student who is motivated by risk-secure monetary outcomes and would be able to act on this motivation.[0] Instead students

might be motivated by other aims more strongly than by the additional income, or
might be willing to take a riskier bet to raise their income, or
might find themselves unable to resist watching funny YouTube clips for the sake of earning money but able to resist them for the sake of learning Prolog.

More needs to be said about the case of students willing to take risky bets. My last post subtly suggested that Prolog might lead to an increased income, if at some future date Prolog realises its unfulfilled potential. Learning Prolog is a risky investment, but on the assumption of unfulfilled potential it is an investment with a potentially large pay-off.

What Is This Unfulfilled Potential?

In response to my claim that Prolog has unfulfilled potential, HN user mths wrote

Is there any reason to believe the paradigm will somehow come into its own in the future? The way this question was addressed by the article was way too wishy-washy for my taste.

I freely admit that I did not provide all that many details on this issue, because I feared the challenge of predicting the future direction of Prolog and the embarrassment if my specific predictions turn out to be mistaken. The comments, however, made me realise that I need to say more.

One area where many Prolog aficionados tend to see unfulfilled potential is constraint logic programming. Programming with constraints is powerful and has been explored relatively little. Constraint programming is also an area with relatively clear applications. It isn’t only helpful for solving Sudokus or logic puzzles – although that is true as well[1] – but e.g. also for various engineering tasks.

There are other contenders in the space of constraint programming, but Prolog’s design allows us to integrate constraint logic programming seamlessly into larger projects. As The Power of Prolog website puts it, constraints “blend in especially seamlessly into logic programming languages like Prolog due to their relational nature and built-in search and backtracking mechanisms”.

Another area where I believe Prolog has unfulfilled potential is the task of fully interpretable automated reasoning. Fully interpretable reasoning requires the ability to follow the inference process step by step, including the evidential basis, in a way that is comprehensible to the human cognitive architecture. While much work has pried open the black box of neural networks, I don’t see this level of interpretability as reachable without much revision of our neural architectures. Admittedly, for many applications, this lack of full interpretability is acceptable and in some domains it might even be unavoidable. In some domains, however, we might expect such a level of interpretability. The legal domain is a case in point. For at least some parts of legal decision processes which might strip people of their most basic freedoms, we ought to do our best to provide a fully interpretable inference process. In these domains, Prolog or Prolog-like languages have unfulfilled potential.

In addition to these two specific areas, let me offer a highly abstract reason why one should expect Prolog to have unfulfilled potential. I assume that Prolog is the main example of the logic programming paradigm, which in turn I assume to be one of the three major programming paradigms: imperative, functional, and logic. If those assumptions are granted – and there are plausible reasons to object to them – the question arises why the logic programming paradigm should be the only one of the three paradigms which does not find major areas of application. To my mind, it appears unlikely that there would be an entire approach to programming with a negligible domain of application. This argument is more of a hunch, but such hunches are the best guidance we have when it comes to hard-to-quantify unknowns, such as whether a technology has potential no one has even conceived of yet.

But Does It Have to Be Prolog?

A few discussion participants saw merit in the logic programming paradigm, but felt less comfortable with Prolog in particular.

For example, HN user qart asked:

I wonder… is it still justified to learn Prolog now? Aren’t there better alternatives for logic programming in many other common programming languages? I mean http://minikanren.org/

Another example of a logic programming language that popped up repeatedly in the discussion is Mercury.

I lack experience with both miniKanren and Mercury and so I won’t argue that they are better or worse realisations of the logic programming paradigm. Instead I want to suggest that one should prefer to learn Prolog, because there are more resources available for Prolog. Many of these resources were linked to in the HN thread itself.

Furthermore, given the presumed audience of university students, the question is mostly moot. Usually, departments would offer only one course in logic programming or require learning Prolog before learning more about other logic programming languages. The choices are limited by the curriculum.

I’m open to the thought that the future of logic programming will not be exactly ISO-compliant Prolog. That being said, I’d be surprised if no key elements of Prolog were available in that assumed future language, be it Horn clauses, unification with backtracking, or Prolog-style constraint logic programming.

Aesthetic and Epistemic Reasons

I don’t think one should fool oneself about learning Prolog and the lack of demand for Prolog skills on the job market, which was also a major topic on the HN thread. But I also made two other arguments, one appealing to the aesthetic and the other to the epistemic properties of Prolog. From the comments, I got the impression that these properties carried more weight with some discussion participants than with others. That is to be expected. What stood out to me, however, is that relatively few HN users questioned that Prolog has these properties.

Perhaps the closest to arguing against my claim that Prolog is intellectually beautiful and epistemically revelatory would be the comments criticising the language for not living up to its own ambitions. For example, infogulch complained that:

Prolog implementations are too heavily reliant on the stated order of predicate rules in order to make execution progress.

I have sympathy for such criticisms and in my post I admitted that “occasionally Prolog falls short of the programming-by-description-paradigm”. But much of the beauty derives from recognising the paradigm of which Prolog is the main example. Compare the experience to reading a novel in which the author has captured a completely new way of conceiving an aspect of reality, but some chapters fail to reflect this original conception. While one might argue that the novel is uneven, the new conception of reality within its pages is certainly a reason to read it. It seems to me that criticisms such as the one by infogulch are analogous.

In addition, Prolog is still being improved with regard to the aforementioned shortcomings. Prolog developers continue to work on bringing the language more in line with the ideas that render it beautiful and epistemically revelatory.

Conclusion

My former post was intended as the best case I can make for Prolog within a blog post. I had an imagined audience, but I was writing the post to reflect on what might justify my own interaction with Prolog, including offering supervisions for a course. While my main aim was to offer a justification independently of its uptake by anyone else, I won’t deny that I derive great satisfaction from positive responses, such as simongray writing

This inspired me. What’s the best book for modern prolog?

I am grateful to everyone who gave my blog post a chance.

Footnotes

[0] I am also writing this in the context of a limited departmental curriculum, where students cannot just choose any course, but I won’t go into that here.

[1] See https://www.metalevel.at/sudoku/ and https://www.metalevel.at/prolog/puzzles.

Why Learn Prolog in 2021?

Tue, 05 Jan 2021 21:37:13 +0000

?- learn(prolog).

Why should one learn Prolog in 2021? I had better have an answer to this question, because I will soon offer supervisions for a Prolog course. While I’m a personal admirer of this unusual programming language, students might rightfully demand a justification that goes beyond my preferences. Prolog certainly isn’t the most glamorous programming language to learn in 2021. Despite its lack of popularity, there are good reasons to learn Prolog and in the following, I’ll explore three of them.

The Sheer Intellectual Beauty

Perhaps my background in philosophy helps explain my fondness for Prolog. Not only is first-order predicate logic taught to virtually all philosophy students as a tool for thought, but it also forms the foundation of Prolog’s logic-programming paradigm. Philosophers aim for a logical description of the world and Prolog goes beyond this ambition by allowing us to manipulate reality via a logical description. We solve problems by writing Horn clauses, and a Horn-clause is a logical formula that simplifies resolution. Logical formulas are the tool of problem solving. Once grasped, the idea of logically describing a problem and then having the computer solve it is almost irresistible.

Of course, occasionally Prolog falls short of the programming-by-description-paradigm. There are cases where Prolog mixes logic and control instead of keeping them apart.[0] Nonetheless, logic programming, the paradigm of which Prolog is the primary example, comes with its own intellectual appeal. From the perspective of intellectual aesthetics, good Prolog code is a sublime experience (erhabene Erfahrung).[1] Such Prolog code reveals the overwhelming power of logical description and the force of a capacity – the capacity to describe the world in logical terms and thereby solve problems – that resides in all of us.

In sum, Prolog code has a timeless beauty to it – a claim that I believe is more commonly associated with the S-expressions of LISP – and is therefore worth learning. I am aware that an appeal to beauty has its limits, but the aesthetic properties of a programming language should not be entirely discounted. Our sense for intellectual beauty is an important tool for creation and it needs to be trained. If one understands what makes an approach beautiful, it becomes easier to create beautiful code and to resist the lure of beauty when it distracts from practical concerns. Learning Prolog is a way to tame the power of beautiful code.

?- beautiful(prolog).
true

A Different Perspective on Classical Issues

Recursion, list manipulations, and graph-hopping are standard topics of foundational computer science and Prolog addresses them with a twist.[2] Prolog offers a different perspective on classical issues of computer science, usually right away from the first lessons. As a result, Prolog has a relatively steep learning curve, but the different perspective can also be revelatory. One learns to describe classical problems in the format of Prolog Horn-clauses and thereby solve them, which can lead to a unique way of understanding them – especially, once one has learned to write idiomatic Prolog.

Prolog is not only beautiful, but it also reveals another aspect of the core issues of computer science to which it is applied. Occasionally, the aspect Prolog reveals is also the aspect that needs to be seen for solving a problem. Some problems call for Prolog. Having learned Prolog will allow one to address them beautifully and efficiently. To be honest, at the moment such problems are too rare to justify learning Prolog. But I don’t believe that this has to remain so. As my last argument in favour of learning Prolog, I will suggest that it has unfulfilled potential.

?- unique_perspective(prolog).
true

Unfulfilled Potential

As a student of computer science, one can make a decent career by always following the hype, but to stand out one has to diverge from the well-trodden paths. Those willing to explore unpopular territory have a chance of being ahead of the crowd. In 2021, Prolog is such unpopular territory. In my field of NLP, one might instead opt to learn more about neural networks and especially the Transformer architectures such as BERT. Learning about these topics is certainly advisable for a career in NLP, but it won’t make one stand out.

Prolog is unpopular and, more importantly, I believe that it has not fulfilled its potential so far. The logic-programming paradigm with its separation between logic and control is powerful. Yet it does not find much use in current applications. This unpopularity despite power might deter a student from learning Prolog – perhaps logic-programming has faults which keep it from being successful – but it is also an opportunity. One can make the bet that more will come of Prolog or a language similar to it.[3] If the bet is successful, one will be ahead of the hype.

Such bets on unpopular options are risky. It is a high-reward bet because of the limited chances of success. That being said, I would advise making a few such bets in the course of one’s life. Even if nothing comes of them, they render life more interesting and help to show individual character. Perhaps one shouldn’t go all in on such a bet, but this consideration should justify the few hours of a Prolog course, when one gets academic credits in addition to being able to assess the unfulfilled potential of Prolog better.

?- potential(prolog,Y), unfulfilled(Y).
true

Conclusion

Currently, Prolog does not belong to the most popular programming languages. Its logic programming paradigm makes it an outsider. Nonetheless, I’ve argued that there are good reasons to learn Prolog. The language is beautiful, it offers a different perspective on classic computer science issues, and it has unfulfilled potential. Whether you are motivated by aesthetic, academic, or career considerations, you have a reason to learn Prolog in 2021.

learn(X) :- beautiful(X), unique_perspective(X), potential(X,Y), unfulfilled(Y).

?- learn(prolog).
true

UPDATE This blog post made its way to the frontpage of Hacker News where it received a sizeable number of comments. In response, I wrote a follow-up post.

Footnotes

[0] I’m referencing here Robert Kowalski’s formulation of Algorithm = Logic + Control.

[1] I hope Kantians can forgive me for treating the sublime (das Erhabene) as a type of beauty, neglecting Kant’s distinction.

[2] For an example, have a look at the quicksort implementation on The Power of Prolog.

[3] I’m not the only one who has hopes for Prolog’s future.

End of Year Post: 2020

Thu, 31 Dec 2020 21:00:13 +0000

What has 2020 brought? In this post I want to offer a selective reflection on my research career and its developments in 2020. I will take a perspective that is at the same time deeply personal and highly abstract. From my personal heights, I’ll gesture at the turns I’ve taken this year, note a few outcomes, and point towards my future commitments.

Reorientations and Pivots

My 2020 was characterised by multiple research reorientations and pivots. Many of these pivots I made privately, barely discussing them with my closest friends. These changes resulted from my assessment of my career trajectory and the various research fields into which I have dipped my toes. I don’t want to go into too much detail, but I intend to work less on social ontology and more on natural language processing (NLP). Generally, the path I chose is indicated by my most recent blog posts, which focus on the semantic dimensions of natural language, including lexical semantics. That choice, however, came after much trepidation and pivoting and in the last quarter of this year.

Why the need for a pivot? All too often academia ends up tying intelligent people to a misguided path of research. Various aspects of academic institutions, such as the need to develop a publication record in a narrow subfield, incentivise researchers to stick with their research projects even when emerging evidence weighs against them. While sometimes sticking to one path pays off in the long run – the researchers who stuck to neural networks during the various AI winters are a prime example of success – in many cases it misallocates human talent. Our human capacities are not maximising the scientific progress of humanity, or much else of value.

The possibility of wasting my limited capacities on fruitless research endeavours frightens me greatly. Hence, I have a history of abandoning research directions with which I have become disillusioned. A few years ago, before the final stages of my PhD, I was working in the history of philosophy and specifically on German Idealism, but by now that seems far removed from my research interests. While I can still derive joy from picking up a book by Hegel and perusing it, I cannot see myself dedicating my life to it. Hegel does not appear in my philosophy PhD thesis and after finishing my thesis, I completed an MPhil in advanced computer science, moving into NLP.

The opposite worry to wasting my capacities on fruitless endeavours is that my endless pivoting will not lead to any lasting scientific contributions either. Scientific progress relies on risky up-front investments and other than sheer luck, there is no way around that fact. Given the advanced state of most scientific fields, researchers have to delve deep into a field to contribute. Accordingly, I also fear the prospect of my research career flailing endlessly. That being said, I hope that my decisions in the later part of 2020 put me on a promising research path. Directly or indirectly, my future blog posts will reveal whether my hope is misplaced.

Wrapping up Projects

While I kept pivoting between different research interests of mine, I also wrapped up some projects. Academically, these wrapping up events realised themselves as publications. I have published in the Canadian Journal of Philosophy and Synthese, both of which are fairly prestigious philosophy journals. Another philosophy paper has been accepted for publication and should appear in the next few months. These three papers are exploratory stepping stones in my research career. Although some of their insights will inform my future inquiries, I will abandon much of them. It would be a great joy to me if someone else picked up the abandoned pieces and developed them into more than I have been able to. If you have any interest in that, feel free to drop me an email.

Perhaps I should do a better job of advertising and selling these papers – and since I put considerable time and effort into them, I hope that they are of value – but in this post I am trying to reflect on the overall development of my academic career, and I doubt that these papers will be the most remarkable ones of my career. In fact, I would be rather disappointed in myself if they turned out to form the pinnacle of my research. My ambitions have not been realised yet.

Forward into 2021

I go into 2021 with a renewed sense of commitment to furthering the scientific progress of humanity within the bounds of my limited capacities and interests. Over the last 10 years, I learned, read, and wrote without excessive regard for disciplinary boundaries. Towards the end of my PhD, I started to question my research trajectories – not that I was ever certain about them – and I explored how I might live up to my commitment of furthering scientific progress. As a result, I expanded into computer science in 2018, but I avoided decisions about my career until they became more pressing over the course of this year. In 2021, I hope to build upon the restructured foundations of my research career and start living up to my commitment. Maybe I will read some more for it in the last few hours of this year.

For the scientific progress of humanity!

Simulating Basic Logic with Tensors

Fri, 18 Dec 2020 13:00:13 +0000

Can we simulate basic logic operations, i.e. the operations of first-order predicate logic, using tensors? In his 2013 paper “Towards a Formal Distributional Semantics: Simulating Logical Calculi with Tensors”, Edward Grefenstette made some suggestions for such simulation. The paper’s motivation was to take a step towards combining distributional with formal semantics. I’ve explored this paper in a Jupyter Notebook, which I put on github.

At the moment, the github notebook viewer breaks some of the LaTeX formulas, but you can see it in the Jupyter notebook viewer here.

Exploring Basic Distributional Representations

Sat, 28 Nov 2020 19:50:13 +0000

I’ve recently been reading up on distributional representations, that is, representations of meaning that are based on count vectors. They were the exciting technology before neural networks, and the embeddings such networks create, changed the field of NLP. Nowadays we do not count token occurrences, but let Word2Vec or BERT models create representations.

While they have decidedly fallen out of favour, distributional representations are clever pieces of technology and I wanted to get some more experience with them. So I’ve put together a Jupyter Notebook that explores key aspects of that technology:

Creating a count matrix
Calculating Pointwise Mutual Information
Calculating similarity scores
Reducing the dimensionality of the representations
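The first three steps in this list can be sketched in a few lines of plain Python. This is not the notebook's code, only a minimal illustration under assumptions of my own choosing: a toy corpus, a symmetric context window, positive PMI (negative scores clipped to zero), and cosine as the similarity score. The function names are hypothetical, and dimensionality reduction is left out.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (target, context) pairs within a symmetric window."""
    counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(target, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Positive pointwise mutual information for each observed pair."""
    total = sum(counts.values())
    row, col = Counter(), Counter()
    for (word, context), n in counts.items():
        row[word] += n
        col[context] += n
    scores = {}
    for (word, context), n in counts.items():
        pmi = math.log2(n * total / (row[word] * col[context]))
        scores[(word, context)] = max(pmi, 0.0)
    return scores


def cosine(scores, w1, w2):
    """Cosine similarity between the sparse PPMI vectors of two words."""
    v1 = {c: s for (w, c), s in scores.items() if w == w1}
    v2 = {c: s for (w, c), s in scores.items() if w == w2}
    dot = sum(s * v2.get(c, 0.0) for c, s in v1.items())
    n1 = math.sqrt(sum(s * s for s in v1.values()))
    n2 = math.sqrt(sum(s * s for s in v2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


# Toy corpus, made up for illustration.
corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "meows"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "sleeps"],
]
counts = cooccurrence_counts(corpus)
scores = ppmi(counts)
similarity = cosine(scores, "dog", "cat")
```

On this toy corpus "dog" and "cat" share the contexts "the" and "sleeps", so their similarity is positive but well below 1.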

You can see the notebook on github

Of course, my notebook is merely an introduction to some of the most basic techniques. For example, I do not explore incorporating syntactic information. Still, I hope it shows that these by now largely neglected techniques are a fascinating application of statistical NLP.

Conceptual Grain

Mon, 16 Nov 2020 19:47:13 +0000

In this blog post I share some preliminary musings on conceptual grain – how fine-grained concepts such as DOG and MAMMAL are – which have arisen from my work in NLP and specifically word sense disambiguation. The upshot is that we can develop multiple metrics of conceptual grain and that we have to address the question of what we want these metrics to do for us.

The classic task of word sense disambiguation (WSD) seeks to assign senses to word tokens in contexts. When I give you a tip, I might give you either advice or money for your service. A WSD system should assign the correct sense, but assigning a sense to “tip” relies on a repository of senses from which a WSD system can draw.

WordNet is the dominant sense repository in automatic word sense disambiguation (cf. Fellbaum 1998), but its shortcomings have been known for a long time. One of these shortcomings is an exceedingly fine grain (cf. Ide & Wilks 2007; Navigli 2009). The concepts are too finely distinguished for current technology to perform well and arguably even for human annotators. WordNet offers 33 senses for the token “head”, so there is a good chance that some of them get confused some of the time.

Despite the common complaint, the notion of grain on which it turns has remained rather unspecific. On the simplest interpretation, the grain of a sense repository such as WordNet is just the number of senses in it. There are just too many senses in WordNet! While fewer sense labels certainly would make it easier to create a classifier for WordNet, we might also look for a notion of grain with a little more theoretical heft. If we have our concepts organised with semantic relations, can we then describe grain in terms of such a linguistically founded organisation of concepts?

Consider the concepts[0] of DOG and MAMMAL. You might propose that since dogs are a kind of mammal, the concept DOG is more fine-grained. In more linguistic terms, the hyponymy hierarchy of concepts provides a partial ordering of grain. A hyponym (DOG) is more fine-grained than its hypernym (MAMMAL).[1] Or to be a bit more formal, assume we have a tree of hyponyms and hypernyms, i.e. a taxonomy tree[2], then the depth at which a concept can be found in this tree could be considered its grain. Hence, we can define an order of grain using a function depth(), i.e. grain(DOG) ≥ grain(MAMMAL) if and only if depth(DOG) ≥ depth(MAMMAL).
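As a minimal sketch of the depth-based ordering, assume a toy taxonomy stored as a mapping from each concept to its hypernym. The taxonomy, the dictionary `HYPERNYM`, and the helper `at_least_as_fine` are invented for illustration and are not taken from WordNet.

```python
# Hypothetical toy taxonomy: each concept maps to its direct hypernym.
# The root (ENTITY) has no entry.
HYPERNYM = {
    "ANIMAL": "ENTITY",
    "MAMMAL": "ANIMAL",
    "DOG": "MAMMAL",
    "CAT": "MAMMAL",
    "POODLE": "DOG",
}


def depth(concept):
    """Number of hypernym steps from the concept up to the root."""
    steps = 0
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        steps += 1
    return steps


def at_least_as_fine(a, b):
    """grain(a) >= grain(b) if and only if depth(a) >= depth(b)."""
    return depth(a) >= depth(b)
```

Note that depth also compares concepts that do not stand in the hyponymy relation, such as POODLE and CAT, so the resulting ordering covers more pairs than the hyponymy hierarchy itself does.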

But there are other ways to specify the notion of grain. Consider again the example of “head” and its 33 senses. Prima facie, the problem here is not that the 33 senses are deep down in the hyponymy tree. The problem is that there are just too many senses that are closely related. Once again, we can approximate the intuition with features of the taxonomy tree. Specifically, the branching factor of nodes in the tree provides an indication of how many closely related concepts there are.[3] In other words, it would hold that grain(DOG) ≥ grain(MAMMAL) if and only if hyper-branching-factor(DOG) ≥ hyper-branching-factor(MAMMAL), where hyper-branching-factor() returns the branching factor for the closest hypernym.[4] The assumption is that if the senses of “head” are really too close, they will be child nodes of densely populated hypernym nodes in the taxonomy tree.
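The branching-factor metric can be sketched in the same style. The toy taxonomy below is again invented, and returning 0 for the root, which has no hypernym, is an assumption of this sketch rather than part of the proposal.

```python
# Hypothetical toy taxonomy: each concept maps to its direct hypernym.
HYPERNYM = {
    "ANIMAL": "ENTITY",
    "MAMMAL": "ANIMAL",
    "BIRD": "ANIMAL",
    "DOG": "MAMMAL",
    "CAT": "MAMMAL",
    "HORSE": "MAMMAL",
}


def branching_factor(node):
    """Number of direct hyponyms (children) of a node."""
    return sum(1 for parent in HYPERNYM.values() if parent == node)


def hyper_branching_factor(concept):
    """Branching factor of the concept's closest hypernym.

    Assumption: the root has no hypernym, so we return 0 for it.
    """
    if concept not in HYPERNYM:
        return 0
    return branching_factor(HYPERNYM[concept])
```

Here DOG sits under MAMMAL alongside CAT and HORSE, while MAMMAL sits under ANIMAL alongside BIRD only, so DOG comes out as at least as fine-grained as MAMMAL on this metric.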

So far, I considered the taxonomic tree to be constant and then pointed at features of it – depth and branching factor – to suggest metrics of conceptual grain. Instead one could postulate an ordering of increasingly detailed taxonomic trees. Assume you create a taxonomic ontology and you add batches of nodes to it in a natural way. Then the stages of your taxonomic tree will each have a certain grain. At the beginning you will have a very coarse-grained taxonomy and with each step it will be finer-grained. You can now have a function introduction-to-tree() which returns the number of the stage at which a certain concept was introduced. Then, grain(DOG) ≥ grain(MAMMAL) if and only if introduction-to-tree(DOG) ≥ introduction-to-tree(MAMMAL).
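A minimal sketch of introduction-to-tree(), assuming the stages are simply given as an ordered list of batches. The batches below are made up, and deciding what counts as a "natural" order is exactly the part this sketch does not address.

```python
# Hypothetical batches of concepts, in the order they are added to the tree.
BATCHES = [
    ["ENTITY", "ANIMAL"],  # stage 0: coarsest concepts
    ["MAMMAL", "BIRD"],    # stage 1
    ["DOG", "CAT"],        # stage 2: finest concepts so far
]


def introduction_to_tree(concept):
    """Return the number of the stage at which the concept was introduced."""
    for stage, batch in enumerate(BATCHES):
        if concept in batch:
            return stage
    raise KeyError(concept)
```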

Admittedly, this measure has a problem, namely the need for an ordering of node batches in a “natural way” of adding them. Many of the subtleties of conceptual grain are hiding there. It won’t do to just record the steps in which nodes were added to WordNet or any other ontology, since chance and history will not follow such a natural order. A concept might be added later to the tree, for many reasons that would not support an inference about its grain – maybe people were too focused on some other topic domain and forgot about the more common and coarser-grained concept.

All of these metrics have their positive and negative sides, depending on what we want to use them for. The use cases provide criteria for evaluation. One of the original reasons for introducing a notion of grain was to allow us to complain about WordNet as being too fine-grained for word sense disambiguation. It has too many fine-grained concepts or it has concepts with too high an introduction-to-tree factor for our classifiers. Hence, I propose this first criterion:[5]

Conceptual grain correlates with difficulty in addressing WSD as a classification task.

In addition to this NLP-driven criterion, we can also use some linguistic intuitions about grain – in both senses of linguistic – to evaluate the metrics. Such criteria serve the purpose of ensuring that the metric can support linguistic theorizing.

More fine-grained concepts should be (pragmatically?) exchangeable in more linguistic contexts than coarser concepts.
The grain of a concept should correlate with the length of a naïve definition we might give for it.

Further criteria could be proposed to ensure the integration of the metric in other disciplines, e.g. cognitive science. From the question of what grain is, we are driven to the issue of what we want the notion and its metrics to do for us.

Footnotes

[0] I use “sense” and “concept” interchangeably in this post.

[1] I assume that the hyponymy relation holds between concepts, not words. Otherwise I am not sure how to handle polysemy.

[2] Maybe concepts connected by hyponymy edges form a directed graph and not a tree, but let’s not get bogged down in that for now.

[3] This measure ignores the proximity between hypernyms and hyponyms, but the next one arguably captures it.

[4] A generalization would be to take the average branching factor of a set of hyponyms.

[5] A 0th implicit criterion was that a metric of conceptual grain should provide at least a partial order over our concepts.

References

Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. MIT Press.
Ide, N., & Wilks, Y. (2007). Making Sense About Sense. In E. Agirre & P. Edmonds (Eds.), Word Sense Disambiguation: Algorithms and Applications (pp. 47–73). Springer Netherlands. https://doi.org/10.1007/978-1-4020-4809-8_3
Navigli, R. (2009). Word Sense Disambiguation: A Survey. ACM Computing Surveys, 41(2), 1–69. https://doi.org/10.1145/1459352.1459355

New Paper (CJP): Social-Computation-Supporting Kinds

Tue, 11 Aug 2020 20:47:13 +0100

The Canadian Journal of Philosophy has published my paper on what I call “Social-Computation-Supporting Kinds”. This paper is a first attempt to re-describe the role of computation in social ontology. I argue – in a move I would self-servingly love to call “bold” – that there is a kind of social kinds which is distinguished by supporting social computations, that is groups implementing computational processes.

I want to stress that it is a first attempt and will leave many questions open. Its value lies, hopefully, in doing something different in the social kinds debate and sketching the value this new approach will have. I expect to publish more on this approach.

The paper is once again published Open Access, thanks to the University of Cambridge.

I also presented the gist of the paper at the recent online Social Ontology conference. My video presentation from this conference is still available for those who don’t want to read the paper.

Presentation at the 2020 Social Ontology Conference

Tue, 07 Jul 2020 11:57:13 +0100

I am presenting at the 2020 Social Ontology Conference and because it is virtual, you can all watch it online. The conference website provides videos of all talks.

In my talk, I discuss the notion of social-computation-supporting kinds. A longer paper exploring the idea will hopefully be published soon. The main advantage of the video is the excellent pixel art.

There will also be Q&A sessions, but they haven’t been scheduled yet.

More Social Ontology Highlights

Sat, 04 Jan 2020 13:57:13 +0000

I’ve recently posted a short list of social ontology highlights from 2019, but Kirk Ludwig sent me a much more extensive list. I present it here in rearranged form.

Publications

Tony Lawson: The Nature of Social Reality: Issues in Social Ontology (Economics as Social Theory)
Angela Condello, Maurizio Ferraris, John Rogers Searle: Money, Social Ontology and Law
Trish Reay, Tammar B. Zilber, Ann Langley, and Haridimos Tsoukas (eds.): Institutions and Organizations. A Process View
Holly Lawford-Smith: Not In Their Name. Are Citizens Culpable For Their States’ Actions?
Luka Burazin, Kenneth Einar Himma, and Corrado Roversi (eds.): Law as an Artifact
J. Adam Carter, Andy Clark, Jesper Kallestrup, S. Orestis Palermos, and Duncan Pritchard (eds.): Socially Extended Epistemology

SEP Revisions

The following three articles in the SEP received substantial revisions:

Events

On top of Kirk’s list was the Social Ontology/ENSO conference in Tampere, which I already had on my list. Otherwise there were three workshops/conferences at Vienna on:

Social Agency, Group Agency & Relational Normativity
Group Agency and Collective Responsibility
Shared Agency, Rationality, Normativity (with Michael Bratman and David Velleman)

There were two events in Milan:

Adelaide de Lastic “The Political Dimension of an Enterprise’s Collective Agency”, Thursday 28 March 2019
A Pragmatist take on Social Ontology: Habits, Social Practices, Statuses

Other events:

The 2019 Pacific APA meeting had a symposium on Kirk’s second book, From Plural to Institutional Agency: Collective Action 2. (With Maria Jankovic and Carol Rovane as commentators)
Workshop on Social Ontology, Normativity, and Philosophy of Law, Glasgow University Law School, May 30-31
European Network for Philosophy of the Social Sciences (ENPOSS) in Athens
The PPE Society meeting in spring 2019 also featured some social ontology papers

Near Future

Kirk also had some forthcoming publications on his list:

Anika Fiebich’s edited book Minimal Cooperation and Shared Agency has slid into 2020:
An issue of Language and Communication is coming out on group speech acts but the papers are already available online
Saba Bazargan-Forward, Deborah Tollefsen (eds.): The Routledge Handbook of Collective Responsibility

That is quite a list and I have to admit that I was not aware of much that Kirk found. I hope others can benefit from it as well.

Best of Social Ontology 2019

Mon, 23 Dec 2019 18:57:13 +0000

Social ontology, by which I mean a subfield of contemporary analytic philosophy, is a comparatively small enterprise so far. That makes gathering a best-of-2019 list difficult. There just aren’t that many great papers coming out each year, or other notable events. Here are five highlights I could find. Feel free to send me an email and suggest other contributions to the field. I might update this entry later.

Brian Epstein’s paper “What are social groups? Their metaphysics and how to classify them” has been available as forthcoming for a while, but the official publication date is 2019, which hopefully justifies including it in this list.[0]

The International Social Ontology Society has started a YouTube channel this year and published keynotes from last year’s conference.

The Social Ontology conference in Tampere, organised by the European Network of Social Ontology. I believe the keynotes have been recorded, so there is hope that they will appear on the ISOS YouTube channel at some point.

Finally, there has been an issue of The Monist on the topic of Collective Responsibility and Social Ontology.

BONUS: Arto Laitinen has told me that a book symposium on Ásta’s Categories We Live By will soon appear in the Journal of Social Ontology (and dated as being from 2019).

Footnotes

[0] Two other publications on the ontology of groups deserve honourable mentions, even though they appeared in 2018 (as forthcoming without a date yet in the first case). The first is Katherine Ritchie’s “Social Structures and the Ontology of Social Groups” and the second is Gabriel Uzquiano’s “Groups: Toward a Theory of Plural Embodiment”.

A Different Map of the Tractatus

Mon, 02 Sep 2019 18:57:13 +0100

Over the years there have been a number of visualisations of Wittgenstein’s Tractatus Logico-Philosophicus. Most of them have made use of the tree structure Wittgenstein imposed on his text. With today’s web-technologies, these representations of the text can be excellent. In this post, however, I present a map of the Tractatus unlike any of these previous experiments.

The picture shows me playing with an interactive representation of all statements in the Tractatus, each represented by an embedding. Embeddings are vector-representations of meaning. Usually they are created on the level of tokens, but there are ways of aggregating them to higher levels. I took the relatively easy path of averaging the embeddings for all the tokens in the statements.[0] The result should be a map of how strongly the statements are semantically related.[1] The closer two vectors are, the closer the statements are in their meaning, at least that is the idea.
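The averaging step can be illustrated with a minimal sketch. The three-dimensional token vectors below are made up, whereas real BERT embeddings have hundreds of dimensions and would come from a pretrained model. Cosine similarity is used here as one possible measure of closeness, which is an assumption of the sketch, and the function names are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one statement vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up token embeddings for two short statements.
statement_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
statement_b = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

vec_a = mean_pool(statement_a)
vec_b = mean_pool(statement_b)
closeness = cosine(vec_a, vec_b)
```

Averaging discards word order and gives every token the same weight, which is part of why the resulting map is only an approximation.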

There are a variety of ways to create embeddings, typically making use of artificial neural networks. The Word2vec library made embeddings popular, but I wanted to explore something more cutting-edge for this visualisation. So I used a pretrained-BERT model to create the vectors. BERT is based on the now fashionable transformer networks (see here for a technical explanation).

The embeddings are just vectors. To make them visually accessible, I use the online projector tool. For this purpose, the hundreds of dimensions of the embeddings are reduced to three. Information included in the embeddings is lost in this process. Hence, what you see is only an approximation of what the embeddings capture.

In contrast to a visualisation using the tree structure created by Wittgenstein, this approach can reveal something we haven’t been aware of. It can suggest connections no one has noticed before. I am not sure it does, but that it has the potential is exhilarating.

The code is available on github, including the embeddings in the TSV format needed for the projector tool.[2] Just go to the website, upload the two TSV-files and you can explore the Tractatus in 3D.

[0] It is actually a bit trickier than that, because I use information from multiple layers in the neural network to create the token-embeddings.

[1] While embeddings capture some aspect of the semantic content of a token, they do not represent it entirely faithfully. As so much in machine learning, they are best seen as an approximation that works for certain purposes.

[2] I avoided putting the text of the Tractatus online, since I am not sure what the copyright situation is. If you want it, email me.

Upcoming Talk (August 2019)

Thu, 01 Aug 2019 16:57:13 +0100

For the fourth year in a row, I will present a paper at a social ontology conference this summer. After the last one in Boston, I thought it would be time to do something more ambitious. While my previous papers went well enough and led to two publications, they made relatively narrow arguments. This year in Tampere my claims will be much bolder. I do not want to give too much away, but I will propose a sweeping change to how we explain the social and what makes it special from a metaphysical perspective. What makes the social interesting should be fundamentally reconceived.

You should not miss this momentous event. If you do, you can email me at davidstrohmaier92@gmail.com to get an early draft of my paper.

Parsing Hegel

Thu, 23 May 2019 14:57:13 +0100

In another life I read a lot of Hegel, now a mere side-interest of mine. Despite the assurances of my former supervisor Bob Stern to the contrary, Georg Wilhelm Friedrich Hegel’s work is infamously opaque. Making sense of his Phenomenology of Spirit poses a considerable challenge, and those who claim to understand him often end up with rather different readings.

In my current life, I am finishing up an MPhil in Advanced Computer Science. My project is in the area of computational semantics, where we seek to make sense of expressions in natural language by automatically producing formal representations of their meaning. For this purpose, I am using the Boxer-parser, which uses Discourse Representation Theory (DRT).[0] DRT offers a fancy formalism for capturing action-sentences using a neo-Davidsonian event semantics. One benefit of this theory is that it allows us to represent the meaning in neat little boxes, hence the name of the parser. The boxes specify a number of variables at the top and then contain conditions in the form of predicates below.
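The box layout just described can be mimicked with a small data structure. The sketch below is hand-written for illustration and is not Boxer's output or its API: the class name, the rendering, and the role labels Agent and Theme are assumptions. It shows one plausible neo-Davidsonian reading of the toy sentence “The dog chases the car”, with the event as a variable of its own.

```python
from dataclasses import dataclass


@dataclass
class DRS:
    """A toy discourse representation structure: variables plus conditions."""
    referents: list
    conditions: list

    def render(self):
        """Draw the structure as a box: variables on top, conditions below."""
        header = " ".join(self.referents)
        width = max([len(header)] + [len(c) for c in self.conditions])
        border = "+" + "-" * (width + 2) + "+"
        rows = [border, "| " + header.ljust(width) + " |", border]
        rows += ["| " + c.ljust(width) + " |" for c in self.conditions]
        rows.append(border)
        return "\n".join(rows)


# Hand-built example: the chasing event e1 links the dog x1 and the car x2.
dog_chases_car = DRS(
    referents=["x1", "x2", "e1"],
    conditions=[
        "dog(x1)",
        "car(x2)",
        "chase(e1)",
        "Agent(e1, x1)",
        "Theme(e1, x2)",
    ],
)

print(dog_chases_car.render())
```

The neo-Davidsonian part is that the verb contributes a predicate over the event variable, and the participants are attached to that event through separate role conditions rather than as arguments of the verb.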

If computational semantics enables us to make sense of natural language, then why not use it to make Hegel approachable? Why not run Boxer on the Phenomenology? I can think of very good reasons to resist the idea for the whole book, but not a single one of them kept me from giving it a try with a few sentences. So I just went ahead and adapted a tiny sliver of what I have learned during my MPhil to turn the first sentences of the Phenomenology into a formal representation.

The challenge should not be underestimated. The first two sentences read as follows:[1]

“It is customary to preface a work with an explanation of the author’s aim, why he wrote the book, and the relationship in which he believes it to stand to other earlier or contemporary treatises on the same subject. In the case of a philosophical work, however, such an explanation seems not only superfluous but, in view of the nature of the subject-matter, even inappropriate and misleading.”

This is not exactly “The dog chases the car”, an example much more adapted to the powers of Boxer. But I have to admit that Boxer surprised me. It managed to produce a representation of the first two sentences:[2]

Despite the intuitive character of the boxes, it is not exactly easy to make sense of the jumble. Boxer seems to have produced less than complete parses, hence the repetition of certain elements (e.g. “contemporary treatise”), but I am honestly impressed that I got anything at all. In fact, Boxer did not present a parse when offered the third sentence:

“For whatever might appropriately be said about philosophy in a preface - say a historical statement of the main drift and the point of view, the general content and results, a string of random assertions and assurances about truth - none of this can be accepted as the way in which to expound philosophical truth.”

Failing on such Germanic verbosity is nothing of which Boxer has to be ashamed. It ends, however, the hopes of rendering Hegel intelligible with the current technology.[3] If you generously fund me for four to five years, I will try to produce such representations for the whole of the Phenomenology. The decision whether that is a worthy investment of your money is up to you.

You can find the code I used in a public GitHub repository, but you need to install the C&C parser as well as Boxer for it to work, which is a challenge in its own right.

[0] Kamp, Hans, and Uwe Reyle. From Discourse to Logic: Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Studies in Linguistics and Philosophy 42. Dordrecht: Springer-Science+Business Media, B.V, 1993.

[1] I am using the Miller translation.

[2] The parse neglects a few niceties such as representing the word-senses as WordNet synsets and the like, but that is not the problem.

[3] As a sidenote, let me suggest that Hegel’s Phenomenology in fact works better with the neo-Davidsonian approach of Boxer than other philosophy texts, because it tries to describe the actions and experiences of spirit. What it describes is closer to action than what we find in most philosophy books.

Two New Publications

Tue, 16 Apr 2019 16:57:13 +0100

Two of my publications are finally out. Both of them are related to my PhD research into social ontology and both investigate groups. The first one discusses group membership and argues that reducing it to mereological parthood plus further conditions is a viable option. The paper has an unusual history. Originally, I wrote another paper that argued for the opposite conclusion, that is, I tried to establish that all mereological accounts of groups fail. However, Katherine Hawley published a paper in the debate in 2018 and after reading it I decided that she was right, that we cannot take mereological accounts off the map at this point. So instead of publishing my first paper, I wrote a new one, filling a significant gap in the mereological account. You can find out everything in this crisp little piece.

My second paper undertakes a more ambitious project. It defends the conclusion that current interpretivist accounts of group agency fail, and have to fail, while functionalist accounts have a better shot. Like the first one, this second paper draws on the ontology of groups. Coinciding groups, that is, groups which share all their members at all times, pose a special problem for interpretivist accounts, or so I argue.

I am proud to say that both papers have been published open access. As long as you have internet access, you can read them.