David Strohmaier's Website



Proposals

MPhil Projects

These are proposals for MPhil projects with the NLIP group. If you are an MPhil student in the ACS at Cambridge and you are interested in any of these projects, send me an email!

Project 0: Multi-Task Learning for Word Sense Disambiguation

Proposer: David Strohmaier

Supervisors: David Strohmaier and Paula Buttery

Words are often ambiguous. For many applications, for example explaining a text to a language learner, we would like to disambiguate them automatically by assigning representations of word meaning. These meaning representations should be interpretable, ideally also for language learners. To select interpretable meaning representations, NLP researchers make use of sense repositories, such as dictionaries and topic labels. Many such repositories have been created, but researchers have not settled on a single “correct” repository. In fact, there are good reasons to think that there is no such uniquely correct repository.

To overcome the challenge of selecting between sense repositories, you will be combining these repositories in a multi-task learning setup. You will fine-tune a BERT-based classification system to predict the correct meaning of a word based on its context for multiple repositories. You are free to explore different neural architectural choices (within time constraints).
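One way such a multi-task setup might be realised is a shared encoder with one classification head per sense repository. The sketch below uses plain PyTorch with a linear layer standing in for the BERT encoder; the repository names and label counts are illustrative assumptions, not part of the proposal.

```python
import torch
import torch.nn as nn

class MultiTaskWSD(nn.Module):
    """Shared encoder with one classification head per sense repository."""

    def __init__(self, hidden_size, repository_sizes):
        super().__init__()
        # Stand-in for a BERT encoder; in practice this would be e.g.
        # transformers.AutoModel.from_pretrained("bert-base-uncased").
        self.encoder = nn.Linear(hidden_size, hidden_size)
        # One classification head per repository (task).
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_size, n_senses)
            for name, n_senses in repository_sizes.items()
        })

    def forward(self, token_embedding, task):
        shared = torch.relu(self.encoder(token_embedding))
        return self.heads[task](shared)

# Illustrative repositories: a dictionary-style inventory and topic labels.
model = MultiTaskWSD(hidden_size=768,
                     repository_sizes={"dictionary": 500, "topics": 40})
x = torch.randn(2, 768)           # embeddings of two target words in context
logits = model(x, task="topics")  # shape (2, 40)
```

During training, each batch would be routed to the head matching its repository, while gradients through the shared encoder let the tasks inform each other.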

Required Resources:

Relevant Literature


Project 1: Diachronic Lexical Complexity Prediction

Proposer: David Strohmaier

Supervisors: David Strohmaier and Paula Buttery

In education technology we often need to predict the complexity of a word token. Currently, the task of lexical complexity prediction is often approached with pre-trained language models, such as BERT. In this project, you will build such a lexical complexity prediction model and adapt it to a different application: predicting diachronic lexical complexity in learner data.

Over the course of acquiring a new language, learners' grasp of new concepts improves. It stands to reason that a learner's early uses of a content word might be less complex than their later uses, corresponding to the learner's progress. Using your own lexical complexity prediction system, you will be able to track this development and evaluate the hypothesis. More generally, applying complexity prediction to diachronic learner data opens up multiple application-oriented directions of research, which you can explore.
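Lexical complexity prediction is commonly framed as token-level regression on contextual embeddings. The following sketch (plain PyTorch; the head architecture and the [0, 1] complexity scale are illustrative assumptions) shows the shape such a model might take before it is applied to time-stamped learner data.

```python
import torch
import torch.nn as nn

class ComplexityRegressor(nn.Module):
    """Regression head mapping a contextual token embedding to a
    complexity score in [0, 1] (scale chosen for illustration)."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, emb):
        return torch.sigmoid(self.head(emb)).squeeze(-1)

reg = ComplexityRegressor()
emb = torch.randn(5, 768)  # five token embeddings, e.g. from learner essays
scores = reg(emb)          # one complexity score per token
```

For the diachronic application, one would score the same learner's uses of a content word across submission dates and inspect the resulting trajectory.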

Required Resources:

Relevant Literature


Project 2: Probing Language Models for Learner Semantics

Proposer: David Strohmaier

Supervisors: David Strohmaier and Paula Buttery

In recent years, massive language models that are difficult to interpret have become widespread in NLP. To understand their inner workings, we need to probe them with various techniques. This research project will develop your probing skills on an application-oriented task: probing for the difference between native-speaker and language-learner data.

Learner data includes many mistakes, such as spelling errors and grammatical infelicities. How does this affect the processing of the lexical semantic knowledge within the language model? To answer this question, you will use state-of-the-art probing techniques and apply them to BERT (or one of its derivatives). The results will help us to assess the use of such language models in educational technologies.
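A common starting point for such probing is a linear probe: the language model's representations are frozen, and only a small classifier is trained to predict the property of interest, here whether a sentence comes from native-speaker or learner text. The sketch below uses random tensors as stand-ins for frozen BERT features; the dimensions and training settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Linear probe: the encoder's features stay frozen, only the probe trains.
probe = nn.Linear(768, 2)  # class 0: native speaker, class 1: learner
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-ins for frozen BERT sentence representations and gold labels.
features = torch.randn(64, 768)
labels = torch.randint(0, 2, (64,))

for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
```

If the probe achieves high accuracy, the frozen representations encode the native/learner distinction; comparing probes across layers shows where in the model that information resides.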

Required Resources:

Relevant Literature


Project 3: Annotating Dictionary Definitions with Complexity Levels

Proposer: David Strohmaier

Supervisors: David Strohmaier and Paula Buttery

A dictionary that provides complexity levels (typically CEFR-levels) is useful for many tasks in NLP, e.g. readability assessment and text simplification. This project will attempt the automatic annotation of the Cambridge Advanced Learner's Dictionary with CEFR-levels based on existing partial annotation. For example, the definition of “tip” as a small amount of money given to someone who has provided a service should be annotated with the CEFR-level B1.

CEFR-level annotation can be conceived of as a classification task with six labels. Neural networks, especially contemporary transformer architectures such as BERT, are suitable for classification tasks of this type. A preliminary BERT-classifier can be made available to the student. There are, however, special challenges to overcome for the automatic complexity annotation of dictionaries. Dictionary entries have their own format, which diverges from naturally occurring text corpora. Exploiting this format while using a standard architecture is a key part of this project.

In addition, different loss functions can be explored for annotating definitions with CEFR-levels. The loss function should make use of the fact that picking the label B2 instead of B1 is closer to correct than if C2 had been picked.
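One simple ordinal-aware option is to penalise predictions by their expected distance from the gold level, rather than treating all six labels as unrelated. The sketch below (plain PyTorch; one of several possible ordinal losses, not the one prescribed by the project) shows the idea.

```python
import torch
import torch.nn.functional as F

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]  # indices 0..5

def ordinal_loss(logits, target):
    """Expected distance between the predicted level distribution and
    the gold level index, so predicting B2 for gold B1 costs less
    than predicting C2."""
    n = logits.size(-1)
    dist = torch.abs(torch.arange(n).float() - target.float().unsqueeze(-1))
    probs = F.softmax(logits, dim=-1)
    return (probs * dist).sum(dim=-1).mean()

gold = torch.tensor([2])                          # gold label: B1
near = torch.tensor([[0., 0., 0., 5., 0., 0.]])   # confidently predicts B2
far = torch.tensor([[0., 0., 0., 0., 0., 5.]])    # confidently predicts C2
assert ordinal_loss(near, gold) < ordinal_loss(far, gold)
```

In contrast, plain cross-entropy would assign both wrong predictions roughly the same loss, ignoring the ordering of the CEFR scale.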

Required Resources:

Relevant Literature