UCL Psychology and Language Sciences


Speech Science Forum - Catherine Lai

21 March 2024, 4:00 pm–5:00 pm


Watch your tone! Exploring prosodic variation with(in) speech and language technologies

Event Information

Rana Abu-Zhaya


School of Pharmacy, 225
29-39 Brunswick Square

Recent advances in machine learning have made an undeniable impact on the field of speech technology as we’ve long known it. These advances have also led to some rather bold claims, e.g., that speech-to-text (aka Automatic Speech Recognition, ASR) and text-to-speech synthesis (TTS) are solved! What these sorts of claims often miss is that the traditional objectives of ASR and TTS neglect important aspects of spoken communication beyond text. For example, most work on automated spoken language understanding is built on text transcriptions, ignoring the fact that how we speak can change how our words are interpreted. In particular, previous work has shown that speech prosody (e.g., pitch, energy and timing characteristics of speech) can be used to signal speaker intent and affect, as well as to guide dialogue structure. We also know that prosody can be highly contextually variable. So, to make use of prosody in speech technology, we need to be able to model this variability and to understand what it actually does in spoken communication.

In this talk, I will discuss recent work we’ve been doing in Edinburgh exploring prosodic variation in conversational (English) speech corpora using representation learning methods developed for speech generation and recognition. From the generation side, we use neural TTS models to investigate and generate prosodic cues for turn-taking in spontaneous dialogues. From the recognition side, I’ll talk about how we can leverage large language models to investigate how non-lexical aspects of speech can affect expectations in spoken dialogue. I argue that there are a lot of benefits to be had from self-supervised methods for representation learning on speech and text datasets, but we still need linguistic knowledge to actually make use of the true richness of speech.

This event will be hosted online as well: https://ucl.zoom.us/j/97846874460?pwd=bURvQkZVZjNaMkFkZXNySjI2RUplQT09

About the Speaker

Catherine Lai

Lecturer in Speech and Language Technology at the University of Edinburgh

I am a Lecturer in Speech and Language Technology, based in the Centre for Speech Technology Research at the University of Edinburgh, where I work on spoken language understanding, affective computing, and multimodal speech processing. My main interest is speech prosody, e.g., intonation and rhythmic properties of speech, and how varying the way we speak can change our understanding of speaker intent in a dialogue, from both a recognition and a generation perspective. I work on this from both a speech technology/machine learning perspective and a linguistic perspective, drawing on work in semantics, pragmatics and sociolinguistics. Amongst other things, I’ve also worked on speech processing for multimedia archives, evaluation and interaction with speech technologies, and spoken language processing for robot companions.
