UCL Psychology and Language Sciences


Speech Science Forum - Catherine Lai (University of Edinburgh)

28 October 2021, 4:00 pm–5:30 pm

Text-to-Speech evaluation in context

This event is free.

Event Information

Open to





Justin Lo


Recent advances in Text-to-Speech (TTS) systems have resulted in dramatic improvements in the naturalness of synthetic speech, where naturalness is generally understood as the similarity of synthesised speech to human speech. However, we know that what sounds ‘natural’ changes across contexts. In particular, prosodic patterns that sound appropriate in one context may sound inappropriate in others. This talk will discuss some recent work investigating how prosodic variation in synthesised speech is perceived, and what this means for current TTS evaluation methods. In listening tests, we see that neural TTS systems which receive high naturalness ratings for isolated sentence generation can be perceived as over-generating variation when evaluated in context. We find that perceived prosodic errors cluster around points where discourse-driven expectations are stronger, e.g., due to question-answer coherence. Our findings support the idea that to evaluate TTS systems, especially in terms of prosody, we need to better understand what expectations are being generated by the context. So, if we want to make more progress in TTS, it’s time to go beyond isolated audiobook sentences as the default evaluation!

About the Speaker

Catherine Lai

at University of Edinburgh
