Speech Science Forum - Catherine Lai (University of Edinburgh)
28 October 2021, 4:00 pm–5:30 pm
Text-to-Speech evaluation in context
This event is free.
Event Information
Open to
- All
Cost
- Free
Organiser
- Justin Lo
Abstract:
Recent advances in Text-to-Speech (TTS) systems have resulted in dramatic improvements in the naturalness of synthetic speech, where naturalness is generally understood in terms of similarity of synthesised speech to human speech. However, we know that what sounds ‘natural’ changes for different contexts. In particular, prosodic patterns that sound appropriate in one context may sound inappropriate in others. This talk will discuss some recent work investigating how prosodic variation in synthesised speech is perceived, and what this means for current TTS evaluation methods. In listening tests, we see that neural TTS systems which receive high naturalness ratings for isolated sentence generation can be perceived as over-generating variation when evaluated in context. We find that perceived prosodic errors cluster around points where discourse-driven expectations are stronger, e.g., due to question-answer coherence. Our findings support the idea that to evaluate TTS systems, especially in terms of prosody, we need to better understand what expectations are being generated by the context. So, if we want to make more progress in TTS, it’s time to go beyond isolated audiobook sentences as the default evaluation!
About the Speaker
Catherine Lai
at University of Edinburgh