Speech Science Forum - Catherine Lai (University of Edinburgh)
Text-to-Speech evaluation in context

Abstract:
Recent advances in Text-to-Speech (TTS) systems have resulted in dramatic improvements in the naturalness of synthetic speech, where naturalness is generally understood in terms of similarity of synthesised speech to human speech. However, we know that what sounds ‘natural’ changes across contexts. In particular, prosodic patterns that sound appropriate in one context may sound inappropriate in others. This talk will discuss some recent work investigating how prosodic variation in synthesised speech is perceived, and what this means for current TTS evaluation methods. In listening tests, we see that neural TTS systems which receive high naturalness ratings for isolated sentence generation can be perceived as over-generating variation when evaluated in context. We find that perceived prosodic errors cluster around points where discourse-driven expectations are stronger, e.g., due to question-answer coherence. Our findings support the idea that to evaluate TTS systems, especially in terms of prosody, we need to better understand what expectations are being generated by the context. So, if we want to make more progress in TTS, it’s time to go beyond isolated audiobook sentences as the default evaluation!