Assessing the impact of contextual framing on subjective TTS quality

Authors: Jens Edlund, Christina Tånnander, Sébastien Le Maguer, Petra Wagner

Text-To-Speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation-specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced mean opinion score (MOS) evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child’s voice replacement, fiction audiobooks, and long, information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without stating the type of framing that was employed, if any.

Subject: INTERSPEECH.2024 - Speech Synthesis

edlund24@interspeech_2024@ISCA
