On the Emotion Understanding of Synthesized Speech
AI Summary
Existing speech emotion recognition models struggle to generalize to synthesized speech: synthesis induces a representation mismatch between synthesized and human speech, and SLMs tend to infer emotion from textual semantics rather than paralinguistic cues.
Key Contributions
- Reveals the generalization failure of speech emotion recognition models on synthesized speech
- Shows that existing SER models exploit non-robust shortcuts rather than capturing fundamental features
- Finds that SLMs still struggle to understand emotional information in speech
Methodology
The study systematically evaluates Speech Emotion Recognition (SER) performance on synthesized speech across datasets, across discriminative and generative SER models, and across diverse synthesis models, as sketched below.
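As a rough illustration of this evaluation setup, the following is a minimal sketch that runs a pretrained discriminative SER model on paired human and synthesized recordings and compares the top predictions. The checkpoint `superb/wav2vec2-base-superb-er` and the file paths are illustrative assumptions, not the paper's actual models or data.

```python
# Minimal sketch of the evaluation setup: run a pretrained SER model on
# paired human and synthesized clips and compare the top predictions.
# Assumptions (not from the paper): the checkpoint
# "superb/wav2vec2-base-superb-er" and the .wav paths are illustrative.
from transformers import pipeline

ser = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

pairs = [
    ("human", "human_angry.wav"),      # human recording, acted emotion: angry
    ("synthesized", "tts_angry.wav"),  # same utterance, TTS-generated
]
for source, path in pairs:
    preds = ser(path)  # [{"label": ..., "score": ...}, ...], sorted by score
    print(f"{source:>11}: {preds[0]['label']} ({preds[0]['score']:.2f})")
```

A generalization gap shows up in this setup as the model predicting the acted emotion correctly on the human clip while failing on the synthesized counterpart.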
Original Abstract
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.
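One simple way to probe the text-semantics shortcut described in the abstract is to check whether an audio model's prediction tracks a text-only emotion classifier on the transcript even when the prosody contradicts the words. The sketch below is an illustrative probe of this kind, not the paper's protocol; both checkpoints and the transcript/clip pair are assumptions for demonstration.

```python
# Illustrative probe (not the paper's protocol) for the text-semantics
# shortcut: if the audio model's prediction agrees with a text-only emotion
# classifier on the transcript even when the prosody contradicts the words,
# the prediction likely came from semantics, not paralinguistic cues.
# Assumptions: both checkpoints and the transcript/clip are illustrative.
from transformers import pipeline

text_clf = pipeline("text-classification",
                    model="j-hartmann/emotion-english-distilroberta-base")
audio_clf = pipeline("audio-classification",
                     model="superb/wav2vec2-base-superb-er")

transcript = "I just won the lottery!"   # happy words...
clip = "happy_words_sad_prosody.wav"     # ...acted with sad prosody

text_pred = text_clf(transcript)[0]["label"]
audio_pred = audio_clf(clip)[0]["label"]
print(f"text-only: {text_pred} | audio: {audio_pred}")
# If audio_pred matches text_pred rather than the acted prosody,
# the model is reading the words, not the voice.
```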