Multimodal Learning Relevance: 8/10

Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

Xiaofeng Yu, Jiaheng Dong, Jean Honorio, Abhirup Ghosh, Hong Jia, Ting Dang

arXiv: 2603.08230v1 Published: 2026-03-09 Updated: 2026-03-09


AI Summary

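Proposes an ambiguous emotion recognition method for large audio-language models (LALMs) that improves their understanding of complex emotions through distributional reasoning and chain-of-thought reasoning.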

Main Contributions

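An ambiguity-aware objective that aligns the model's predictions with human perceptual distributions
A structured, ambiguity-aware chain-of-thought supervision scheme that guides reasoning over emotional cues
Experiments on IEMOCAP and CREMA-D showing consistent improvements across SFT, DPO, and GRPO training strategies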

Methodology

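Reformulates ambiguous emotion recognition as a distributional reasoning problem and strengthens the reasoning ability of LALMs through an ambiguity-aware objective and chain-of-thought supervision.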
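The summary does not give the exact form of the ambiguity-aware objective. One common way to "align predictions with human perceptual distributions", shown here purely as an assumption, is to turn per-annotator votes into a soft label and measure the KL divergence between that distribution and the model's predicted emotion distribution. The label set (EMOTIONS), the helper names (soft_label, softmax, kl_divergence), and the choice of KL divergence are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical 4-class label set (assumption); the label sets actually
# used for IEMOCAP and CREMA-D in the paper may differ.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_label(votes, labels=EMOTIONS):
    """Turn per-annotator votes into an empirical emotion distribution."""
    if not votes:
        raise ValueError("need at least one annotator vote")
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]


def softmax(logits):
    """Numerically stable softmax over the model's per-emotion scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): p is the human distribution, q the model prediction.

    Terms with p_i = 0 contribute nothing; q_i is clamped to eps so a
    zero-probability prediction does not cause log(0).
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Three annotators: two perceive "angry", one perceives "neutral".
target = soft_label(["angry", "angry", "neutral"])  # ≈ [0.667, 0.0, 0.333, 0.0]
pred = softmax([1.0, 0.0, 0.5, -1.0])  # hypothetical model scores
loss = kl_divergence(target, pred)
print(f"target={target} loss={loss:.4f}")
```

Unlike a single one-hot label, the soft target keeps the minority "neutral" reading, so a model that puts all of its probability on "angry" is still penalized. This is the sense in which ambiguity becomes a distributional prediction problem rather than a single-label classification problem.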

Original Abstract

Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and a structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO training strategies.

arXiv Categories

cs.SD cs.AI eess.AS
