Multimodal Learning Relevance: 9/10

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy
arXiv: 2602.23300v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

Proposes MiSTER-E, a Mixture-of-Experts model for emotion recognition in conversations that effectively fuses speech and text information.

Key Contributions

  • Proposes the MiSTER-E model, which decouples modality-specific context modeling from multimodal information fusion
  • Introduces a supervised contrastive loss and KL-divergence regularization to strengthen cross-modal consistency
  • Experiments show MiSTER-E outperforms existing methods on multiple benchmark datasets

Methodology

Pretrained LLMs extract speech and text embeddings, a convolutional-recurrent layer models conversational context, and a learned gating mechanism dynamically fuses the expert predictions.
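The gated expert fusion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the number of classes, the softmax gate, and the toy logits are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(expert_logits, gate_scores):
    """Fuse per-expert class predictions with learned gate weights.

    expert_logits: (num_experts, num_classes) raw logits, one row per
                   expert (e.g. speech-only, text-only, cross-modal).
    gate_scores:   (num_experts,) raw scores from the gating network.
    Returns the fused class distribution of shape (num_classes,).
    """
    expert_probs = softmax(expert_logits, axis=-1)  # per-expert distributions
    weights = softmax(gate_scores)                  # convex gate weights
    return weights @ expert_probs                   # weighted mixture

# Toy example: three experts, four emotion classes.
logits = np.array([[2.0, 0.1, 0.0, -1.0],
                   [1.5, 0.3, 0.2, -0.5],
                   [2.5, -0.2, 0.1, -1.2]])
gates = np.array([0.2, 0.5, 1.8])  # gate favors the cross-modal expert
fused = gated_fusion(logits, gates)
```

Because both the gate weights and the per-expert outputs are softmax-normalized, the fused output is itself a valid probability distribution over emotion classes.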

Original Abstract

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
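The KL-divergence regularization across expert predictions mentioned in the abstract could take a form like the sketch below. The symmetric pairwise averaging is an illustrative assumption; the paper may define the term differently.

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    # KL(p || q) for discrete distributions; eps guards against log(0).
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def expert_consistency_loss(expert_probs):
    """Average symmetric KL divergence over all pairs of experts.

    expert_probs: (num_experts, num_classes) class distributions from
                  the speech-only, text-only, and cross-modal experts.
    Returns a scalar that is 0 when all experts agree exactly and grows
    as their predicted distributions diverge.
    """
    n = len(expert_probs)
    total, terms = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += kl_div(expert_probs[i], expert_probs[j])
            total += kl_div(expert_probs[j], expert_probs[i])
            terms += 2
    return total / terms

# Three experts predicting over four emotion classes.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.60, 0.20, 0.10, 0.10],
                  [0.80, 0.10, 0.05, 0.05]])
loss = expert_consistency_loss(probs)
```

Adding such a term to the training objective penalizes experts whose output distributions disagree, which is one plausible way to encourage the cross-modal alignment the abstract describes.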

Tags

Emotion Recognition Multimodal Learning Mixture-of-Experts Dialogue Systems

arXiv Categories

cs.CL eess.AS