Multimodal Learning Relevance: 9/10

Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation

Lingsi Zhu, Yuefeng Zou, Yunxiang Zhang, Naixiang Zheng, Guoyuan Wang, Jun Yu, Jiaen Liang, Wei Huang, Shengping Liu, Ximin Zheng
arXiv: 2603.14976v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

TAEMI uses text anchoring and cross-modal attention to improve the accuracy of emotional mimicry intensity estimation, and remains notably robust when modality data are missing.

Key Contributions

  • Proposes the TAEMI framework for emotional mimicry intensity estimation
  • Introduces a Text-Anchored Dual Cross-Attention mechanism
  • Integrates Learnable Missing-Modality Tokens and a Modality Dropout strategy to improve robustness

Methodology

TAEMI uses textual information as an anchor, filters noise from the physical streams via a cross-modal attention mechanism, and employs data-augmentation strategies to cope with missing modalities.
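The text-anchored fusion described above can be sketched as two cross-attention blocks in which the text tokens serve as queries over the audio and visual streams. This is a minimal illustrative sketch, not the authors' implementation: the layer sizes, the use of `nn.MultiheadAttention`, and the linear fusion at the end are all assumptions.

```python
import torch
import torch.nn as nn

class TextAnchoredDualCrossAttention(nn.Module):
    """Sketch of a text-anchored dual cross-attention block.

    Text tokens act as queries that attend separately to the audio
    and visual streams (so stable semantic anchors filter the noisy
    frame-level signals), and the two attended summaries are fused.
    Dimensions and the fusion layer are illustrative assumptions.
    """
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text, audio, visual):
        # text: (B, Lt, D); audio: (B, La, D); visual: (B, Lv, D)
        a, _ = self.text_to_audio(text, audio, audio)    # text queries filter audio frames
        v, _ = self.text_to_visual(text, visual, visual) # text queries filter visual frames
        return self.fuse(torch.cat([a, v], dim=-1))      # fused output: (B, Lt, D)
```

Because the text stream supplies the queries, the output length follows the transcript rather than the (possibly corrupted) audio/visual frame counts.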

Original Abstract

Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcripts--which inherently encode a stable, time-independent semantic prior--as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.
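The two robustness tricks named in the abstract can be sketched together: a learnable token substitutes for an absent modality's features, and Modality Dropout randomly treats a present modality as missing during training so the model learns to rely on the remaining streams. This is a hedged sketch under stated assumptions; the dropout probability `p_drop` and the per-stream substitution scheme are not specified in the source.

```python
import torch
import torch.nn as nn

class MissingModalityHandler(nn.Module):
    """Sketch of Learnable Missing-Modality Tokens + Modality Dropout.

    If a modality's features are absent (or randomly dropped during
    training), a learnable token is broadcast in their place, so the
    downstream fusion always receives a well-formed tensor.
    The drop probability is an illustrative assumption.
    """
    def __init__(self, dim: int = 256, p_drop: float = 0.3):
        super().__init__()
        self.missing_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.p_drop = p_drop

    def forward(self, feats, batch: int = 1, length: int = 1):
        # feats: (B, L, D) tensor, or None when the modality is absent
        missing = feats is None
        if not missing and self.training and torch.rand(()) < self.p_drop:
            missing = True  # modality dropout: pretend this stream is absent
        if missing:
            b = batch if feats is None else feats.size(0)
            l = length if feats is None else feats.size(1)
            return self.missing_token.expand(b, l, -1)
        return feats
```

At inference time (`.eval()`), dropout is disabled and the token is used only for genuinely missing streams, matching the abstract's goal of graceful degradation rather than catastrophic failure.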

Tags

Affective Computing · Multimodal Fusion · Attention Mechanism · Robustness · Mimicry Learning

arXiv Categories

cs.MM cs.CV