Multimodal Learning (Relevance: 9/10)

Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

Junhyeong Byeon, Jeongyeol Kim, Sejoon Lim
arXiv: 2603.11971v1 Published: 2026-03-12 Updated: 2026-03-12

AI Summary

Proposes a multimodal emotion recognition framework based on bi-directional cross-attention and temporal modeling that improves recognition performance in unconstrained (in-the-wild) environments.

Key Contributions

  • Visual and audio feature extraction built on pre-trained CLIP and Wav2Vec 2.0 backbones
  • A bi-directional cross-attention fusion module that enriches cross-modal contextual information
  • A text-guided contrastive objective that encourages semantically aligned visual representations
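The last contribution describes a CLIP-style alignment between visual and text embeddings. A minimal sketch of a symmetric InfoNCE-style loss is shown below; the function name, temperature value, and symmetric averaging are my own illustrative assumptions, not details confirmed by the paper:

```python
import numpy as np

def log_softmax(x, axis=-1):
    # numerically stable log-softmax
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def text_guided_contrastive_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss (illustrative).

    Matched (visual, text) pairs sit on the diagonal of the similarity
    matrix; the loss pulls them together and pushes mismatches apart.
    """
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature                      # (N, N) cosine similarities
    v2t = -np.diag(log_softmax(logits, axis=1)).mean()    # visual -> text direction
    t2v = -np.diag(log_softmax(logits, axis=0)).mean()    # text -> visual direction
    return (v2t + t2v) / 2.0
```

In practice the text embeddings would come from CLIP's frozen text encoder applied to emotion-label prompts, with only the trainable parts of the visual pathway receiving gradients.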

Methodology

Pre-trained models extract the multimodal features, a Temporal Convolutional Network (TCN) models temporal dependencies, and a bi-directional cross-attention module fuses the features.
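The bi-directional fusion step can be sketched as two symmetric cross-attention passes: visual features attend to audio, and audio features attend to visual. This is a single-head NumPy illustration under my own assumptions (random projection weights, residual connection, mean-pool-and-concatenate head input), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: `queries` attend to `keys_values`."""
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])   # (T_q, T_kv) attention logits
    return softmax(scores, axis=-1) @ V          # (T_q, d) attended context

def bidirectional_fusion(visual, audio, rng):
    """Visual attends to audio AND audio attends to visual, symmetrically."""
    d = visual.shape[-1]
    proj = lambda: rng.normal(scale=d ** -0.5, size=(d, d))  # stand-in learned weights
    v_ctx = cross_attention(visual, audio, proj(), proj(), proj())
    a_ctx = cross_attention(audio, visual, proj(), proj(), proj())
    # residual connection, then temporal mean-pool and concatenate
    # (one plausible input to a lightweight classification head)
    return np.concatenate([(visual + v_ctx).mean(0), (audio + a_ctx).mean(0)])
```

Note the two streams may have different lengths (video frames vs. audio frames); cross-attention handles that naturally, since the query and key/value sequences are never required to align.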

Original Abstract

Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
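The abstract's temporal component, a TCN over fixed-length video windows, is built from causal dilated 1-D convolutions. A minimal sketch follows; kernel size, dilation schedule, and channel counts are illustrative choices, not values from the paper:

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """x: (T, C_in) per-frame features; w: (K, C_in, C_out) filter taps.

    Causal: the output at time t sees only inputs at t, t-d, t-2d, ...
    """
    T, C_in = x.shape
    K, _, C_out = w.shape
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros((pad, C_in)), x], axis=0)  # left-pad with zeros
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):  # tap k looks k*dilation steps into the past
            out[t] += xp[pad + t - k * dilation] @ w[k]
    return out

def tiny_tcn(x, weights):
    """Stack of dilated causal layers (dilations 1, 2, 4) with ReLU, TCN-style."""
    for i, w in enumerate(weights):
        x = np.maximum(causal_dilated_conv1d(x, w, dilation=2 ** i), 0.0)
    return x
```

Doubling the dilation per layer grows the receptive field exponentially with depth, which is why a shallow TCN can cover a whole fixed-length window without recurrence.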

Tags

Emotion Recognition · Multimodal Learning · Cross-Attention · Temporal Modeling

arXiv Categories

cs.CV cs.AI