Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling
AI Summary
We propose a multimodal emotion recognition framework based on bi-directional cross-attention and temporal modeling, improving emotion recognition performance in unconstrained (in-the-wild) environments.
Main Contributions
- Propose visual and audio feature extraction based on CLIP and Wav2Vec 2.0
- Design a bi-directional cross-attention fusion module to enhance cross-modal contextual information
- Introduce a text-guided contrastive objective to encourage semantically aligned visual representations
Methodology
Pre-trained models extract multimodal features, a Temporal Convolutional Network (TCN) models temporal dependencies, and the visual and audio streams are fused via bi-directional cross-attention.
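To make the fusion step concrete, here is a minimal NumPy sketch of symmetric (bi-directional) cross-attention: each modality's token sequence attends to the other's, and the attended context is combined with the original features before pooling. This is an illustrative sketch, not the paper's exact design; the single-head unprojected attention and the concatenate-then-mean-pool combination are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d_k):
    # Scaled dot-product attention: query tokens attend to key_value tokens.
    scores = query @ key_value.T / np.sqrt(d_k)   # (Tq, Tkv)
    return softmax(scores, axis=-1) @ key_value   # (Tq, d)

def bidirectional_fusion(visual, audio):
    # Symmetric cross-attention: each modality is contextualized by the other.
    d = visual.shape[-1]
    audio_ctx = cross_attention(visual, audio, d)    # audio context per visual frame
    visual_ctx = cross_attention(audio, visual, d)   # visual context per audio frame
    fused_visual = np.concatenate([visual, audio_ctx], axis=-1)   # (Tv, 2d)
    fused_audio = np.concatenate([audio, visual_ctx], axis=-1)    # (Ta, 2d)
    # Temporal mean pooling before a lightweight classification head (assumption).
    return np.concatenate([fused_visual.mean(axis=0), fused_audio.mean(axis=0)])

# Toy demo: 8 visual frames and 12 audio frames, both with 16-dim features.
rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 16))
audio = rng.standard_normal((12, 16))
fused = bidirectional_fusion(visual, audio)   # (64,) fused clip-level vector
```

In the actual framework the inputs would be frozen CLIP and Wav2Vec 2.0 features (with the visual stream first passed through the TCN), and the attention would use learned query/key/value projections.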
Original Abstract
Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
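The text-guided contrastive objective mentioned in the abstract can be read as a CLIP-style symmetric InfoNCE loss between visual embeddings and CLIP text embeddings of the emotion labels. The NumPy sketch below illustrates that idea; the cosine-similarity logits and the fixed temperature of 0.07 are assumptions, since the abstract does not give the exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def text_guided_contrastive_loss(visual_feats, text_feats, temperature=0.07):
    # Symmetric InfoNCE over a batch: visual embedding i should match the
    # text embedding of its emotion label (row i pairs with row i).
    v = l2_normalize(visual_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature          # (B, B) scaled cosine similarities
    labels = np.arange(len(v))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)           # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average of visual->text and text->visual directions, as in CLIP.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy demo: perfectly aligned pairs vs. unrelated text embeddings.
rng = np.random.default_rng(1)
feats = rng.standard_normal((4, 8))
aligned = text_guided_contrastive_loss(feats, feats)
mismatched = text_guided_contrastive_loss(feats, rng.standard_normal((4, 8)))
```

Aligned visual/text pairs drive the loss toward zero, which is what pushes the visual representations toward the semantics of the label text.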