Multimodal Learning (relevance: 9/10)

Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

Jun Yu, Naixiang Zheng, Guoyuan Wang, Yunxiang Zhang, Lingsi Zhu, Jiaen Liang, Wei Huang, Shengping Liu
arXiv: 2603.08034v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

For the ABAW Expression Recognition challenge, the authors propose a robust multimodal framework that effectively handles missing modalities and class imbalance.

Key Contributions

  • Propose a multimodal framework built on safe cross-attention and modality dropout
  • Apply focal loss to mitigate class imbalance, plus a sliding-window soft voting strategy to reduce frame-level jitter
  • Achieve 60.79% accuracy and an F1-score of 0.5029 on the Aff-Wild2 validation set
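The summary names focal loss but gives no formulation; a minimal NumPy sketch of the standard focal loss follows. The gamma value and the optional per-class alpha weights are assumptions, not reported hyperparameters:

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=None):
    """Standard focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs:  (N, C) softmax probabilities
    labels: (N,)   integer class ids
    gamma=2.0 and the optional alpha class weights are assumptions.
    """
    p_t = probs[np.arange(len(labels)), labels]   # probability of the true class
    w = (1.0 - p_t) ** gamma                      # down-weight easy samples
    if alpha is not None:
        w = w * alpha[labels]
    return float(np.mean(-w * np.log(np.clip(p_t, 1e-12, 1.0))))

# An easy, confident sample contributes far less than a hard one,
# which is what counteracts the head classes of a long-tail dataset:
probs = np.array([[0.9, 0.05, 0.05],   # easy: p_t = 0.9
                  [0.3, 0.4,  0.3]])   # hard: p_t = 0.3
labels = np.array([0, 0])
```

With gamma=0 this reduces to plain cross-entropy, which is a quick sanity check for the implementation.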

Methodology

A dual-branch Transformer architecture fuses visual and audio features, handles missing modalities via safe cross-attention and modality dropout, and further optimizes the training strategy.
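On one reading of the method description, "safe" cross-attention degrades gracefully by falling back to the query stream (audio) when the key/value stream (visual) is absent, while modality dropout simulates missing inputs during training. The sketch below is an illustrative assumption, not the authors' code; the dropout probability and all function names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def safe_cross_attention(q, kv, kv_present):
    """Cross-attention that passes the query stream through unchanged
    when the key/value modality is missing (our reading of 'safe').

    q:          (T, d) query tokens (e.g. audio branch)
    kv:         (T, d) key/value tokens (e.g. visual branch)
    kv_present: bool, whether the kv modality is available for this clip
    """
    if not kv_present:
        return q                        # no visual cues: rely on audio alone
    d = q.shape[-1]
    attn = softmax(q @ kv.T / np.sqrt(d))
    return q + attn @ kv                # residual fusion of attended kv

def modality_dropout(feats, p=0.3, rng=rng):
    """During training, drop an entire modality with probability p so the
    network learns to cope with missing inputs (p=0.3 is an assumption)."""
    return None if rng.random() < p else feats
```

Returning the query unchanged when the other branch is absent is what lets the network produce audio-only predictions without NaNs or degenerate attention weights.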

Original Abstract

Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
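The sliding-window soft voting described in the abstract can be sketched as averaging per-frame class probabilities over a centered temporal window before taking the argmax, which smooths frame-level classification jitter; the window size below is an assumption:

```python
import numpy as np

def sliding_window_soft_vote(frame_probs, win=5):
    """Average per-frame class probabilities over a centered window of
    `win` frames, then argmax. Edge frames reuse the border values.
    win=5 is an assumed window size, not reported in this summary.

    frame_probs: (T, C) per-frame softmax probabilities
    returns:     (T,)   smoothed per-frame class predictions
    """
    T, _ = frame_probs.shape
    half = win // 2
    padded = np.pad(frame_probs, ((half, half), (0, 0)), mode="edge")
    smoothed = np.stack([padded[t:t + win].mean(axis=0) for t in range(T)])
    return smoothed.argmax(axis=1)

# A single jittery frame inside a stable run is voted back to the
# surrounding label by its neighbors:
probs = np.full((7, 2), [0.8, 0.2])
probs[3] = [0.3, 0.7]                 # one frame flips on its own
preds = sliding_window_soft_vote(probs, win=5)
```

Averaging probabilities ("soft" voting) rather than majority-voting hard labels preserves confidence information, so a strongly confident outlier can still win when its neighbors are uncertain.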

Tags

Multimodal Learning, Emotion Recognition, Transformer, Cross-Attention, Data Imbalance

arXiv Categories

cs.CV cs.AI