OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
AI Summary
OmniVideo-R1 strengthens audio-visual reasoning through query intention and modality attention, improving mixed-modality understanding performance.
Main Contributions
- Proposes query-intensive grounding based on a self-supervised learning paradigm
- Proposes modality-attentive fusion built on a contrastive learning paradigm
- Outperforms existing models on multiple benchmarks
Methodology
Cross-modal associations are learned through query-intensive grounding, and information from the different modalities is fused via modality attention, thereby strengthening reasoning.
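The paper's exact fusion mechanism is not detailed in this summary; the following is a minimal NumPy sketch of one plausible reading of modality-attentive fusion, assuming scaled dot-product attention where a query embedding scores per-modality (e.g. audio and visual) embeddings and the fused representation is their attention-weighted sum. The function name and shapes are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def modality_attentive_fusion(query, modality_feats):
    """Weight each modality embedding by its relevance to the query,
    then return the attention-weighted fused representation.
    All vectors are assumed to share one embedding dimension d."""
    feats = np.stack(modality_feats)               # (num_modalities, d)
    scores = feats @ query / np.sqrt(len(query))   # scaled dot-product relevance
    weights = softmax(scores)                      # attention over modalities
    fused = weights @ feats                        # (d,) fused embedding
    return fused, weights

# Toy example with random audio/visual embeddings (illustrative only).
rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)
audio, visual = rng.normal(size=d), rng.normal(size=d)
fused, weights = modality_attentive_fusion(query, [audio, visual])
```

The design choice here is that attention is computed across modalities rather than across time, so a question about sound can up-weight the audio stream while one about appearance up-weights the visual stream.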
Original Abstract
While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
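The abstract says modality-attentive fusion is built on a contrastive learning paradigm but gives no formula; a common choice for aligning paired audio and visual embeddings is the symmetric InfoNCE loss, sketched below in NumPy under that assumption. The function name, batch construction, and temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(audio, visual, temperature=0.1):
    """Symmetric InfoNCE over a batch: matched audio/visual pairs (same row)
    are positives; every other row in the batch serves as a negative."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (B, B) cosine similarities, scaled

    def xent(l):
        # Cross-entropy with the positives on the diagonal (stable log-softmax).
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy check: roughly aligned pairs should score a lower loss than shuffled ones.
rng = np.random.default_rng(1)
B, d = 4, 16
audio = rng.normal(size=(B, d))
visual = audio + 0.05 * rng.normal(size=(B, d))  # near-copies as positive pairs
loss_aligned = info_nce(audio, visual)
loss_shuffled = info_nce(audio, visual[::-1])    # deliberately mismatched pairs
```

Minimizing such a loss pulls embeddings of the same moment's audio and video together, which is one standard way to ground the "think with omnimodal cues" behavior the abstract describes.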