Multimodal Learning  Relevance: 9/10

OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
arXiv: 2602.05847v1  Published: 2026-02-05  Updated: 2026-02-05

AI Summary

OmniVideo-R1 strengthens audio-visual reasoning with query intention and modality attention, improving mixed-modality understanding performance.

Key Contributions

  • Proposes query-intensive grounding based on self-supervised learning
  • Proposes modality-attentive fusion based on contrastive learning
  • Outperforms existing models on multiple benchmarks

Methodology

The model learns cross-modal associations through query-intensive grounding, and fuses information across modalities via modality-attentive fusion, thereby improving reasoning ability.
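The summary does not specify how the modality-attentive fusion is computed. As a rough illustration of the general idea of query-conditioned attention over modality features (a generic sketch only, not the paper's implementation; the function names, vectors, and scaled dot-product scoring are assumptions):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def modality_attentive_fusion(query, modality_feats):
    """Fuse per-modality feature vectors with query-conditioned attention.

    query: list[float] query embedding.
    modality_feats: dict mapping modality name -> feature vector (same dim).
    Returns (fused vector, {modality: attention weight}).
    """
    names = list(modality_feats)
    scale = math.sqrt(len(query))
    # Score each modality by scaled dot product with the query.
    scores = [dot(modality_feats[n], query) / scale for n in names]
    weights = softmax(scores)
    # Weighted sum of modality features.
    dim = len(query)
    fused = [sum(w * modality_feats[n][i] for w, n in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))

# Toy example with hypothetical video/audio embeddings.
query = [0.2, -0.1, 0.5, 0.3]
feats = {"video": [0.4, 0.1, 0.6, 0.2], "audio": [-0.3, 0.2, 0.1, 0.5]}
fused, w = modality_attentive_fusion(query, feats)
```

A query that aligns more strongly with one modality's features receives a higher weight for that modality, so the fused representation leans toward the modality most relevant to the question.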

Original Abstract

While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.

Tags

Audio-visual Understanding  Multimodal Learning  Self-supervised Learning  Contrastive Learning

arXiv Categories

cs.AI