OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
AI Summary
OmniVideo-R1 strengthens audio-visual reasoning through query intention and modality attention, improving mixed-modality understanding performance.
Main Contributions
- Proposes query-intensive grounding based on a self-supervised learning paradigm
- Proposes modality-attentive fusion built on a contrastive learning paradigm
- Outperforms existing models on multiple benchmarks
Methodology
Cross-modal associations are learned through query-intensive grounding, and information from the different modalities is fused via modality attention, thereby strengthening reasoning.
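The paper's exact fusion mechanism is not detailed in this summary; the following is a minimal NumPy sketch of one plausible reading of modality-attentive fusion, assuming scaled dot-product attention where a query embedding scores per-modality (e.g. audio and visual) embeddings and the fused representation is their attention-weighted sum. The function name and shapes are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def modality_attentive_fusion(query, modality_feats):
    """Weight each modality embedding by its relevance to the query,
    then return the attention-weighted fused representation.
    All vectors are assumed to share one embedding dimension d."""
    feats = np.stack(modality_feats)               # (num_modalities, d)
    scores = feats @ query / np.sqrt(len(query))   # scaled dot-product relevance
    weights = softmax(scores)                      # attention over modalities
    fused = weights @ feats                        # (d,) fused embedding
    return fused, weights

# Toy example with random audio/visual embeddings (illustrative only).
rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)
audio, visual = rng.normal(size=d), rng.normal(size=d)
fused, weights = modality_attentive_fusion(query, [audio, visual])
```

The design choice here is that attention is computed across modalities rather than across time, so a question about sound can up-weight the audio stream while one about appearance up-weights the visual stream.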
Original Abstract
While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
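The abstract says modality-attentive fusion is built on a contrastive learning paradigm but gives no formula; a common choice for aligning paired audio and visual embeddings is the symmetric InfoNCE loss, sketched below in NumPy under that assumption. The function name, batch construction, and temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(audio, visual, temperature=0.1):
    """Symmetric InfoNCE over a batch: matched audio/visual pairs (same row)
    are positives; every other row in the batch serves as a negative."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (B, B) cosine similarities, scaled

    def xent(l):
        # Cross-entropy with the positives on the diagonal (stable log-softmax).
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy check: roughly aligned pairs should score a lower loss than shuffled ones.
rng = np.random.default_rng(1)
B, d = 4, 16
audio = rng.normal(size=(B, d))
visual = audio + 0.05 * rng.normal(size=(B, d))  # near-copies as positive pairs
loss_aligned = info_nce(audio, visual)
loss_shuffled = info_nce(audio, visual[::-1])    # deliberately mismatched pairs
```

Minimizing such a loss pulls embeddings of the same moment's audio and video together, which is one standard way to ground the "think with omnimodal cues" behavior the abstract describes.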