Multimodal Learning — relevance: 9/10

STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

Xiaowen Zhang, Zhi Gao, Licheng Jiao, Lingling Li, Qing Li
arXiv: 2602.11730v1 · published: 2026-02-12 · updated: 2026-02-12

AI Summary

Proposes STVG-R1, which achieves state-of-the-art results on spatial-temporal video grounding (STVG) through visual prompting and reinforcement learning.

Main Contributions

  • Proposes a visual-prompting-based STVG framework that avoids cross-modal coordinate alignment
  • Introduces reinforcement learning to jointly optimize temporal accuracy, spatial consistency, and structured output format
  • Achieves SOTA on multiple benchmarks, with zero-shot generalization ability
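The first contribution hinges on giving each object a temporally consistent ID before the VLM ever sees coordinates. A minimal sketch of how such IDs could be assigned across frames is a greedy IoU-based association; this is purely illustrative (the summary does not specify the paper's actual tracking/association method), and `assign_ids` and its threshold are hypothetical names:

```python
# Hedged sketch: assign temporally consistent instance IDs to per-frame
# detections via greedy IoU matching. The paper's real association
# procedure is not described in this summary; this is an assumption.

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_ids(frames_boxes, iou_thresh=0.5):
    """For each frame's box list, return a parallel list of instance IDs.

    A box inherits the ID of the best-overlapping track from earlier
    frames; otherwise it starts a new track. IDs stay stable over time,
    so they can be rendered into frames as visual prompts.
    """
    tracks = {}   # id -> last seen box
    next_id = 0
    all_ids = []
    for boxes in frames_boxes:
        frame_ids, used = [], set()
        for box in boxes:
            best_id, best_iou = None, iou_thresh
            for tid, last in tracks.items():
                if tid in used:
                    continue
                iou = box_iou(box, last)
                if iou > best_iou:
                    best_id, best_iou = tid, iou
            if best_id is None:
                best_id, next_id = next_id, next_id + 1
            tracks[best_id] = box
            used.add(best_id)
            frame_ids.append(best_id)
        all_ids.append(frame_ids)
    return all_ids
```

With consistent IDs in hand, the model only needs to answer "which ID, over which frames," rather than regress coordinates, which is the compact instance-level identification the contribution describes.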

Methodology

Reformulates per-frame coordinate prediction as instance-level identification: each object is assigned a unique ID embedded into the video as a visual prompt, and the model is optimized with reinforcement learning.
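The abstract says the RL stage uses a task-driven reward that jointly optimizes temporal accuracy, spatial consistency, and structural format regularization. A minimal sketch of such a composite reward is below; the individual terms, weights, and the `<answer>` template are assumptions for illustration, not the paper's actual formulation:

```python
# Hedged sketch of a composite STVG reward (weights and template assumed).

def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) frames."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def spatial_consistency(pred_ids, gt_ids):
    """Fraction of frames where the predicted instance ID matches GT."""
    if not gt_ids:
        return 0.0
    hits = sum(p == g for p, g in zip(pred_ids, gt_ids))
    return hits / len(gt_ids)

def format_reward(answer):
    """1.0 if the answer follows the expected structured template."""
    ok = answer.startswith("<answer>") and answer.endswith("</answer>")
    return 1.0 if ok else 0.0

def stvg_reward(pred_seg, gt_seg, pred_ids, gt_ids, answer,
                w_t=0.5, w_s=0.4, w_f=0.1):
    """Weighted sum of the three terms named in the abstract."""
    return (w_t * temporal_iou(pred_seg, gt_seg)
            + w_s * spatial_consistency(pred_ids, gt_ids)
            + w_f * format_reward(answer))
```

Because the policy outputs an ID plus a temporal span rather than per-frame coordinates, all three reward terms are cheap to compute, which is what makes RL over the full task tractable.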

Original Abstract

In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.

Tags

Spatial-Temporal Video Grounding · Visual Prompting · Reinforcement Learning · Multimodal Learning

arXiv Categories

cs.CV