Multimodal Learning Relevance: 9/10

Are Video Reasoning Models Ready to Go Outside?

Yangfan He, Changgyu Boo, Jaehong Yoon
arXiv: 2603.10652v1 Published: 2026-03-11 Updated: 2026-03-11

AI Summary

Proposes ROVA, a framework that strengthens the robustness of video understanding models under real-world perturbations, and constructs the PVRBench benchmark.

Key Contributions

  • Propose the ROVA training framework, improving model robustness in perturbed environments
  • Introduce a difficulty-aware online training strategy that adaptively selects informative samples
  • Construct the PVRBench benchmark to evaluate model performance under real-world perturbations

Methodology

ROVA trains under spatio-temporal perturbations by modeling a robustness-aware consistency reward. It adopts difficulty-aware online training, dynamically adjusting sample difficulty according to the model's evolving capability.
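The two ideas above can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: `consistency_reward`, `difficulty_weight`, the `alpha` mixing parameter, and the bell-shaped weighting formula are all hypothetical choices that merely instantiate "reward agreement between clean and perturbed predictions" and "prioritize samples of intermediate difficulty."

```python
# Illustrative sketch of a robustness-aware consistency reward and
# difficulty-aware sample weighting (assumed forms, not ROVA's code).

def consistency_reward(clean_pred: str, perturbed_pred: str,
                       gold: str, alpha: float = 0.5) -> float:
    """Mix correctness under perturbation with clean/perturbed agreement.

    alpha weights correctness on the perturbed clip; (1 - alpha) weights
    consistency between the clean and perturbed predictions.
    """
    correct = 1.0 if perturbed_pred == gold else 0.0
    consistent = 1.0 if perturbed_pred == clean_pred else 0.0
    return alpha * correct + (1.0 - alpha) * consistent

def difficulty_weight(success_rate: float) -> float:
    """Difficulty-aware weighting: peak at success_rate = 0.5.

    Samples the model always solves (rate ~1) or always fails (rate ~0)
    carry little training signal; intermediate-difficulty samples
    get the highest weight. Re-estimating success_rate online mirrors
    the self-reflective difficulty re-estimation described above.
    """
    return 4.0 * success_rate * (1.0 - success_rate)
```

For example, a perturbed prediction that is correct and matches the clean prediction earns reward 1.0, while one that is wrong and inconsistent earns 0.0; a sample the model currently solves half the time receives the maximum weight.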

Original Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (Qwen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.

Tags

Video Understanding · Robustness · Multimodal · Vision-Language Models · Perturbation

arXiv Categories

cs.CV cs.AI