Multimodal Learning (Relevance: 9/10)

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan
arXiv: 2603.24558v1 · Published: 2026-03-25 · Updated: 2026-03-25

AI Summary

LensWalk improves the accuracy, robustness, and interpretability of long-video understanding by letting an LLM autonomously control its own visual observation.

Key Contributions

  • Proposes the LensWalk framework, which gives an LLM active control over how it observes a video
  • Dynamically adjusts the temporal scope and sampling density of observation via a reason-plan-observe loop (sketched under Methodology below)
  • Achieves substantial gains on long-video benchmarks without any model fine-tuning

Methodology

Builds a reason-plan-observe loop: at each step the LLM plans an observation strategy from its current reasoning state, then uses VLM-based tools to observe the video and gather evidence.
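To make the loop concrete, here is a minimal Python sketch of one possible reason-plan-observe cycle. This is an illustrative reconstruction, not the paper's code: the names (`ObservationSpec`, `Plan`, `run_lenswalk_loop`) and the injected `reason_and_plan` / `observe` callables are hypothetical placeholders standing in for the LLM reasoner and the parameterized VLM tools.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical types, invented for illustration; not the paper's API.
@dataclass
class ObservationSpec:
    start_s: float   # temporal scope: segment start, in seconds
    end_s: float     # temporal scope: segment end, in seconds
    fps: float       # sampling density: frames sampled per second

@dataclass
class Plan:
    final_answer: Optional[str]       # set when the agent decides it has enough evidence
    spec: Optional[ObservationSpec]   # otherwise, the next observation to perform
    tool_prompt: str = ""             # instruction passed to the VLM tool

def run_lenswalk_loop(
    question: str,
    reason_and_plan: Callable[[str, list[str]], Plan],  # LLM reasoner (assumed interface)
    observe: Callable[[ObservationSpec, str], str],     # VLM-based tool (assumed interface)
    max_steps: int = 8,
) -> str:
    """Reason-plan-observe: the LLM alternates between planning what to look
    at and reading back textual evidence extracted by the VLM tool."""
    evidence: list[str] = []
    for _ in range(max_steps):
        plan = reason_and_plan(question, evidence)
        if plan.final_answer is not None:   # enough evidence gathered; stop
            return plan.final_answer
        assert plan.spec is not None
        # Broad scan (long scope, low fps) or focused read (short scope, high fps),
        # depending on what the agent's current chain of thought needs.
        obs = observe(plan.spec, plan.tool_prompt)
        evidence.append(
            f"[{plan.spec.start_s:.0f}-{plan.spec.end_s:.0f}s @ {plan.spec.fps}fps] {obs}"
        )
    # Observation budget exhausted: answer from the evidence collected so far.
    final = reason_and_plan(question, evidence + ["(budget exhausted; answer now)"])
    return final.final_answer or ""
```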

Original Abstract

The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
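As a rough illustration of the three tool uses the abstract names (broad scans, focused fact extraction, and multi-moment stitching), the hypothetical `ObservationSpec` from the sketch above could be parameterized as follows; the concrete numbers are invented for illustration, not taken from the paper.

```python
# Broad scan: sparse sampling across a 1-hour video to locate cues.
broad_scan = ObservationSpec(start_s=0, end_s=3600, fps=0.1)

# Focused read: dense sampling of one short segment for fact extraction.
focused_read = ObservationSpec(start_s=742, end_s=760, fps=2.0)

# Holistic verification: stitch evidence from several distant moments.
stitch_specs = [ObservationSpec(s, s + 10, fps=1.0) for s in (120, 900, 2400)]
```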

Tags

Agent · Video Understanding · Vision-Language Model · Reasoning

arXiv Categories

cs.CV cs.AI