AI Agents relevance: 8/10

DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Hanbing Li, Long Chen, Zhi-Xin Yang, Jiwen Lu
arXiv: 2604.00813v1  Published: 2026-04-01  Updated: 2026-04-01

AI Summary

Proposes DVGT-2, an end-to-end autonomous driving model that outputs dense geometric information and trajectory planning in an online, streaming manner.

Key Contributions

  • Proposes the Vision-Geometry-Action (VGA) paradigm, emphasizing the importance of dense 3D geometric information
  • Designs the streaming DVGT-2 model, enabling real-time geometry reconstruction and planning
  • Adopts temporal causal attention and a sliding-window strategy to improve efficiency

Methodology

Uses temporal causal attention and a cache of historical features to process inputs in a sliding-window streaming fashion, jointly outputting dense geometry and a planned trajectory for the current frame.
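The streaming mechanism described above can be illustrated with a minimal sketch: each incoming frame attends only to itself and to cached features of past frames (temporal causality by construction), and a sliding window evicts features older than a fixed interval so per-frame cost stays constant. This is an illustrative single-head NumPy toy, not the paper's implementation; the class name, shapes, and window size are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class StreamingCausalAttention:
    """Single-head causal attention over a sliding window of cached
    per-frame key/value features (hypothetical minimal sketch)."""

    def __init__(self, dim, window=4, seed=0):
        rng = np.random.default_rng(seed)
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.window = window
        self.k_cache = []  # keys of past frames, oldest first
        self.v_cache = []  # values of past frames, oldest first

    def step(self, frame_feat):
        """Process one frame online; attend only to cached past frames
        and the current one, never to the future."""
        q = frame_feat @ self.wq
        self.k_cache.append(frame_feat @ self.wk)
        self.v_cache.append(frame_feat @ self.wv)
        # Sliding window: evict features outside the interval so that
        # compute stays bounded instead of growing with sequence length.
        if len(self.k_cache) > self.window:
            self.k_cache.pop(0)
            self.v_cache.pop(0)
        keys = np.stack(self.k_cache)   # (t, dim), t <= window
        vals = np.stack(self.v_cache)   # (t, dim)
        attn = softmax(keys @ q / np.sqrt(q.shape[0]))  # (t,)
        return attn @ vals              # (dim,) fused current-frame feature
```

Because past keys/values are cached rather than recomputed, each `step` touches at most `window` frames, which mirrors how the paper's historical caches avoid repetitive computation during on-the-fly inference.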

Original Abstract

End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.

Tags

Autonomous Driving  3D Geometry  Transformer  Streaming

arXiv Categories

cs.CV cs.AI cs.RO