Multimodal Learning relevance: 9/10

From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

Sining Ang, Yuguang Yang, Chenxu Dang, Canyu Chen, Cheng Chi, Haiyan Liu, Xuanyao Mao, Jason Bao, Xuliang, Bingchuan Sun, Yan Wang
arXiv: 2602.10719v1 Published: 2026-02-11 Updated: 2026-02-11

AI Summary

This paper studies the complementarity between VLM and vision-only backbones in end-to-end driving and proposes hybrid driving schemes that combine the strengths of both.

Key Contributions

  • Identified behavioral differences between the VLM and vision-only backbones in driving
  • Proposed HybridDriveVLA, which combines the strengths of the VLM and vision-only backbones
  • Implemented DualDriveVLA, a fast-slow policy that balances performance and efficiency

Methodology

The performance of the VLM and vision-only backbones is compared experimentally; an oracle selects the better trajectory per scenario to establish an upper bound, and a learned scorer is designed to fuse the outputs of the two backbones.
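The oracle upper bound described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes per-scenario PDMS scores are available for both branches (the `vit_scores` and `vlm_scores` lists below are hypothetical values) and simply takes the better branch in each scenario.

```python
# Hedged sketch: per-scenario oracle upper bound over two branches.
# The score lists are hypothetical PDMS values, one per test scenario.

def oracle_pdms(vit_scores, vlm_scores):
    """Mean PDMS when an oracle picks the better branch per scenario."""
    assert len(vit_scores) == len(vlm_scores)
    best = [max(a, b) for a, b in zip(vit_scores, vlm_scores)]
    return sum(best) / len(best)

vit = [90.0, 85.0, 95.0]  # hypothetical ViT-branch scores
vlm = [88.0, 92.0, 93.0]  # hypothetical VLM-branch scores
# Oracle picks max(90,88), max(85,92), max(95,93) -> mean of 90, 92, 95
print(oracle_pdms(vit, vlm))  # 92.333...
```

Because the oracle is scored per scenario, its mean can exceed either branch's mean alone, which is what makes the 93.58 PDMS upper bound in the abstract attainable only with perfect selection.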

Original Abstract

Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy--cost trade-off. We revisit this question with a 3-RQ analysis in RecogDrive by instantiating the system with a full VLM and vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM can introduce additional subspaces beyond the vision-only backbones. RQ2: This unique subspace leads to different behaviors in some long-tail scenarios: the VLM tends to be more aggressive whereas ViT is more conservative, and each decisively wins on about 2--3% of test scenarios. With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To fully harness this observation, we propose HybridDriveVLA, which runs both ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast--slow policy: it runs ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
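The fast-slow routing in DualDriveVLA can be sketched as a simple confidence-gated fallback. This is a minimal illustration under stated assumptions, not the released implementation: `vit_plan`, `vlm_plan`, and `scorer` are hypothetical callables standing in for the fast vision-only branch, the slow VLM branch, and the learned trajectory scorer.

```python
# Hedged sketch of a fast-slow policy: run the fast ViT branch by
# default, and invoke the slow VLM branch only when the learned
# scorer's confidence in the fast trajectory falls below a threshold.

def dual_policy(scene, vit_plan, vlm_plan, scorer, threshold=0.5):
    traj = vit_plan(scene)      # fast path: vision-only backbone
    conf = scorer(scene, traj)  # learned confidence in the fast plan
    if conf < threshold:        # low confidence: fall back to the VLM
        traj = vlm_plan(scene)  # slow path: full VLM branch
    return traj
```

Tuning `threshold` trades accuracy for latency: in the paper's setting, a threshold that routes roughly 15% of scenarios to the VLM recovers 91.00 PDMS while keeping most inferences on the fast branch.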

Tags

Autonomous Driving Vision-Language Model End-to-End Learning Multimodal Learning Behavior Planning

arXiv Categories

cs.RO cs.CV