Multimodal Learning — Relevance: 9/10

Vega: Learning to Drive with Natural Language Instructions

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
arXiv: 2603.25741v1 Published: 2026-03-26 Updated: 2026-03-26

AI Summary

Proposes an autonomous driving approach based on a vision-language-world-action model, together with a large-scale instruction-driving dataset.

Main Contributions

  • Constructed InstructScene, a dataset containing diverse driving instructions
  • Proposed Vega, a unified vision-language-world-action model
  • Combined autoregressive and diffusion paradigms for future prediction and trajectory generation

Methodology

Visual and language inputs are processed autoregressively, while future predictions and trajectories are generated with a diffusion paradigm; joint attention enables interaction across modalities.
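The "joint attention" described above can be illustrated with a minimal sketch: each modality gets its own projection layer into a shared width, and a single attention pass runs over the concatenated token sequence so every token can attend across modalities. This is a hypothetical illustration of the idea, not the authors' implementation; the class name, dimensions, and token counts below are all invented for the example.

```python
import torch
import torch.nn as nn

class JointModalityAttention(nn.Module):
    """Sketch of joint attention with per-modality projection layers
    (hypothetical, not Vega's actual architecture)."""
    def __init__(self, dims, d_model=64, n_heads=4):
        super().__init__()
        # Individual projection layers, one per modality.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, modalities):
        # Project each modality into the shared width, then concatenate.
        tokens = torch.cat(
            [p(x) for p, x in zip(self.proj, modalities)], dim=1)
        # Joint attention: every token attends to tokens of all modalities.
        out, _ = self.attn(tokens, tokens, tokens)
        return out

# Toy inputs: vision (8 tokens, 32-d), language (4 tokens, 16-d),
# action/trajectory (6 tokens, 8-d); batch size 2.
vision = torch.randn(2, 8, 32)
language = torch.randn(2, 4, 16)
action = torch.randn(2, 6, 8)
model = JointModalityAttention(dims=[32, 16, 8])
fused = model([vision, language, action])
print(fused.shape)  # torch.Size([2, 18, 64])
```

Keeping the projections separate while sharing the attention is what lets heterogeneous modalities interact in one sequence without forcing them through a single input encoder.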

Original Abstract

Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
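The diffusion side of the pipeline (trajectories generated by iterative denoising conditioned on the fused features) can be sketched schematically. The update rule below is deliberately simplified and the noise-predictor interface is assumed for illustration; it is not the paper's sampler.

```python
import torch

def denoise_trajectory(noise_model, cond, steps=10, horizon=8, dim=2):
    """Simplified diffusion-style trajectory generation: start from pure
    noise and iteratively refine, conditioned on fused features.
    The update rule and model interface are illustrative assumptions."""
    traj = torch.randn(cond.shape[0], horizon, dim)  # start from noise
    for t in reversed(range(steps)):
        # The model predicts the noise component at step t.
        pred_noise = noise_model(traj, cond, t)
        traj = traj - pred_noise / steps  # toy denoising update
    return traj

# Toy conditioning vector and a dummy noise predictor standing in for
# a learned network.
cond = torch.randn(2, 64)
dummy_model = lambda traj, c, t: 0.1 * traj
waypoints = denoise_trajectory(dummy_model, cond)
print(waypoints.shape)  # torch.Size([2, 8, 2])
```

The output is a batch of 2D waypoint sequences; in the paper's setting the conditioning would come from the jointly attended vision and language tokens, which is how instructions steer the generated trajectory.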

Tags

Autonomous Driving · Vision-Language Model · Instruction Following · Motion Planning · World Model

arXiv Categories

cs.CV cs.AI cs.RO