Multimodal Learning · Relevance: 9/10

LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, Fuxi Wen
arXiv: 2603.01928v1 · Published: 2026-03-02 · Updated: 2026-03-02

AI Summary

Proposes the LaST-VLA framework, which improves the performance of vision-language-action models for autonomous driving through latent spatio-temporal reasoning, addressing semantic-perceptual decoupling and perceptual-symbolic conflicts.

Key Contributions

  • Proposes a Latent Spatio-Temporal CoT framework
  • Introduces a dual-feature alignment mechanism that distills geometric constraints from 3D foundation models and dynamic foresight from world models
  • Proposes a progressive SFT training strategy with GRPO reinforcement learning refinement

Methodology

Dual-feature alignment injects geometric constraints and dynamic predictions into the latent space; the model is then trained with progressive SFT followed by GRPO, enabling physically grounded spatio-temporal reasoning.
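The two training ingredients above can be illustrated with a minimal sketch. The function names, cosine-distance alignment objective, and group-wise reward standardization below are illustrative assumptions, not the paper's actual implementation: dual-feature alignment is shown as a cosine loss pulling latent CoT tokens toward teacher features (e.g. from a 3D foundation model), and GRPO's group-relative advantage is shown as within-group reward standardization.

```python
import numpy as np

def alignment_loss(latent_tokens, teacher_features):
    """Hypothetical dual-feature alignment term: mean cosine distance
    between latent CoT tokens and teacher features of the same shape."""
    a = latent_tokens / np.linalg.norm(latent_tokens, axis=-1, keepdims=True)
    b = teacher_features / np.linalg.norm(teacher_features, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within a group of
    sampled rollouts (reward minus group mean, divided by group std)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy usage: 4 latent tokens of dim 8, and a group of 4 rollout rewards.
rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 8))
geometry_feats = rng.normal(size=(4, 8))  # stand-in for 3D-model features
loss = alignment_loss(latent, geometry_feats)
adv = grpo_advantages([0.9, 0.5, 0.7, 0.3])
```

Standardizing within the rollout group means advantages sum to roughly zero, so the policy update favors trajectories that beat their own group's average rather than relying on an absolute reward scale or a learned critic.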

Original Abstract

While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. The model is further trained with a progressive SFT strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks.

Tags

Autonomous Driving · Vision-Language-Action Models · Latent-Space Reasoning · Spatio-Temporal Reasoning

arXiv Categories

cs.CV