LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
AI Summary
Proposes the LaST-VLA framework, which improves vision-language-action models for autonomous driving through latent spatio-temporal reasoning, addressing semantic-perceptual decoupling and perceptual-symbolic conflicts.
Key Contributions
- Proposes a Latent Spatio-Temporal CoT framework
- Introduces a dual-feature alignment mechanism that distills geometric constraints from 3D foundation models and dynamic foresight from world models
- Proposes a progressive SFT training strategy with GRPO reinforcement-learning refinement
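The GRPO refinement step can be illustrated with its core operation, the group-relative advantage: several candidate trajectories are sampled for the same scene, scored, and each score is normalized against the group's statistics. This is a minimal sketch of that standard GRPO computation; the reward values and group size are hypothetical, since the summary does not give the paper's actual reward terms beyond safety and rule compliance.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each candidate's reward against its sampled group (GRPO-style).

    rewards: scores for a group of trajectories sampled for one driving scene.
    Returns zero-mean, unit-variance advantages (up to eps smoothing).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: four candidate trajectories scored for one scene.
advs = group_relative_advantages([0.9, 0.7, 0.7, 0.5])
```

Because advantages are computed relative to the group rather than a learned value function, GRPO needs no critic, which keeps the RL refinement stage lightweight.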
Methodology
Dual-feature alignment injects geometric constraints and dynamic foresight into the latent space; the model is trained with progressive SFT followed by GRPO reinforcement learning, yielding physically grounded spatio-temporal reasoning.
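One plausible way to realize the dual-feature alignment described above is a distillation loss that pulls the latent CoT tokens toward two teacher embeddings at once: geometric features (from a 3D foundation model) and predictive features (from a world model). The sketch below uses cosine distance and equal weights; the loss form, weights, and feature shapes are assumptions for illustration, not the paper's confirmed implementation.

```python
import numpy as np

def dual_alignment_loss(latent, geo_feat, dyn_feat, w_geo=1.0, w_dyn=1.0):
    """Cosine-distance alignment of latent CoT tokens to two teachers.

    latent:   (N, D) latent reasoning tokens from the VLA model
    geo_feat: (N, D) projected features from a 3D foundation model (assumed)
    dyn_feat: (N, D) projected features from a world model (assumed)
    """
    def cos_dist(a, b):
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        # Mean over tokens of (1 - cosine similarity): 0 when aligned.
        return 1.0 - (a * b).sum(-1).mean()

    return w_geo * cos_dist(latent, geo_feat) + w_dyn * cos_dist(latent, dyn_feat)
```

During the first SFT stage this loss would dominate; the progressive schedule then shifts weight toward the trajectory-generation objective, matching the alignment-to-generation transition described in the abstract.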
Original Abstract
While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework that shifts the reasoning paradigm from discrete symbolic processing to a physically grounded Latent Spatio-Temporal CoT. Through a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. Coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance, LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatio-temporal reasoning on the SURDS and NuDynamics benchmarks.