LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
AI Summary
Proposes the LaST-VLA framework, which improves vision-language-action models for autonomous driving through latent spatio-temporal reasoning, addressing semantic-perceptual decoupling and perceptual-symbolic conflicts.
Key Contributions
- Proposes a Latent Spatio-Temporal CoT framework
- Introduces a dual-feature alignment mechanism that distills geometric constraints from 3D foundation models and dynamic foresight from world models
- Proposes a progressive SFT training strategy with GRPO reinforcement-learning refinement
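The GRPO refinement step can be illustrated with its core operation, the group-relative advantage: several candidate trajectories are sampled for the same scene, scored, and each score is normalized against the group's statistics. This is a minimal sketch of that standard GRPO computation; the reward values and group size are hypothetical, since the summary does not give the paper's actual reward terms beyond safety and rule compliance.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each candidate's reward against its sampled group (GRPO-style).

    rewards: scores for a group of trajectories sampled for one driving scene.
    Returns zero-mean, unit-variance advantages (up to eps smoothing).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: four candidate trajectories scored for one scene.
advs = group_relative_advantages([0.9, 0.7, 0.7, 0.5])
```

Because advantages are computed relative to the group rather than a learned value function, GRPO needs no critic, which keeps the RL refinement stage lightweight.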
Methodology
Dual-feature alignment injects geometric constraints and dynamic foresight into the latent space; the model is trained with progressive SFT followed by GRPO reinforcement learning, yielding physically grounded spatio-temporal reasoning.
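One plausible way to realize the dual-feature alignment described above is a distillation loss that pulls the latent CoT tokens toward two teacher embeddings at once: geometric features (from a 3D foundation model) and predictive features (from a world model). The sketch below uses cosine distance and equal weights; the loss form, weights, and feature shapes are assumptions for illustration, not the paper's confirmed implementation.

```python
import numpy as np

def dual_alignment_loss(latent, geo_feat, dyn_feat, w_geo=1.0, w_dyn=1.0):
    """Cosine-distance alignment of latent CoT tokens to two teachers.

    latent:   (N, D) latent reasoning tokens from the VLA model
    geo_feat: (N, D) projected features from a 3D foundation model (assumed)
    dyn_feat: (N, D) projected features from a world model (assumed)
    """
    def cos_dist(a, b):
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        # Mean over tokens of (1 - cosine similarity): 0 when aligned.
        return 1.0 - (a * b).sum(-1).mean()

    return w_geo * cos_dist(latent, geo_feat) + w_dyn * cos_dist(latent, dyn_feat)
```

During the first SFT stage this loss would dominate; the progressive schedule then shifts weight toward the trajectory-generation objective, matching the alignment-to-generation transition described in the abstract.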
Original Abstract
While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework that shifts the reasoning paradigm from discrete symbolic processing to a physically grounded Latent Spatio-Temporal CoT. Through a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. Coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance, LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatio-temporal reasoning on the SURDS and NuDynamics benchmarks.