Multimodal Learning Relevance: 9/10

SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving

Zihan You, Hongwei Liu, Chenxu Dang, Zhe Wang, Sining Ang, Aoqi Wang, Yan Wang
arXiv: 2603.08113v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

Proposes SAMoE-VLA, a scene-adaptive mixture-of-experts VLA model, to improve the stability and safety of autonomous-driving decision-making.

Key Contributions

  • Proposes a scene-adaptive mixture-of-experts mechanism that performs expert selection based on BEV features
  • Introduces a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history
  • Achieves SOTA performance on the nuScenes and LangAuto benchmarks

Methodology

Uses BEV features as the MoE routing signal and designs a Conditional Cross-Modal Causal Attention mechanism, enabling scene-aware and temporally consistent reasoning.
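The scene-level routing idea described above can be sketched as follows. This is not the authors' released code (which is not yet available); it is a minimal PyTorch illustration assuming a pooled BEV descriptor per scene, a small router MLP, and simple linear experts. The class and parameter names (`SceneAdaptiveMoE`, `bev_dim`, `num_experts`) are hypothetical. The key contrast with token-level MoE is that one routing weight vector is computed per scene and shared across all tokens:

```python
import torch
import torch.nn as nn

class SceneAdaptiveMoE(nn.Module):
    """Minimal sketch of scene-conditioned MoE routing: expert weights
    are derived from a pooled BEV scene descriptor (one routing decision
    per scene), not from per-token embeddings as in LLM-style MoE."""

    def __init__(self, bev_dim: int, hidden_dim: int, num_experts: int):
        super().__init__()
        # Router maps the BEV scene descriptor to per-expert logits
        self.router = nn.Sequential(
            nn.Linear(bev_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_experts),
        )
        # Toy experts; the real model would use larger expert subnetworks
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
        )

    def forward(self, tokens: torch.Tensor, bev_feat: torch.Tensor) -> torch.Tensor:
        # tokens:   (B, T, hidden_dim)  token sequence fed to the experts
        # bev_feat: (B, bev_dim)        pooled BEV scene descriptor
        weights = torch.softmax(self.router(bev_feat), dim=-1)        # (B, E)
        expert_outs = torch.stack(
            [expert(tokens) for expert in self.experts], dim=1
        )                                                             # (B, E, T, D)
        # The same scene-level weight vector merges experts for every token
        return torch.einsum("be,betd->btd", weights, expert_outs)     # (B, T, D)
```

Because the routing signal is a single scene embedding, expert specialization aligns with driving scenarios (e.g. intersections vs. highways) rather than with individual tokens, which is the misalignment the paper identifies in token-level MoE.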

Original Abstract

Recent advances in Vision-Language-Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models (LLMs). However, our empirical analysis reveals that directly applying existing token-level MoE mechanisms -- which are inherited from LLM architectures -- to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token-based expert specialization and scene-level decision-making. To address this, we propose SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's-eye-view (BEV) features that encapsulate traffic scene context, enabling scenario-dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world knowledge, perception, language, and action, we introduce a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open-loop planning dataset and the LangAuto closed-loop benchmark demonstrate that SAMoE-VLA achieves state-of-the-art performance, outperforming prior VLA-based and world-model-based approaches with fewer parameters. Our code will be released soon.

Tags

Autonomous Driving  Multimodal Learning  Mixture-of-Experts  Scene Understanding  Causal Reasoning

arXiv Categories

cs.CV