Multimodal Learning Relevance: 9/10

Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang

arXiv: 2603.09740v1 Published: 2026-03-10 Updated: 2026-03-10


AI Summary

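The SACA framework uses step-aware contrastive alignment to extract dense supervision from imperfect trajectories, improving performance on VLN-CE tasks.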

Main Contributions

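Proposes the Step-Aware Contrastive Alignment (SACA) framework
Designs a Perception-Grounded Step-Aware auditor that evaluates progress at each step
Introduces a Scenario-Conditioned Group Construction mechanism that dynamically routes batches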

Methodology

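SACA uses the step-aware auditor to identify the valid prefix and the divergence point in each trajectory, then performs contrastive learning and optimization based on these signals.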
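
As a rough illustration of the prefix/divergence idea, the sketch below splits a trajectory using per-step verdicts. It is not the paper's implementation: the boolean `step_ok` verdicts stand in for whatever the auditor outputs, and the function names and reward values (`prefix_reward`, `divergence_penalty`) are hypothetical choices made only for this example.

```python
from typing import List, Optional, Tuple


def split_at_divergence(step_ok: List[bool]) -> Tuple[int, Optional[int]]:
    """Return (valid_prefix_length, divergence_index).

    `step_ok[i]` is a hypothetical per-step verdict: True if step i
    still made progress, False otherwise. The divergence index is the
    first failing step, or None if every step passed.
    """
    for i, ok in enumerate(step_ok):
        if not ok:
            return i, i
    return len(step_ok), None


def step_rewards(
    step_ok: List[bool],
    prefix_reward: float = 1.0,        # assumed value, for illustration
    divergence_penalty: float = -1.0,  # assumed value, for illustration
) -> List[float]:
    """Turn per-step verdicts into a dense per-step signal.

    Steps in the valid prefix get `prefix_reward`, the divergence step
    gets `divergence_penalty`, and later steps get 0.0.
    """
    prefix_len, divergence = split_at_divergence(step_ok)
    rewards = [0.0] * len(step_ok)
    for i in range(prefix_len):
        rewards[i] = prefix_reward
    if divergence is not None:
        rewards[divergence] = divergence_penalty
    return rewards


if __name__ == "__main__":
    verdicts = [True, True, False, True]
    print(split_at_divergence(verdicts))
    print(step_rewards(verdicts))
```

Under these assumptions, a failed trajectory still yields a usable signal: the steps before the divergence point are credited rather than discarded along with the failure.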

Original Abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery and training stability. Specifically, (i) policies derived from SFT suffer from compounding errors, struggling to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods e.g. GRPO are bottlenecked by sparse outcome rewards. Their binary feedback fails to assign credit to individual steps, leading to gradient signal collapse in failure dominant batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, the Perception-Grounded Step-Aware auditor evaluates progress step-by-step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.
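
The gradient signal collapse mentioned in the abstract can be seen in a small sketch of group-relative advantage normalization, the general scheme GRPO-style methods use. The function name, the use of population standard deviation, and the `eps` value are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its group's mean and std deviation.

    `eps` (an assumed small constant) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A failure-dominant batch: every rollout gets the same binary reward 0,
    # so every advantage is zero and the batch contributes no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))

    # A mixed batch: the single success gets a positive advantage and the
    # failures get negative ones.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

When every rollout in a group fails, the rewards are identical and all advantages vanish, which is why outcome-only binary feedback gives no per-step credit in such batches.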

arXiv Categories

cs.RO cs.CV
