AI Agents Relevance: 9/10

Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation

Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao
arXiv: 2602.15724v1 Published: 2026-02-17 Updated: 2026-02-17

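AI Summary

Proposes a retrieval-augmented framework that improves the efficiency and stability of LLM-based vision-and-language navigation without fine-tuning the LLM.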

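Key Contributions

  • Introduces episode-level instruction retrieval, which supplies task-specific priors
  • Introduces step-level candidate retrieval, which reduces action ambiguity
  • Shows experimentally on R2R that retrieval augmentation improves navigation performance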
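The episode-level retrieval in the first contribution can be illustrated with a minimal sketch. It assumes instructions have already been embedded as fixed-length vectors and that a memory of successful trajectories is available; the names `cosine`, `retrieve_exemplars`, and `memory`, and the toy 3-dimensional vectors, are hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_exemplars(query_vec, memory, k=2):
    """Return the k stored trajectories whose instruction embedding is
    most similar to the query instruction embedding.

    memory: list of (instruction_embedding, trajectory_text) pairs taken
    from successful episodes (hypothetical storage format).
    """
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [trajectory for _, trajectory in ranked[:k]]

# Toy memory; the embeddings are made up for illustration.
memory = [
    ([1.0, 0.0, 0.0], "exit bedroom -> hallway -> stairs"),
    ([0.9, 0.1, 0.0], "exit bedroom -> hallway -> kitchen"),
    ([0.0, 0.0, 1.0], "enter garage -> stop by car"),
]
exemplars = retrieve_exemplars([1.0, 0.05, 0.0], memory, k=2)
```

The retrieved trajectories would then be placed in the LLM prompt as in-context exemplars; the choice of embedding model and of `k` is left open here.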

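Methodology

Builds a two-level retrieval module that operates at the episode level and at the step level, supplying the LLM with in-context information and filtering its candidate actions.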
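The step-level filtering described above might look roughly like the following sketch. The paper describes an imitation-learned candidate retriever; here a simple weighted feature interaction stands in for it, with the weights assumed to be already learned. All names (`score`, `prune_candidates`, `build_prompt`), the feature vectors, and the prompt layout are illustrative assumptions.

```python
def score(candidate_feat, instr_feat, weights):
    """Stand-in relevance score: weighted elementwise interaction between
    candidate and instruction features (NOT the paper's learned retriever)."""
    return sum(w * c * i for w, c, i in zip(weights, candidate_feat, instr_feat))

def prune_candidates(candidates, instr_feat, weights, keep=2):
    """Keep only the `keep` highest-scoring navigable candidates."""
    ranked = sorted(
        candidates,
        key=lambda c: score(c["feat"], instr_feat, weights),
        reverse=True,
    )
    return ranked[:keep]

def build_prompt(instruction, exemplars, kept):
    """Assemble a prompt from retrieved exemplars and pruned candidates
    (hypothetical layout)."""
    lines = ["Instruction: " + instruction, "Examples of successful routes:"]
    lines += ["- " + e for e in exemplars]
    lines.append("Choose one of the following directions:")
    lines += ["{}: {}".format(c["id"], c["desc"]) for c in kept]
    return "\n".join(lines)

# Toy inputs; features and weights are made up for illustration.
candidates = [
    {"id": "A", "desc": "doorway to hallway", "feat": [0.9, 0.1]},
    {"id": "B", "desc": "window", "feat": [0.1, 0.2]},
    {"id": "C", "desc": "stairs going down", "feat": [0.6, 0.8]},
]
kept = prune_candidates(candidates, instr_feat=[1.0, 0.5], weights=[1.0, 1.0], keep=2)
prompt = build_prompt(
    "Leave the bedroom and go down the stairs.",
    ["exit bedroom -> hallway -> stairs"],
    kept,
)
```

Because pruning happens before the LLM call, the prompt lists only the retained directions, which is how this kind of filtering shortens the prompt and narrows the action choice.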

Original Abstract

Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits to global guidance and step-wise decision efficiency. These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
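For readers unfamiliar with the three metrics named in the abstract, the sketch below computes them using their standard definitions: an episode succeeds if the agent stops within a distance threshold of the goal (3 m is the conventional R2R value), oracle success counts any point along the path, and SPL weights success by the ratio of shortest-path length to the longer of the actual and shortest path lengths. The episode dictionary fields and the numbers are hypothetical, not results from the paper.

```python
def evaluate(episodes, threshold=3.0):
    """Compute Success Rate, Oracle Success Rate, and SPL.

    Each episode is a dict with (hypothetical field names):
      final_dist    distance to goal where the agent stopped (m)
      path_dists    distance to goal at every visited point (m)
      path_len      length of the path actually taken (m)
      shortest_len  shortest-path length from start to goal (m), assumed > 0
    """
    n = len(episodes)
    sr = sum(e["final_dist"] <= threshold for e in episodes) / n
    osr = sum(min(e["path_dists"]) <= threshold for e in episodes) / n
    spl = sum(
        (e["final_dist"] <= threshold)
        * e["shortest_len"] / max(e["path_len"], e["shortest_len"])
        for e in episodes
    ) / n
    return {"SR": sr, "OSR": osr, "SPL": spl}

# Two made-up episodes: one success, one that passed near the goal but overshot.
episodes = [
    {"final_dist": 1.0, "path_dists": [8.0, 4.0, 1.0], "path_len": 12.5, "shortest_len": 10.0},
    {"final_dist": 5.0, "path_dists": [9.0, 2.5, 5.0], "path_len": 10.0, "shortest_len": 8.0},
]
metrics = evaluate(episodes)
```

In this toy case the second episode counts toward Oracle Success Rate but not toward Success Rate or SPL, which shows why the three metrics can move independently.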

Tags

Vision-and-Language Navigation, LLM, Retrieval Augmentation, Agent

arXiv Categories

cs.CV cs.AI