AI Agents relevance: 8/10

MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation

Dekang Qi, Shuang Zeng, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Mu Xu
arXiv: 2602.05467v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

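Proposes MerNav, a framework built from memory, execute, and review modules, to improve the success rate and generalization of zero-shot object goal navigation.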

Main Contributions

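  • Proposes the Memory-Execute-Review (MerNav) framework
  • Validates the framework on four datasets, significantly raising success rate (SR) in the zero-shot setting (average absolute SR gains of 7% under training-free and 5% under zero-shot settings over all baselines)
  • On some datasets (MP3D and HM3D_OVON), surpasses supervised fine-tuning (SFT) methods, leading in both success rate and generalization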

Methodology

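A hierarchical memory module provides information support, an execute module makes routine decisions and takes actions, and a review module handles abnormal situations and corrects behavior.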
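The three-module loop can be illustrated with a toy controller. This is a minimal sketch under stated assumptions, not the authors' implementation: the room graph (`GRAPH`, `OBJECTS`), the two-level `HierarchicalMemory`, and the `execute`, `review`, and `navigate` functions are all hypothetical, and hand-written rules stand in for whatever models the paper's modules actually use.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical toy scene: rooms are graph nodes, each with the objects visible there.
GRAPH: Dict[str, List[str]] = {
    "hall": ["kitchen", "bedroom"],
    "kitchen": ["hall"],
    "bedroom": ["hall"],
}
OBJECTS: Dict[str, List[str]] = {
    "hall": ["sofa"],
    "kitchen": ["sink"],
    "bedroom": ["bed"],
}


@dataclass
class HierarchicalMemory:
    """Hypothetical two-level memory: a short recent-room window plus long-term visit counts."""
    recent: List[str] = field(default_factory=list)
    visits: Counter = field(default_factory=Counter)
    window: int = 4

    def update(self, room: str) -> None:
        # Short-term level keeps only the last `window` rooms.
        self.recent = (self.recent + [room])[-self.window:]
        # Long-term level accumulates how often each room has been visited.
        self.visits[room] += 1


def execute(room: str, goal: str) -> str:
    """Routine decision: stop if the goal is visible, otherwise take the first exit."""
    if goal in OBJECTS[room]:
        return "STOP"
    return GRAPH[room][0]


def review(room: str, proposed: str, memory: HierarchicalMemory) -> str:
    """Abnormality check: if the proposed move re-enters a recently seen room,
    redirect to the exit with the fewest long-term visits."""
    if proposed == "STOP" or proposed not in memory.recent:
        return proposed
    return min(GRAPH[room], key=lambda r: memory.visits[r])


def navigate(start: str, goal: str, max_steps: int = 10) -> Optional[List[str]]:
    """Run the memory -> execute -> review loop; return the path, or None on failure."""
    memory = HierarchicalMemory()
    room = start
    path = [room]
    for _ in range(max_steps):
        memory.update(room)
        action = review(room, execute(room, goal), memory)
        if action == "STOP":
            return path
        room = action
        path.append(room)
    return None  # step budget exhausted without reaching the goal


if __name__ == "__main__":
    print(navigate("hall", "bed"))
```

Here `navigate("hall", "bed")` returns `["hall", "kitchen", "hall", "bedroom"]`. The routine `execute` policy alone would bounce between hall and kitchen indefinitely; the `review` step detects the repeat through short-term memory and uses long-term visit counts to redirect toward the unexplored bedroom.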

Original Abstract

Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6%, under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.
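The abstract reports gains as absolute SR improvements. Assuming "average SR" means the mean of per-dataset success rates, an absolute improvement is a difference in percentage points, not a ratio. The sketch below makes that distinction concrete. The function names and the SR values are made up for illustration and are not the paper's results.

```python
from statistics import mean
from typing import List

# Hypothetical per-dataset SR values for four datasets (NOT the paper's numbers).
HYPOTHETICAL_OURS: List[float] = [0.60, 0.50, 0.40, 0.30]
HYPOTHETICAL_BASELINE: List[float] = [0.50, 0.45, 0.35, 0.30]


def success_rate(outcomes: List[int]) -> float:
    """SR for one dataset: fraction of episodes that ended in success (1) vs failure (0)."""
    return sum(outcomes) / len(outcomes)


def absolute_gain(ours: List[float], baseline: List[float]) -> float:
    """Difference of dataset-averaged SR, in percentage points."""
    return 100 * (mean(ours) - mean(baseline))


def relative_gain(ours: List[float], baseline: List[float]) -> float:
    """Same difference expressed relative to the baseline average, in percent."""
    return 100 * (mean(ours) - mean(baseline)) / mean(baseline)


if __name__ == "__main__":
    print(f"absolute: {absolute_gain(HYPOTHETICAL_OURS, HYPOTHETICAL_BASELINE):.1f} pp")
    print(f"relative: {relative_gain(HYPOTHETICAL_OURS, HYPOTHETICAL_BASELINE):.1f} %")
```

With these made-up values, the absolute gain is 5.0 percentage points and the relative gain is 12.5%. The same SR difference can therefore read very differently depending on which convention a figure uses.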

Tags

Visual Language Navigation Zero-Shot Learning Object Goal Navigation

arXiv Categories

cs.CV cs.CL cs.RO