AI Agents Relevance: 9/10

Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling

Ji Li, Jing Xia, Mingyi Li, Shiyan Hu
arXiv: 2602.15513v1 Published: 2026-02-17 Updated: 2026-02-17

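AI Summary

Proposes a non-parametric memory framework that combines episodic and semantic memory to improve embodied agents' performance on exploration and question-answering tasks.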

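Key Contributions

  • Proposes a non-parametric framework with episodic and semantic memory
  • A retrieval-first, reasoning-assisted episodic memory mechanism
  • A semantic memory mechanism based on program-style rule extraction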

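Methodology

Episodic memories are retrieved by semantic similarity and then verified through visual reasoning; experiences are converted into structured semantic memory to enable cross-environment generalization.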
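
The retrieve-then-verify pattern for episodic memory can be illustrated with a minimal sketch. This is not the paper's implementation: the `Episode` record, the `recall` function, the toy 2-D embeddings, and the `verify` callback (a stand-in for the visual reasoning step an MLLM would perform) are all assumptions made for illustration. It ranks stored episodes by cosine similarity to a query and keeps only the top candidates that pass verification.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Episode:
    # Hypothetical record: a semantic embedding plus a text handle
    # standing in for the stored visual observation.
    embedding: Sequence[float]
    note: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(
    query: Sequence[float],
    memory: List[Episode],
    verify: Callable[[Episode], bool],
    k: int = 2,
) -> List[Episode]:
    """Retrieval first: take the k most similar episodes.
    Reasoning second: keep only those the verifier accepts."""
    ranked = sorted(memory, key=lambda e: cosine(query, e.embedding), reverse=True)
    return [e for e in ranked[:k] if verify(e)]


# Toy memory with made-up 2-D embeddings (assumption, for illustration only).
memory = [
    Episode([1.0, 0.0], "kitchen: red mug on counter"),
    Episode([0.9, 0.1], "kitchen: empty counter"),
    Episode([0.0, 1.0], "bedroom: lamp on desk"),
]

# Stand-in verifier; in practice this would be a visual reasoning call.
hits = recall([1.0, 0.0], memory, verify=lambda e: "mug" in e.note, k=2)
print([e.note for e in hits])
```

Separating the cheap similarity ranking from the more expensive verification step means the verifier only runs on a handful of candidates, which fits the limited context budgets the abstract mentions.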

Original Abstract

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match × SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.

Tags

embodied AI memory multimodal question answering exploration

arXiv Categories

cs.RO cs.AI