Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling
AI Summary
Proposes a non-parametric memory framework that combines episodic and semantic memory, improving embodied agents' performance on exploration and question-answering tasks.
Main Contributions
- A non-parametric episodic and semantic memory framework
- A retrieval-first, reasoning-assisted episodic memory mechanism
- A semantic memory mechanism based on program-style rule extraction
Methodology
Episodic memories are retrieved by semantic similarity and verified through visual reasoning; experiences are further converted into structured semantic memory, enabling cross-environment generalization. The recall loop is sketched below.
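A minimal sketch of the retrieval-first, reasoning-assisted recall, assuming a hashed bag-of-words stand-in for the embedding model and an MLLM yes/no verification callback; `EpisodicMemory`, `embed`, and `mllm_verify` are hypothetical names for illustration, not the paper's actual interface.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: a real system would use a sentence encoder; a
    # hashed bag-of-words vector keeps this sketch self-contained.
    v = np.zeros(256)
    for tok in text.lower().split():
        v[hash(tok) % 256] += 1.0
    return v

class EpisodicMemory:
    """Stores (image, caption) observations for retrieval-first recall."""

    def __init__(self):
        self.entries = []  # list of (caption_embedding, image, caption)

    def add(self, image, caption: str):
        self.entries.append((embed(caption), image, caption))

    def recall(self, query: str, mllm_verify, k: int = 5):
        """Rank entries by cosine similarity to the query, then keep only
        those the MLLM confirms are relevant (reasoning-assisted step)."""
        q = embed(query)
        scored = sorted(
            self.entries,
            key=lambda e: float(np.dot(q, e[0]))
            / (np.linalg.norm(q) * np.linalg.norm(e[0]) + 1e-9),
            reverse=True,
        )
        # Verification replaces rigid geometric alignment: the MLLM judges
        # whether each candidate observation actually supports the query.
        return [(img, cap) for _, img, cap in scored[:k] if mllm_verify(img, query)]
```

Verification is what makes reuse robust here: similarity alone can surface stale or misleading frames, and the MLLM check filters those out before they enter the context window.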
Original Abstract
Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match×SPL on A-EQA, as well as a +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens the complex reasoning of embodied agents.
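For the program-style rule extraction, one way to picture it is distilling an episode into an IF/THEN rule via an LLM call and storing it as a structured record. The `Rule` fields, the prompt, and the parsing format below are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """A structured, reusable rule distilled from past experience."""
    condition: str  # e.g., "the target is a kitchen appliance"
    action: str     # e.g., "explore the kitchen first"
    rationale: str  # why the rule held in past episodes

# Hypothetical prompt; the paper's actual rule format is not specified here.
RULE_PROMPT = (
    "Summarize the trajectory below as one rule, formatted exactly as:\n"
    "IF <condition> THEN <action> BECAUSE <rationale>\n\n"
)

def extract_rule(trajectory_summary: str, llm_call) -> Rule:
    """Convert one episode's experience into a program-style rule."""
    reply = llm_call(RULE_PROMPT + trajectory_summary).strip()
    # Naive parse; assumes the LLM followed the requested format.
    condition, rest = reply.split(" THEN ", 1)
    action, rationale = rest.split(" BECAUSE ", 1)
    return Rule(condition.removeprefix("IF ").strip(),
                action.strip(), rationale.strip())
```

Because rules are stored as structured condition-action records rather than free text, they can be matched against new tasks in unseen environments, which is the cross-environment generalization the abstract refers to.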