AI Agents Relevance: 7/10

VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

Yuheng Lei, Zhixuan Liang, Hongyuan Zhang, Ping Luo
arXiv: 2603.04910v1 Published: 2026-03-05 Updated: 2026-03-05

AI Summary

VPWEM uses working and episodic memory to improve the performance of visuomotor policies on non-Markovian tasks.

Key Contributions

  • Proposes VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memory
  • Introduces a Transformer-based contextual memory compressor that recursively converts observations into episodic memory
  • VPWEM outperforms existing methods on memory-intensive tasks in MIKASA and MoMaRT

Methodology

VPWEM keeps a sliding window of recent observations as short-term working memory, uses a Transformer-based compressor to convert out-of-window observations into episodic memory, and trains the compressor jointly with the policy (see the sketch below).
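As a rough illustration (a minimal sketch, not the authors' released code; the module structure, single-layer depth, dimensions, and names below are all assumptions), the described compressor could be written in PyTorch as:

import torch
import torch.nn as nn

class ContextualMemoryCompressor(nn.Module):
    """Sketch: self-attention over cached summary tokens, then
    cross-attention over observation tokens that left the window."""

    def __init__(self, d_model: int, num_heads: int = 8, num_episodic: int = 16):
        super().__init__()
        # Learned queries refined into a fixed number of episodic tokens.
        self.episodic_queries = nn.Parameter(torch.randn(num_episodic, d_model))
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, prev_episodic, old_obs):
        # prev_episodic: (B, num_episodic, d_model) summaries from the last step
        # old_obs:       (B, T_out, d_model) observations evicted from the window
        x = prev_episodic + self.episodic_queries          # seed the new summaries
        # Self-attention over the cache of past summary tokens.
        x = self.norm1(x + self.self_attn(x, prev_episodic, prev_episodic)[0])
        # Cross-attention over the cache of historical observations.
        x = self.norm2(x + self.cross_attn(x, old_obs, old_obs)[0])
        return x   # new episodic memory, still (B, num_episodic, d_model)

Because the output always has num_episodic tokens, repeated application keeps the episodic memory at a fixed size regardless of episode length.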

Original Abstract

Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor uses self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short-term and episode-wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action (VLA) models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.
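To make the "nearly constant memory and computation per step" claim concrete, a hypothetical per-step control loop could maintain the fixed-length window and fold evicted tokens into the episodic memory. This reuses the ContextualMemoryCompressor sketch above; episode_stream and the stand-in policy are dummies, not the paper's diffusion policy:

import torch
from collections import deque

d_model, window_len, num_episodic = 256, 16, 16
compressor = ContextualMemoryCompressor(d_model)   # sketch from above
policy = lambda ctx: ctx.mean(dim=1)               # stand-in for the diffusion head

def episode_stream(steps=100):
    for _ in range(steps):
        yield torch.randn(1, 1, d_model)           # dummy observation token

working_memory = deque(maxlen=window_len)          # short-term working memory
episodic = torch.zeros(1, num_episodic, d_model)   # initial episodic memory

for obs in episode_stream():
    if len(working_memory) == window_len:
        evicted = working_memory[0].unsqueeze(1)   # oldest token leaves the window
        episodic = compressor(episodic, evicted)   # recursively fold it in
    working_memory.append(obs.squeeze(1))
    context = torch.cat([episodic, torch.stack(tuple(working_memory), dim=1)], dim=1)
    action = policy(context)   # context length bounded by num_episodic + window_len

Per step, the compressor only touches the evicted tokens and a fixed-size summary cache, so the policy's context never grows with episode length.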

Tags

Imitation Learning, Robotics, Memory, Visuomotor Policy, Transformer

arXiv Categories

cs.RO cs.AI cs.LG