Multimodal Learning relevance: 8/10

Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai
arXiv: 2603.25716v1 Published: 2026-03-26 Updated: 2026-03-26

AI Summary

Addresses the problem of dynamic-subject occlusion in video world models by proposing a hybrid memory mechanism and a new dataset, enabling better modeling of dynamic subjects.

Key Contributions

  • Proposes a Hybrid Memory mechanism that distinguishes static backgrounds from dynamic subjects
  • Constructs the HM-World dataset for evaluating hybrid-memory models
  • Proposes the HyDRA model, which retrieves memory via spatiotemporal relevance

Methodology

Introduces the Hybrid Memory mechanism and designs the HyDRA model, which compresses memory into tokens and retrieves them via spatiotemporal relevance, preserving the consistency of dynamic subjects.
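The two-stage idea of compressing history into memory tokens and then retrieving the most relevant ones can be sketched as follows. This is a minimal illustration under assumed simplifications, not the paper's implementation: all function names are hypothetical, compression is plain average-pooling over frame chunks, and relevance is a dot-product score instead of a learned spatiotemporal attention module.

```python
import numpy as np

# Hypothetical sketch (not HyDRA's actual architecture): a memory bank of
# compressed tokens, queried by the current frame's feature vector.

def compress_to_tokens(frames: np.ndarray, n_tokens: int) -> np.ndarray:
    """Compress a (T, D) sequence of frame features into (n_tokens, D)
    memory tokens by average-pooling over consecutive chunks."""
    chunks = np.array_split(frames, n_tokens, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

def retrieve(memory: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Score each memory token against the query (dot-product relevance),
    then return the k highest-scoring tokens, best first."""
    scores = memory @ query                # (n_tokens,)
    top_k = np.argsort(scores)[-k:][::-1]  # indices of the k best tokens
    return memory[top_k]

rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 16))       # 64 frames of 16-dim features
memory = compress_to_tokens(frames, n_tokens=8)
query = frames[-1]                       # current frame acts as the query
selected = retrieve(memory, query, k=2)
print(memory.shape, selected.shape)      # (8, 16) (2, 16)
```

Retrieving only the top-k tokens mirrors the abstract's "selectively attending to relevant motion cues": the model conditions on a small, relevant subset of memory rather than the full history, which is what lets a hidden subject's motion state survive out-of-view intervals.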

Original Abstract

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

Tags

video world models, memory mechanisms, dynamic subject modeling, datasets, video generation

arXiv Categories

cs.CV cs.AI