Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
AI Summary
Introduces MM-Lifelong, a dataset for Multimodal Lifelong Understanding, together with the Recursive Multimodal Agent (ReMA), addressing the Working Memory Bottleneck of existing models and the Global Localization Collapse of agentic baselines.
Main Contributions
- Constructed MM-Lifelong, a large-scale dataset for multimodal lifelong understanding
- Proposed the Recursive Multimodal Agent (ReMA), which effectively mitigates the Working Memory Bottleneck and Global Localization Collapse
- Established dataset splits designed to isolate temporal and domain biases
Methodology
ReMA employs dynamic memory management, iteratively updating a recursive belief state to handle multimodal information over long time spans, which substantially improves performance over existing methods.
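The recursive belief-state update described above can be sketched as a fold over the timeline: each incoming segment is absorbed into a bounded memory instead of being appended to an ever-growing context. The sketch below is a toy illustration under assumed interfaces; `BeliefState`, `update_belief`, and the keyword-matching retrieval are hypothetical stand-ins, not the authors' actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    # Compressed running summary of everything seen so far (bounded size,
    # unlike the raw context of an end-to-end MLLM).
    summary: str = ""
    # Segments judged relevant to the current query.
    evidence: list = field(default_factory=list)

def update_belief(state: BeliefState, segment: str, query: str) -> BeliefState:
    """Fold one new segment into the belief state, keeping memory bounded."""
    if query.lower() in segment.lower():  # toy relevance check
        state.evidence.append(segment)
    # Compression stand-in: retain only a fixed-size tail of the history.
    state.summary = (state.summary + " | " + segment)[-200:]
    return state

def answer_query(segments: list, query: str) -> list:
    state = BeliefState()
    for seg in segments:  # iterate over the long, sparse timeline
        state = update_belief(state, seg, query)
    return state.evidence

# Toy run over a three-"day" timeline
segments = ["day1: cooking pasta", "day2: walking the dog", "day3: cooking soup"]
print(answer_query(segments, "cooking"))
# → ['day1: cooking pasta', 'day3: cooking soup']
```

The point of the recursion is that memory cost stays constant in the timeline length: only the compressed summary and retained evidence persist between steps, which is what lets the agent navigate month-scale footage without context saturation.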
Original Abstract
While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.