Multimodal Learning (Relevance: 9/10)

Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Yuhao Dong, Shulin Tian, Shuai Liu, Shuangrui Ding, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Ziwei Liu
arXiv: 2602.08439v1 Published: 2026-02-09 Updated: 2026-02-09

AI Summary

Proposes the Demo-driven Video In-Context Learning task and the Demo-ICL-Bench benchmark to evaluate MLLMs' ability to learn from video demonstrations, together with the Demo-ICL model.

Key Contributions

  • Defined the Demo-driven Video In-Context Learning task
  • Constructed the Demo-ICL-Bench benchmark dataset
  • Proposed the Demo-ICL model, trained with a two-stage strategy

Methodology

The Demo-ICL model strengthens an MLLM's ability to learn from in-context examples through a two-stage training strategy: video-supervised fine-tuning followed by information-assisted direct preference optimization.
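The benchmark pairs a target-video question with a demonstration derived from an instructional video (e.g., summarized subtitles as a text demonstration). A minimal sketch of how such a demo-driven prompt might be assembled is shown below; all function and variable names are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of assembling a demo-driven in-context prompt:
# a text demonstration (summarized subtitles of an instructional video)
# precedes a question about the target video. Names are illustrative only.

def build_demo_icl_prompt(demo_summary: str, target_question: str) -> str:
    """Concatenate a text demonstration with a target-video question."""
    return (
        "Demonstration (summarized subtitles of an instructional video):\n"
        f"{demo_summary}\n\n"
        "Now answer the following question about the target video:\n"
        f"Question: {target_question}\n"
        "Answer:"
    )

prompt = build_demo_icl_prompt(
    demo_summary="Step 1: whisk the eggs. Step 2: heat the pan. Step 3: pour and fold.",
    target_question="What should be done before pouring the eggs into the pan?",
)
print(prompt)
```

In the video-demonstration setting described in the abstract, the text summary would be replaced by frames or tokens from the demonstration video, but the demonstration-then-question ordering stays the same.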

Original Abstract

Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.

Tags

Video Understanding · In-Context Learning · Multimodal Learning

arXiv Category

cs.CV