Multimodal Learning Relevance: 9/10

SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi
arXiv: 2603.25733v1 Published: 2026-03-26 Updated: 2026-03-26

AI Summary

Proposes SlotVTG, a lightweight slot adapter that improves the generalization of MLLMs on video temporal grounding tasks.

Key Contributions

  • Proposes the SlotVTG framework, which uses slot attention for object-centric visual reasoning
  • Introduces objectness priors that encourage semantically coherent slot formation
  • Significantly improves OOD generalization while maintaining ID performance

Methodology

A lightweight slot adapter decomposes visual tokens into abstract slots and then reconstructs the original token sequence, yielding object-centric visual representations.
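The decompose-then-reconstruct step can be sketched as follows. This is a minimal NumPy illustration of generic slot attention, not the paper's implementation: it omits the learned projections, GRU update, MLP, layer norms, and objectness priors, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(tokens, num_slots=4, iters=3, seed=0):
    """Simplified slot attention: tokens (N, D) -> slots (K, D).
    Slots compete for tokens via softmax, then update as weighted means."""
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    slots = rng.normal(size=(num_slots, d))          # random slot init
    for _ in range(iters):
        logits = tokens @ slots.T / np.sqrt(d)       # (N, K) similarity
        attn = softmax(logits, axis=1)               # slots compete per token
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ tokens                   # weighted-mean update
    return slots, attn

def reconstruct(slots, attn):
    """Rebuild the token sequence as an attention-weighted mix of slots."""
    return attn @ slots

tokens = np.random.default_rng(1).normal(size=(16, 8))
slots, attn = slot_attention(tokens)
recon = reconstruct(slots, attn)
print(slots.shape, recon.shape)  # (4, 8) (16, 8)
```

The reconstruction keeps the sequence length and dimensionality unchanged, which is what lets such an adapter slot into a frozen MLLM's visual token stream without altering the rest of the pipeline.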

原文摘要

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.

Tags

Video Temporal Grounding  Multimodal Learning  Object-Centric Learning  Visual Reasoning  Generalization

arXiv Categories

cs.CV