Multimodal Learning (Relevance: 9/10)

Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu
arXiv: 2603.17541v1 Published: 2026-03-18 Updated: 2026-03-18

AI Summary

Video fine-tuning improves video understanding, but it may sacrifice static image understanding, revealing a trade-off between spatial and temporal understanding.

Key Contributions

  • Systematically studied how video fine-tuning reshapes the spatial and temporal understanding capabilities of MLLMs
  • Identified a spatial-temporal trade-off in video fine-tuning: gains in video performance may come at the cost of static image performance
  • Proposed an instruction-aware Hybrid-Frame strategy that partially mitigates the image-video trade-off

Methodology

Experiments across different architectures, parameter scales, and frame sampling settings are used to observe how video fine-tuning affects the visual capabilities of MLLMs.

Original Abstract

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
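The core idea behind the instruction-aware Hybrid-Frame strategy is to spend the temporal budget only where it helps: give temporally phrased questions more sampled frames and spatially phrased ones fewer. The paper does not publish code, so the sketch below is purely illustrative; the keyword cues, frame budgets, and uniform sampling are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an instruction-aware frame allocator.
# The cue list, budgets (4 vs. 32 frames), and uniform sampling
# are illustrative assumptions, not the paper's actual method.

TEMPORAL_CUES = {"before", "after", "order", "sequence", "first", "then",
                 "how long", "while", "motion"}

def allocate_frames(instruction: str, min_frames: int = 4,
                    max_frames: int = 32) -> int:
    """Return a frame budget: the full budget for temporally phrased
    instructions, the minimum for spatially phrased ones."""
    text = instruction.lower()
    hits = sum(cue in text for cue in TEMPORAL_CUES)
    return max_frames if hits > 0 else min_frames

def sample_indices(total_frames: int, budget: int) -> list[int]:
    """Uniformly sample `budget` frame indices from a video."""
    budget = min(budget, total_frames)
    step = total_frames / budget
    return [int(i * step) for i in range(budget)]

# A temporal question gets the full budget; a spatial one gets the minimum.
print(allocate_frames("What happens after the man opens the door?"))  # 32
print(allocate_frames("What color is the car?"))                      # 4
print(sample_indices(total_frames=120, budget=4))                     # [0, 30, 60, 90]
```

In this reading, the strategy preserves spatial understanding by keeping frame counts low (closer to the single-image regime) for image-like questions, while still granting long temporal context where the instruction demands it.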

Tags

MLLM Video-SFT Multimodal Learning Spatial Understanding Temporal Understanding

arXiv Category

cs.CV