Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
AI Summary
Video fine-tuning improves video understanding, but may sacrifice static-image understanding, revealing a trade-off between spatial and temporal understanding.
Main Contributions
- Systematically studies how video fine-tuning affects the spatial and temporal understanding capabilities of MLLMs
- Identifies a spatio-temporal trade-off in video fine-tuning: gains in video performance can come at the cost of static-image performance
- Proposes an instruction-aware Hybrid-Frame strategy that mitigates the image-video trade-off
Methodology
Experiments across different architectures, parameter scales, and frame sampling settings observe how video fine-tuning reshapes the visual capabilities of MLLMs.
Original Abstract
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
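The abstract describes the Hybrid-Frame strategy only at a high level: it inspects the instruction and adaptively allocates a frame budget. A minimal sketch of that idea is shown below, assuming a simple keyword heuristic to decide whether a query needs temporal reasoning; the cue list, budget values, and function name are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an instruction-aware frame allocator.
# The keyword set and the frame budgets are assumptions for illustration,
# not the paper's method.

TEMPORAL_CUES = {"before", "after", "while", "order", "sequence",
                 "first", "then", "change", "motion"}

def allocate_frames(instruction: str,
                    spatial_budget: int = 4,
                    temporal_budget: int = 32) -> int:
    """Return a frame count based on whether the instruction
    appears to require temporal reasoning."""
    tokens = set(instruction.lower().split())
    if tokens & TEMPORAL_CUES:
        # Temporal queries get dense sampling to capture event order.
        return temporal_budget
    # Spatial queries get sparse sampling, preserving per-frame detail.
    return spatial_budget

print(allocate_frames("What happens after the dog jumps?"))  # → 32
print(allocate_frames("What color is the car?"))             # → 4
```

The design intuition follows the abstract's finding: a larger temporal budget helps video tasks but does not reliably help static-image performance, so spending fewer frames on spatially oriented instructions can limit the trade-off.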