Multimodal Learning · Relevance: 9/10

Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun
arXiv: 2602.10815v1 · Published: 2026-02-11 · Updated: 2026-02-11

AI Summary

The paper explains why RL generalizes better than SFT in VLM post-training and proposes a difficulty-curated SFT method (DC-SFT).

Key Contributions

  • Reveals the impact of training-data difficulty on VLM generalization performance
  • Proposes Difficulty-Curated SFT (DC-SFT), which improves OOD generalization
  • Shows that DC-SFT surpasses RL-based training in both efficiency and performance

Methodology

Systematically evaluates the OOD generalization of SFT models trained on datasets of varying difficulty levels, and, based on these findings, proposes the difficulty-filtered DC-SFT method.
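The core filtering idea can be sketched in a few lines. This is a hypothetical illustration, not the authors' released code: it assumes difficulty is estimated as the base model's failure rate over `k` rollouts per sample (a common proxy), and the threshold `max_difficulty` is purely illustrative.

```python
# Hypothetical sketch of difficulty-curated filtering (the idea behind DC-SFT):
# estimate each sample's difficulty from a base model's pass rate over k
# attempts, then drop the hardest samples, which the paper finds degrade
# OOD generalization. Field names and thresholds are assumptions.

def difficulty(pass_count: int, k: int) -> float:
    """Difficulty = failure rate of the base model over k attempts."""
    return 1.0 - pass_count / k

def curate(samples: list[dict], k: int = 8, max_difficulty: float = 0.75) -> list[dict]:
    """Keep only samples at or below the difficulty threshold."""
    return [s for s in samples if difficulty(s["pass_count"], k) <= max_difficulty]

# Toy example: pass counts over k = 8 rollouts.
data = [
    {"id": "easy",   "pass_count": 8},  # difficulty 0.0  -> kept
    {"id": "medium", "pass_count": 4},  # difficulty 0.5  -> kept
    {"id": "hard",   "pass_count": 0},  # difficulty 1.0  -> filtered out
]
kept = curate(data)
print([s["id"] for s in kept])  # -> ['easy', 'medium']
```

The curated subset would then be used as the SFT training set in place of the full dataset.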

Original Abstract

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.

Tags

VLM SFT Reinforcement Learning Generalization Data Difficulty

arXiv Categories

cs.CV cs.LG