Multimodal Learning · Relevance: 9/10

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

arXiv: 2603.09715v1 发布: 2026-03-10 更新: 2026-03-10

AI Summary

Proposes CVS, a training-free data selection method for vision-language SFT that scores samples by how much the question changes the assessed validity of the answer.

Key Contributions

  • Proposes CVS, a training-free data selection method
  • Evaluates sample quality via the question's influence on answer validity
  • Validates CVS on the Vision-Flan and The Cauldron datasets

Methodology

A frozen VLLM evaluates each image-answer pair's validity with and without conditioning on the question; the discrepancy between the two scores is used to select samples that genuinely require vision-language joint reasoning.
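The selection logic above can be sketched as follows. This is a hypothetical illustration, not the authors' code: in the paper the validity scores come from a frozen VLLM evaluator, whereas here they are supplied as plain numbers so the ranking-and-filtering step is self-contained.

```python
# Hypothetical sketch of CVS-style selection: score each sample by the
# discrepancy in answer validity with vs. without the question, then
# keep only the highest-scoring fraction of the data.

def cvs_score(validity_with_q: float, validity_without_q: float) -> float:
    """Discrepancy in answer validity with vs. without the question."""
    return validity_with_q - validity_without_q

def select_by_cvs(samples: list[dict], fraction: float = 0.5) -> list[dict]:
    """Rank samples by CVS score and keep the top `fraction` of them."""
    ranked = sorted(
        samples,
        key=lambda s: cvs_score(s["with_q"], s["without_q"]),
        reverse=True,
    )
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

samples = [
    # High discrepancy: the answer only looks valid once the question is
    # seen, so the sample demands vision-language joint reasoning -> keep.
    {"id": "a", "with_q": 0.9, "without_q": 0.2},
    # Zero discrepancy: the answer is already "valid" from the image alone
    # (a linguistic or common-sense shortcut) -> filtered out.
    {"id": "b", "with_q": 0.8, "without_q": 0.8},
    {"id": "c", "with_q": 0.7, "without_q": 0.3},
    {"id": "d", "with_q": 0.4, "without_q": 0.4},
]

selected = select_by_cvs(samples, fraction=0.5)
print([s["id"] for s in selected])  # → ['a', 'c']
```

In practice the validity numbers would be, e.g., the frozen evaluator's likelihood of the answer given (image, question) versus (image alone); the field names `with_q`/`without_q` are placeholders for illustration.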

Original Abstract

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.

Tags

Vision-Language · Data Selection · Training-Free · Instruction Tuning

arXiv Category

cs.AI