Multimodal Learning · Relevance: 9/10

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz
arXiv: 2603.18656v1 · Published: 2026-03-19 · Updated: 2026-03-19

AI Summary

Proposes the SCALe loss function, which uses dynamic weighting to address token imbalance in CoT training for VLMs, improving both reasoning accuracy and training efficiency.

Main Contributions

  • Proposes the SCALe loss function, which dynamically adjusts the weights of the reasoning and answer segments
  • Substantially reduces training time, improving efficiency
  • SCALe improves performance both as a standalone method and as a foundation for GRPO

Methodology

SCALe-SFT uses a cosine scheduling policy to gradually shift the training focus from the <think> segment to the <answer> segment, encouraging concise reasoning.
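The cosine schedule and the length-independent segment weighting described above can be sketched as follows. This is a minimal illustration based on the summary, not the authors' implementation; the function names, the `w_think_start`/`w_think_end` parameterization, and the choice to average losses per segment before combining are all assumptions.

```python
import math

def scale_weights(step, total_steps, w_think_start=1.0, w_think_end=0.0):
    """Cosine schedule: the weight on <think> tokens decays from its start
    value to its end value over training, while the <answer> weight rises
    correspondingly (hypothetical parameterization)."""
    cos_term = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    w_think = w_think_end + (w_think_start - w_think_end) * cos_term
    w_answer = 1.0 - w_think  # answer weight grows as think weight decays
    return w_think, w_answer

def scale_loss(token_losses, segments, w_think, w_answer):
    """Length-independent weighting: average the per-token losses within
    each segment first, then combine, so a long <think> trace cannot
    overshadow a short <answer> segment."""
    think = [l for l, s in zip(token_losses, segments) if s == "think"]
    answer = [l for l, s in zip(token_losses, segments) if s == "answer"]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return w_think * mean(think) + w_answer * mean(answer)
```

The per-segment averaging is what makes the weighting length-independent: with vanilla SFT's token-level mean, a 500-token <think> trace would contribute 50x the gradient mass of a 10-token <answer>.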

Original Abstract

Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.

Tags

Vision-Language Models · Chain of Thought · Loss Functions · Imbalanced Learning

arXiv Categories

cs.AI