Agent Tuning & Optimization (Relevance: 8/10)

Replaying pre-training data improves fine-tuning

Suhas Kotha, Percy Liang
arXiv: 2603.04964v1 Published: 2026-03-05 Updated: 2026-03-05

AI Summary

The paper finds that replaying pre-training data during the fine-tuning stage significantly improves both performance and data efficiency on the target task.

Key Contributions

  • Proposes replaying generic pre-training data during the fine-tuning stage
  • Quantifies the performance gains on target tasks from replaying pre-training data
  • Analyzes how different data schedules affect the benefit of replay

Methodology

In a controlled pre-training environment, fine-tuning with and without replayed pre-training data is compared, and the findings are then validated on practical tasks.

Original Abstract

To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.
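The core recipe in the abstract is a data-mixing schedule: during fine-tuning, generic pre-training batches are "replayed" alongside target-domain batches. A minimal sketch of such a schedule is below; the function name, the probabilistic interleaving, and the 0.5 replay ratio are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import random

def mix_replay_schedule(target_batches, generic_batches,
                        replay_ratio=0.5, seed=0):
    """Interleave target-domain batches with replayed generic batches.

    Before each target batch, a generic (pre-training) batch is inserted
    with probability `replay_ratio`. All target batches are kept in order,
    so the target data is fully consumed while generic data is mixed in.
    """
    rng = random.Random(seed)
    generic_iter = iter(generic_batches)
    schedule = []
    for target in target_batches:
        if rng.random() < replay_ratio:
            replayed = next(generic_iter, None)  # stop replaying if exhausted
            if replayed is not None:
                schedule.append(("generic", replayed))
        schedule.append(("target", target))
    return schedule

# Example: 8 target batches mixed with up to 8 generic replay batches.
mixed = mix_replay_schedule(list(range(8)), list(range(8)), replay_ratio=0.5)
```

In practice the ratio of generic to target data would be tuned; the paper's finding is that a nonzero replay ratio can help the target task itself, not just prevent forgetting of the generic domain.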

Tags

Fine-tuning, Pre-training, Data efficiency, Transfer learning

arXiv Categories

cs.CL cs.LG