Progressive Residual Warmup for Language Model Pretraining
AI Summary
ProRes introduces a progressive residual warmup method that staggers the order in which layers warm up, improving the stability and convergence speed of language model pretraining.
Main Contributions
- Proposes the Progressive Residual Warmup (ProRes) method
- Experimentally demonstrates the effectiveness of ProRes across different model scales
- Analysis shows that ProRes stabilizes pretraining and yields faster convergence and better generalization
Methodology
ProRes scales each layer's residual connection by a per-layer warmup coefficient, so that shallow layers learn and stabilize first while deeper layers are gradually brought into training, improving the optimization process.
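The per-layer schedule can be sketched as follows. This is an illustrative assumption, not the paper's exact implementation: the function name `prores_scale`, the linear ramp, and the depth-proportional warmup length `base_warmup * (layer_idx + 1)` are all hypothetical choices that merely instantiate the "early layer learns first" idea.

```python
def prores_scale(step: int, layer_idx: int, base_warmup: int = 1000) -> float:
    """Hypothetical per-layer residual warmup scalar (a sketch, not the
    paper's exact schedule).

    Each layer's residual multiplier ramps linearly from 0 to 1.
    Deeper layers (larger layer_idx) get proportionally longer warmup,
    so they start contributing only after earlier layers have had time
    to stabilize.
    """
    warmup_steps = base_warmup * (layer_idx + 1)  # deeper -> longer warmup
    return min(1.0, step / warmup_steps)


def residual_block(x, sublayer_out, step: int, layer_idx: int):
    """Apply the warmup scalar to a residual branch: x + s * f(x)."""
    s = prores_scale(step, layer_idx)
    return x + s * sublayer_out
```

Under this sketch, layer 0 reaches full residual strength at step 1000, while layer 11 of a 12-layer model would reach it only at step 12000.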
Original Abstract
Transformer architectures serve as the backbone for most modern Large Language Models, so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers warming up over more steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.