Multi-Token Prediction via Self-Distillation
AI Summary
Converts a pretrained language model into a fast multi-token prediction model via self-distillation, with no additional components required.
Key Contributions
- Proposes a new method for multi-token prediction
- Requires no training of an auxiliary verifier model
- Optimizes with an online distillation objective
Methodology
Using online distillation, a single-token prediction model is converted into a multi-token prediction model while the model architecture is kept unchanged.
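The core idea can be sketched as a distillation loss: the frozen initial checkpoint (the teacher) produces soft targets for the next k tokens by ordinary sequential decoding, and the student, sharing the same architecture, is trained to match all k distributions from a single forward pass. The function below is a minimal illustrative sketch, assuming a KL-divergence objective over per-position logits; the shapes and the exact divergence are assumptions, not the paper's stated formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student), averaged over the k predicted positions.

    student_logits: (k, vocab) array -- the student's parallel predictions
        for the next k tokens from one forward pass.
    teacher_logits: (k, vocab) array -- soft targets from the frozen
        single-token checkpoint, gathered over k sequential decode steps.
    Hypothetical formulation for illustration only.
    """
    p = softmax(teacher_logits)                       # teacher targets
    log_q = np.log(softmax(student_logits) + 1e-12)   # student log-probs
    log_p = np.log(p + 1e-12)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

When the student's logits match the teacher's, the loss is zero; any divergence at any of the k positions increases it, so minimizing this objective pushes the parallel predictions toward the sequential teacher's behavior.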
Original Abstract
Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $<5\%$ drop in accuracy relative to single token decoding performance.