Agent Tuning & Optimization Relevance: 8/10

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
arXiv: 2602.12125v1 Published: 2026-02-12 Updated: 2026-02-12

AI Summary

Proposes the Generalized On-Policy Distillation framework (G-OPD) and improves student model performance through reward extrapolation (ExOPD) and reward correction.

Main Contributions

  • Theoretically shows that OPD is a special case of KL-constrained RL (a formula sketch follows this list)
  • Proposes the Generalized On-Policy Distillation framework (G-OPD), covering reward extrapolation (ExOPD) and reward correction
  • Experiments show that ExOPD and reward correction effectively improve student model performance on math reasoning and code generation tasks
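A hedged formula sketch of the generalized objective implied by the abstract; the symbols (student $\pi_\theta$, teacher $\pi_T$, reference model $\pi_{\text{ref}}$, reward scaling factor $\alpha$) are inferred from the abstract's description, not copied from the paper, and the exact notation there may differ:

```latex
% Hedged sketch of the G-OPD objective; notation inferred from the abstract.
\[
\mathcal{J}_{\text{G-OPD}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \sum_t \Bigg[
      \alpha \log \frac{\pi_T(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}
      \;-\; \mathrm{KL}\!\Big( \pi_\theta(\cdot \mid x, y_{<t}) \,\Big\|\, \pi_{\text{ref}}(\cdot \mid x, y_{<t}) \Big)
    \Bigg]
\]
```

With $\alpha = 1$ the $\pi_{\text{ref}}$ terms cancel and the objective reduces to minimizing the per-token reverse KL to the teacher on student-generated trajectories, i.e. standard OPD, which is the "special case" claim in the first bullet; $\alpha > 1$ is reward extrapolation (ExOPD).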

Methodology

Theoretically analyzes OPD, then proposes the G-OPD framework, which optimizes the knowledge distillation process by adjusting the reward scale and the reference model.
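A minimal PyTorch-style sketch of such a loss, assuming only what the abstract states: per-token logits from the student, the teacher, and a reference model on the same student-sampled trajectory. The function name `g_opd_loss`, its arguments, and the exact loss form are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def g_opd_loss(student_logits, teacher_logits, ref_logits, alpha=1.0, mask=None):
    """Sketch of a generalized on-policy distillation (G-OPD) style loss.

    All logits have shape (batch, seq_len, vocab) and are evaluated on the
    same student-generated trajectory. alpha is the reward scaling factor:
    alpha = 1 recovers standard OPD (per-token reverse KL to the teacher),
    alpha > 1 corresponds to reward extrapolation (ExOPD).
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1).detach()
    ref_logp = F.log_softmax(ref_logits, dim=-1).detach()
    student_p = student_logp.exp()

    # Expected per-token "reward" under the student policy: log-likelihood
    # ratio of the teacher against the reference model.
    reward = (student_p * (teacher_logp - ref_logp)).sum(-1)

    # Per-token KL(student || reference) regularizer.
    kl_to_ref = (student_p * (student_logp - ref_logp)).sum(-1)

    # Maximize alpha * reward - KL, i.e. minimize its negative. With alpha = 1
    # the reference terms cancel and this equals KL(student || teacher),
    # the standard OPD objective on student-sampled tokens.
    per_token_loss = kl_to_ref - alpha * reward
    if mask is not None:
        return (per_token_loss * mask).sum() / mask.sum()
    return per_token_loss.mean()
```

The dense per-token form avoids a REINFORCE-style estimator: both the reward and the KL term are computed analytically over the vocabulary at each position, with gradients flowing only through the student distribution.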

Original Abstract

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
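A hypothetical usage note for the sketch above, mapping the abstract's two findings onto choices of `ref_logits` and `alpha` (the variable names and the alpha value are placeholders, not values from the paper):

```python
# (1) ExOPD: reward extrapolation with alpha > 1. The abstract does not pin
# down the default reference; the current student (detached) is one
# plausible placeholder choice.
loss_exopd = g_opd_loss(student_logits, teacher_logits,
                        ref_logits=student_logits.detach(), alpha=2.0)

# (2) Reward correction (strong-to-weak distillation): use the teacher's
# pre-RL base model as the reference, so the reward log(pi_T / pi_T_base)
# reflects what the teacher actually gained from RL. This assumes access to
# that base model and costs an extra forward pass per batch.
loss_corrected = g_opd_loss(student_logits, teacher_logits,
                            ref_logits=teacher_base_logits, alpha=2.0)
```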

Tags

Knowledge Distillation · Reinforcement Learning · Reward Extrapolation · Model Optimization

arXiv Categories

cs.LG cs.AI cs.CL