Agent Tuning & Optimization Relevance: 8/10

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao

arXiv: 2603.25562v1 Published: 2026-03-26 Updated: 2026-03-26

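AI Summary

The paper analyzes the failure modes of on-policy distillation (OPD) and proposes fixes that improve the stability and performance of LLM post-training.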

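Key Contributions

Identifies three failure modes of sampled-token OPD
Proposes teacher top-K local support matching as a fix
Shows experimentally that the fix performs better on math reasoning and multi-task training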

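Methodology

Analyzes the bias and variance of token-level OPD theoretically, verifies its failure modes experimentally, and proposes a truncated reverse-KL objective as the fix.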
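
For orientation, the two objectives being compared can be written out as follows. This is a sketch based only on the abstract, not the paper's own notation: the symbols $\pi_\theta$ (student), $\pi_T$ (teacher), $x$ (prompt), $y$ (student rollout), and $S_K(t)$ (the teacher's top-K token set at step $t$) are assumptions introduced here, and the renormalization over $S_K(t)$ is one plausible reading of "truncated" that the abstract does not confirm.

```latex
% Sequence-level reverse KL between student and teacher (chain rule over tokens)
\mathcal{L}_{\mathrm{seq}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_T(y_t \mid x, y_{<t})}
    \right]

% Truncated token-level reverse KL on the teacher's top-K support
% (\tilde{\pi} denotes a distribution renormalized over S_K(t); assumed here)
\mathcal{L}_{\mathrm{top}\text{-}K}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|} \sum_{v \in S_K(t)}
      \tilde{\pi}_\theta(v \mid x, y_{<t})
      \log \frac{\tilde{\pi}_\theta(v \mid x, y_{<t})}{\tilde{\pi}_T(v \mid x, y_{<t})}
    \right]
```

The sampled-token variant replaces the inner comparison with the log-ratio at the single sampled token $y_t$, which is the one-token signal the paper describes as fragile.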

Original Abstract

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
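
The objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `topk_reverse_kl` and `top_p_filter`, the argument layout, and the choice to renormalize both distributions over the teacher's top-K set are all assumptions made for illustration. Special-token masking is modeled by dropping positions whose sampled token id is in a caller-supplied set.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def topk_reverse_kl(student_logits, teacher_logits, k,
                    token_ids=None, special_ids=()):
    """Mean reverse KL restricted to the teacher's top-k tokens per position.

    student_logits, teacher_logits: arrays of shape [T, V] computed on the
    same student-generated prefix. token_ids (shape [T]) and special_ids are
    hypothetical inputs used to mask positions holding special tokens.
    """
    student_logits = np.asarray(student_logits, dtype=np.float64)
    teacher_logits = np.asarray(teacher_logits, dtype=np.float64)

    # Indices of the k largest teacher logits at each position (unordered).
    topk = np.argpartition(-teacher_logits, k - 1, axis=-1)[..., :k]

    # Assumption: both distributions are renormalized over the top-k support.
    s = log_softmax(np.take_along_axis(student_logits, topk, axis=-1))
    t = log_softmax(np.take_along_axis(teacher_logits, topk, axis=-1))

    # Reverse KL: expectation under the student of log(student / teacher).
    kl = (np.exp(s) * (s - t)).sum(axis=-1)

    # Drop positions whose sampled token is a special token.
    keep = np.ones(kl.shape, dtype=bool)
    if token_ids is not None:
        keep = ~np.isin(np.asarray(token_ids), list(special_ids))
    return float(kl[keep].mean()) if keep.any() else 0.0


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose mass reaches p, then renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    n_keep = min(int(np.searchsorted(cum, p)) + 1, probs.size)
    kept = np.zeros_like(probs)
    kept[order[:n_keep]] = probs[order[:n_keep]]
    return kept / kept.sum()
```

In a training loop one would sample each rollout token from `top_p_filter(student_probs, p)`, score the resulting prefix with both models, and minimize `topk_reverse_kl` with an autodiff framework. NumPy is used here only to keep the arithmetic visible.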

arXiv Categories

cs.LG cs.AI cs.CL
