Agent Tuning & Optimization Relevance: 8/10

Efficient and Stable Reinforcement Learning for Diffusion Language Models

Jiawei Liu, Xiting Wang, Yuanyuan Zhong, Defu Lian, Yu Yang

arXiv: 2602.08905v1 Published: 2026-02-09 Updated: 2026-02-09


AI Summary

Proposes the Spatio-Temporal Pruning (STP) framework, which improves the efficiency and stability of reinforcement learning for diffusion-based LLMs.

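Key Contributions

Proposes the Spatio-Temporal Pruning (STP) framework
Compresses redundancy in the generative process through spatial pruning and temporal pruning
Provides a theoretical analysis showing that STP reduces the variance of the log-likelihood estimate, which yields more stable policy updates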

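Methodology

Spatial pruning constrains the exploration space using static priors, while temporal pruning skips redundant late-stage refinement steps; together they improve efficiency and stability.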
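The two pruning ideas can be illustrated with a toy masked-decoding loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_decode`, its `frozen` and `stop_after` parameters, the even unmasking schedule, and the single final fill pass are all hypothetical, and a random choice stands in for a model prediction. It only shows how fixing some positions up front and cutting the loop short reduce the number of sampled positions and model calls.

```python
import random

def toy_decode(length, total_steps, frozen=None, stop_after=None, seed=0):
    """Toy masked-decoding loop; illustrative only, not the STP implementation.

    frozen     -- hypothetical {position: token} map fixed before decoding,
                  standing in for a static prior (spatial pruning).
    stop_after -- hypothetical cap on iterative steps; any positions still
                  masked are filled in one final pass (temporal pruning).
    Returns (sequence, model_calls, positions_sampled).
    """
    rng = random.Random(seed)
    frozen = frozen or {}
    seq = [frozen.get(i) for i in range(length)]  # None marks a masked slot
    budget = total_steps if stop_after is None else min(stop_after, total_steps)
    calls = sampled = 0
    for step in range(budget):
        masked = [i for i, tok in enumerate(seq) if tok is None]
        if not masked:
            break
        calls += 1
        # Spread the remaining masked slots evenly over the remaining steps.
        k = max(1, -(-len(masked) // (total_steps - step)))
        for i in rng.sample(masked, k):
            seq[i] = rng.choice("abcd")  # random stand-in for a model prediction
            sampled += 1
    # Temporal pruning stand-in: fill whatever is still masked in one pass.
    leftover = [i for i, tok in enumerate(seq) if tok is None]
    if leftover:
        calls += 1
        for i in leftover:
            seq[i] = rng.choice("abcd")
            sampled += 1
    return seq, calls, sampled
```

In this toy setting with `length=8` and `total_steps=8`, the unpruned loop makes 8 calls, freezing two positions brings that to 6, and additionally passing `stop_after=4` brings it to 5. These counts describe the sketch only and say nothing about the paper's measured results.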

Original Abstract

Reinforcement Learning (RL) is crucial for unlocking the complex reasoning capabilities of Diffusion-based Large Language Models (dLLMs). However, applying RL to dLLMs faces unique challenges in efficiency and stability. To address these challenges, we propose Spatio-Temporal Pruning (STP), a framework designed to simultaneously improve the efficiency and stability of RL for dLLMs. STP compresses the redundancy in the generative process through: (1) spatial pruning, which constrains the exploration space using static priors; and (2) temporal pruning, which bypasses redundant late-stage refinement steps. Our theoretical analysis demonstrates that STP strictly reduces the variance of the log-likelihood estimation, thereby ensuring more stable policy updates. Extensive experiments demonstrate that STP surpasses state-of-the-art baselines in both efficiency and accuracy. Our code is available at https://github.com/Lolo1222/STP.

arXiv Categories

cs.AI
