Not All Negative Samples Are Equal: LLMs Learn Better from Plausible Reasoning
AI Summary
Proposes the Plausible Negative Samples (PNS) method, which synthesizes high-quality negative samples to improve the reasoning capability of LLMs.
Key Contributions
- Proposes the Plausible Negative Samples (PNS) method
- Uses reverse reinforcement learning to generate high-quality negative samples
- Validates PNS as an effective data source for preference optimization
Methodology
A dedicated model is trained via reverse reinforcement learning to generate negative samples that follow the expected format but ultimately yield incorrect answers; these samples are then used to train the LLM through preference optimization.
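The abstract describes a composite reward combining format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation. A minimal sketch of how such a scalar reward might be assembled is below; all function names, the `\boxed{}` answer convention, and the equal weighting are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a PNS-style composite reward (names and weights
# are assumptions, not taken from the paper). A plausible negative sample
# should follow the expected format, reach a WRONG final answer
# (accuracy inversion), and still look convincing to external judges.

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    end = text.find("}", start)
    return text[start + len("\\boxed{"):end] if end != -1 else None

def composite_reward(response: str, gold_answer: str,
                     rm_score: float, cot_score: float,
                     weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine four signals into one scalar reward in [0, 1]."""
    # 1. Format compliance: the response must contain a boxed final answer.
    format_ok = 1.0 if "\\boxed{" in response else 0.0

    # 2. Accuracy inversion: reward answers that are present but INCORRECT.
    answer = extract_boxed(response)
    inverted_acc = 1.0 if answer is not None and answer != gold_answer else 0.0

    # 3./4. rm_score and cot_score are plausibility judgments in [0, 1]
    # from a reward model and a chain-of-thought evaluator, respectively.
    w_fmt, w_inv, w_rm, w_cot = weights
    return (w_fmt * format_ok + w_inv * inverted_acc
            + w_rm * rm_score + w_cot * cot_score)
```

Under this scheme, a well-formatted response with a wrong answer that still scores highly with the external judges receives the largest reward, which is exactly the kind of "nearly indistinguishable" negative sample the paper targets.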
Original Abstract
Learning from negative samples holds great promise for improving Large Language Model (LLM) reasoning capability, yet existing methods treat all incorrect responses as equally informative, overlooking the crucial role of sample quality. To address this, we propose Plausible Negative Samples (PNS), a method that synthesizes high-quality negative samples exhibiting expected format and structural coherence while ultimately yielding incorrect answers. PNS trains a dedicated model via reverse reinforcement learning (RL) guided by a composite reward combining format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation, generating responses nearly indistinguishable from correct solutions. We further validate PNS as a plug-and-play data source for preference optimization across three backbone models on seven mathematical reasoning benchmarks. Results demonstrate that PNS consistently outperforms other negative sample synthesis methods, achieving an average improvement of 2.03% over RL-trained models.