Not All Negative Samples Are Equal: LLMs Learn Better from Plausible Reasoning
AI Summary
Proposes the Plausible Negative Samples (PNS) method, which synthesizes high-quality negative samples to improve the reasoning capability of LLMs.
Key Contributions
- Proposes the Plausible Negative Samples (PNS) method
- Uses reverse reinforcement learning to generate high-quality negative samples
- Validates PNS as an effective data source for preference optimization
Methodology
A dedicated model is trained via reverse reinforcement learning to generate negative samples that follow the expected format but ultimately yield incorrect answers; these samples are then used to train the LLM through preference optimization.
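The abstract describes a composite reward combining format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation. A minimal sketch of how such a scalar reward might be assembled is below; all function names, the `\boxed{}` answer convention, and the equal weighting are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a PNS-style composite reward (names and weights
# are assumptions, not taken from the paper). A plausible negative sample
# should follow the expected format, reach a WRONG final answer
# (accuracy inversion), and still look convincing to external judges.

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    end = text.find("}", start)
    return text[start + len("\\boxed{"):end] if end != -1 else None

def composite_reward(response: str, gold_answer: str,
                     rm_score: float, cot_score: float,
                     weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine four signals into one scalar reward in [0, 1]."""
    # 1. Format compliance: the response must contain a boxed final answer.
    format_ok = 1.0 if "\\boxed{" in response else 0.0

    # 2. Accuracy inversion: reward answers that are present but INCORRECT.
    answer = extract_boxed(response)
    inverted_acc = 1.0 if answer is not None and answer != gold_answer else 0.0

    # 3./4. rm_score and cot_score are plausibility judgments in [0, 1]
    # from a reward model and a chain-of-thought evaluator, respectively.
    w_fmt, w_inv, w_rm, w_cot = weights
    return (w_fmt * format_ok + w_inv * inverted_acc
            + w_rm * rm_score + w_cot * cot_score)
```

Under this scheme, a well-formatted response with a wrong answer that still scores highly with the external judges receives the largest reward, which is exactly the kind of "nearly indistinguishable" negative sample the paper targets.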
Original Abstract
Learning from negative samples holds great promise for improving Large Language Model (LLM) reasoning capability, yet existing methods treat all incorrect responses as equally informative, overlooking the crucial role of sample quality. To address this, we propose Plausible Negative Samples (PNS), a method that synthesizes high-quality negative samples exhibiting expected format and structural coherence while ultimately yielding incorrect answers. PNS trains a dedicated model via reverse reinforcement learning (RL) guided by a composite reward combining format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation, generating responses nearly indistinguishable from correct solutions. We further validate PNS as a plug-and-play data source for preference optimization across three backbone models on seven mathematical reasoning benchmarks. Results demonstrate that PNS consistently outperforms other negative sample synthesis methods, achieving an average improvement of 2.03% over RL-trained models.