Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning
AI Summary
Introduces Apriel-Reasoner, which improves general-purpose reasoning ability and lowers inference cost through a reproducible RL post-training recipe.
Key Contributions
- A reproducible multi-domain RL post-training recipe
- An adaptive domain sampling mechanism
- A difficulty-aware length penalty
- Reduced reasoning token cost
Methodology
Apriel-Base is post-trained with RL, combining adaptive domain sampling with a difficulty-aware length penalty to optimize the model's performance across multiple reasoning tasks.
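To make the adaptive domain sampling idea concrete, here is a minimal Python sketch of one way to preserve target domain ratios when domains complete rollouts at different rates. This is an assumption-laden illustration, not the paper's published algorithm: the domain names mirror the five training domains, but the target ratios, the class name, and the deficit-based selection rule are all hypothetical.

```python
# Hypothetical sketch of adaptive domain sampling. The ratios and the
# deficit-greedy rule below are assumptions for illustration only.
TARGET_RATIOS = {
    "math": 0.35,
    "code": 0.25,
    "instruction_following": 0.15,
    "logical_puzzles": 0.15,
    "function_calling": 0.10,
}

class AdaptiveDomainSampler:
    """Pick the domain whose realized share of usable samples lags its
    target ratio the most, so heterogeneous rollout lengths and filtering
    rates do not skew the effective domain mixture."""

    def __init__(self, target_ratios):
        self.target = dict(target_ratios)
        self.counts = {d: 0 for d in target_ratios}

    def next_domain(self):
        # +1 accounts for the sample about to be drawn.
        total = sum(self.counts.values()) + 1
        # Deficit = how far each domain's realized share falls below target.
        deficits = {d: self.target[d] - self.counts[d] / total
                    for d in self.target}
        return max(deficits, key=deficits.get)

    def record(self, domain, num_usable_rollouts):
        # Count only rollouts that actually contribute to training, so
        # filtered or failed generations do not inflate a domain's share.
        self.counts[domain] += num_usable_rollouts

sampler = AdaptiveDomainSampler(TARGET_RATIOS)
domain = sampler.next_domain()
# ... generate rollouts for a prompt from `domain`, keep the usable ones ...
sampler.record(domain, num_usable_rollouts=4)
```

The design intuition is that fast domains cannot crowd out slow ones: whenever a domain's realized share drifts below its target, it is prioritized until the mixture recovers.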
Original Abstract
Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
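The difficulty-aware length penalty described in the abstract can be pictured as scaling a standard length penalty by how easy a prompt appears to be. Below is a minimal sketch assuming the per-prompt group pass rate from the RL rollouts serves as the difficulty signal, which would be consistent with the "no additional training overhead" claim since pass rates are already computed for the reward; the function name, signature, and exact functional form are hypothetical, not the paper's formula.

```python
def difficulty_aware_length_penalty(response_len: int,
                                    group_pass_rate: float,
                                    budget: int = 16_384,
                                    alpha: float = 0.5) -> float:
    """Hypothetical sketch: scale a standard length penalty by an
    estimated problem easiness.

    group_pass_rate: fraction of correct rollouts for this prompt in
    the current RL group (already available from reward computation,
    so the difficulty estimate costs nothing extra).
    """
    easiness = group_pass_rate            # 1.0 = trivial, 0.0 = unsolved
    per_length_weight = alpha * easiness  # heavier penalty on easy prompts
    # Penalize length relative to the 16K-token training budget: easy
    # prompts are pushed toward short traces, while hard prompts (low
    # pass rate) are barely penalized and may reason longer.
    return -per_length_weight * min(response_len / budget, 1.0)
```

Under this shaping, the reward gradient discourages verbosity exactly where long chains of thought add little accuracy, which is one plausible mechanism behind the reported 30-50% shorter traces at matched or improved benchmark scores.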