BFS-PO: Best-First Search for Large Reasoning Models
AI Summary
The BFS-PO algorithm uses a best-first search strategy to shorten the reasoning chains of large reasoning models, improving accuracy while reducing verbose output.
Main Contributions
- Proposes the BFS-PO algorithm to address the overthinking problem in LRMs
- Uses a backtracking mechanism based on maximum-entropy nodes to find the shortest correct answer
- Validates the algorithm's effectiveness on multiple benchmarks, improving accuracy while shortening answers
Methodology
BFS-PO uses a best-first search exploration strategy combined with a backtracking mechanism based on maximum-entropy nodes to generate progressively shorter responses during training, thereby learning concise reasoning chains.
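The search-with-backtracking idea can be illustrated with a toy sketch. Everything below is hypothetical: the tree, the `children_of` callback, and the function names are illustrative stand-ins, and the real BFS-PO operates over token-level sampling from an LRM and couples the search with RL training, which this sketch omits. The sketch only shows the core mechanics: expand the frontier node whose branching distribution has the highest entropy first (so a dead end implicitly "backtracks" to the most uncertain node), and keep the shortest correct answer found.

```python
import heapq
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def best_first_shortest(start, children_of, is_correct):
    """Toy best-first search preferring the shortest correct answer.

    `children_of(node)` returns a list of (child, prob) pairs, or [] at a
    dead end. Children inherit a priority equal to the entropy of their
    parent's branching distribution, so popping the heap after a dead end
    resumes from the highest-entropy frontier node (the "backtracking").
    """
    best = None          # (depth, node) of the shortest correct answer so far
    counter = 0          # tie-breaker so heap entries never compare nodes
    frontier = [(0.0, 0, counter, start)]  # (-entropy, depth, counter, node)
    while frontier:
        _, depth, _, node = heapq.heappop(frontier)
        if is_correct(node):
            if best is None or depth < best[0]:
                best = (depth, node)
            continue                      # stop expanding past a correct answer
        if best is not None and depth >= best[0]:
            continue                      # prune: cannot beat the shortest answer
        kids = children_of(node)
        if not kids:
            continue                      # dead end: heap pop = backtrack
        h = entropy([p for _, p in kids])
        for child, _ in kids:
            counter += 1
            heapq.heappush(frontier, (-h, depth + 1, counter, child))
    return None if best is None else best[1]

# Hypothetical generation tree: keys are partial "answers", values list
# (continuation, probability) pairs; two leaves are correct, one shorter.
tree = {
    "":   [("a", 0.5), ("b", 0.5)],
    "a":  [("ax", 1.0)],
    "ax": [("axy", 0.5), ("axz", 0.5)],
    "b":  [("bc", 0.5), ("bd", 0.5)],
}

result = best_first_shortest(
    "",
    lambda n: tree.get(n, []),
    lambda n: n in {"axy", "bc"},   # "bc" is correct and shorter than "axy"
)
print(result)
```

In the paper's setting the pruning step mirrors the training signal: once a short correct response exists, longer continuations are no longer worth exploring, which is what pushes the generated training responses to become progressively shorter.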
Original Abstract
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.