LLM Reasoning relevance: 9/10

Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning

Jie Deng, Hanshuang Tong, Jun Li, Shining Liang, Ning Wu, Hongzhi Li, Yutao Xie
arXiv: 2602.04391v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

TrajFusion improves LLMs' mathematical reasoning by fusing incorrect trajectories with reflection prompts.

Key Contributions

  • Proposes TrajFusion, an improved fine-tuning strategy
  • Reframes rejection sampling as a structured supervision-construction process
  • Outperforms RFT on mathematical reasoning benchmarks

Methodology

Fused trajectories are constructed by interleaving incorrect trajectories, reflection prompts, and correct trajectories; the length of each fused sample is adaptively controlled to provide richer supervision for LLM fine-tuning.
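The interleaving idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function name `build_fused_sample`, the `REFLECTION_PROMPT` string, and the deduplication-based length control are all assumptions made for the example.

```python
# Hypothetical sketch of TrajFusion-style fused-trajectory construction.
# Names and the reflection wording are illustrative, not from the paper.

REFLECTION_PROMPT = "Wait, that approach was wrong. Let me reconsider."

def build_fused_sample(question, incorrect, correct, max_errors=2):
    """Interleave selected incorrect trajectories with reflection prompts,
    ending with a correct trajectory.

    When no informative incorrect trajectories are available, the result
    reduces to question + correct trajectory, i.e. vanilla RFT supervision.
    """
    # Keep only distinct errors so sample length scales with error
    # diversity (a crude stand-in for the paper's adaptive length control).
    distinct_errors = list(dict.fromkeys(incorrect))[:max_errors]

    segments = [question]
    for err in distinct_errors:
        segments.append(err)
        segments.append(REFLECTION_PROMPT)
    segments.append(correct)
    return "\n".join(segments)
```

With duplicate errors in the teacher samples, only one error segment survives deduplication, and with an empty error list the sample collapses to the plain RFT target.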

Original Abstract

Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.

Tags

LLM, Mathematical Reasoning, Fine-tuning, Trajectory Fusion

arXiv Categories

cs.CL