LLM Reasoning relevance: 9/10

QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning

Doyeon Lee, Eunyi Lyou, Hyunsoo Cho, Sookyung Kim, Joonseok Lee, Jaemoo Choi
arXiv: 2602.04620v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

QUATRO achieves stable and controllable policy optimization for LLMs by directly enforcing trust-region constraints.

Key Contributions

  • Proposes the Query-Adaptive Trust-Region Policy Optimization (QUATRO) algorithm
  • Directly enforces trust-region constraints through a principled optimization
  • Validates QUATRO's stability and effectiveness on mathematical reasoning tasks

Methodology

Builds on GRPO-style reinforcement learning, but optimizes the trust-region constraint directly rather than through heuristic approximations, avoiding the brittleness they introduce and yielding more stable training.
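The contrast between heuristic clipping and a directly enforced trust-region term can be sketched numerically. This is an illustrative toy, not QUATRO's actual objective: the KL-style penalty form below is assumed for exposition only, and the function names are hypothetical.

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    # GRPO/PPO-style heuristic: clip the importance ratio globally.
    # For a sample whose ratio falls outside [1-eps, 1+eps], the clipped
    # branch is selected and the objective becomes flat in the ratio,
    # so that sample is no longer regulated by the update.
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def kl_penalized_surrogate(ratio, adv, beta=0.5):
    # Explicit trust-region-style objective: subtract a per-sample
    # KL-like penalty, r*log(r) - (r - 1), which is nonnegative and grows
    # with distance from ratio = 1, so every sample keeps pulling the
    # policy back toward the trust region.
    penalty = ratio * np.log(ratio) - (ratio - 1.0)
    return ratio * adv - beta * penalty

# A sample far outside the clipping range (ratio = 2.0, positive advantage):
ratio, adv, h = 2.0, 1.0, 1e-6
grad_clip = (clipped_surrogate(ratio + h, adv)
             - clipped_surrogate(ratio - h, adv)) / (2 * h)
grad_kl = (kl_penalized_surrogate(ratio + h, adv)
           - kl_penalized_surrogate(ratio - h, adv)) / (2 * h)
print(grad_clip)  # the clipped objective gives zero gradient here
print(grad_kl)    # the penalized objective still regulates this sample
```

The finite-difference gradients make the abstract's point concrete: once the ratio leaves the clipping range, the clipped surrogate stops providing any corrective signal, while an explicit trust-region term does not.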

Original Abstract

GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Relying on heuristic trust-region approximations, however, they can lead to brittle optimization behavior, as global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region Policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with stabilizer terms arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.

Tags

LLM Fine-tuning · Reinforcement Learning · Trust Region Policy Optimization · Mathematical Reasoning

arXiv Category

cs.LG