Agent Tuning & Optimization Relevance: 9/10

Fibration Policy Optimization

Chang Li, Tshihao Tsu, Yaren Zhang, Chao Xue, Xiaodong He
arXiv: 2603.08239v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

Proposes Fibration Policy Optimization (FiberPO), a multi-scale stability-control framework that unifies LLM policy optimization.

Key Contributions

  • Proposed the Aggregational Policy Censoring Objective (APC-Obj)
  • Developed the Fiber Bundle Gating (FBG) framework
  • Proposed the Fibration Policy Optimization (FiberPO) algorithm
  • Designed the Fibration Gating Hierarchy (FGH)

Methodology

Derives APC-Obj from TV-TRPO, constructs a fiber-bundle decomposition of ratio gating, designs the FiberPO objective, and extends it to the hierarchical FGH structure.
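The paper itself is not reproduced here, but the core idea of decomposing ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function name `fiberpo_gate`, the choice of the geometric mean as the trajectory aggregate, and the clipping budgets `eps_base`/`eps_fiber` are hypothetical and not taken from the paper.

```python
import math

def fiberpo_gate(token_ratios, eps_base=0.2, eps_fiber=0.1):
    """Hypothetical two-level ratio gate in the spirit of Fiber Bundle Gating.

    token_ratios: per-token importance ratios pi_theta / pi_old for one
    trajectory. The base gate clips the trajectory aggregate (here: the
    geometric mean of token ratios); the fiber gate clips each token's
    residual relative to that aggregate. Both choices are illustrative.
    """
    T = len(token_ratios)
    # Base-level aggregate: geometric mean of the per-token ratios.
    base = math.exp(sum(math.log(r) for r in token_ratios) / T)
    gated_base = min(max(base, 1.0 - eps_base), 1.0 + eps_base)
    gated = []
    for r in token_ratios:
        # Fiber-level residual: per-token deviation from the aggregate.
        residual = r / base
        gated_residual = min(max(residual, 1.0 - eps_fiber), 1.0 + eps_fiber)
        gated.append(gated_base * gated_residual)
    return gated

# At on-policy (all ratios == 1) the gate is the identity,
# matching the abstract's "reduces to identity at on-policy".
print(fiberpo_gate([1.0, 1.0, 1.0]))  # -> [1.0, 1.0, 1.0]
```

Because each effective ratio is the product of two clipped factors, every gated value is confined to the band [(1 - eps_base)(1 - eps_fiber), (1 + eps_base)(1 + eps_fiber)], which is one way to realize independent trust-region budgets at the trajectory and token levels.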

Original Abstract

Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agentic pipelines, yet prevalent proximal objectives operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level hierarchical stability control. To bridge this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO, establishing that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled RL data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals, with provable first-order agreement with the true RL objective near on-policy. From APC-Obj and FBG we derive Fibration Policy Optimization (or simply, FiberPO), a concrete objective whose Jacobian is block-diagonal over trajectories, reduces to identity at on-policy, and provides a better update direction, thus improving token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that scales the same gating mechanism to arbitrary hierarchical depth without new primitives, as demonstrated by FiberPO-Domain, a four-level instantiation with independent trust-region budgets at the domain, prompt group, trajectory, and token levels. Together, these results connect trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.
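The abstract's claim that fibrations compose to arbitrary hierarchical depth (domain, prompt group, trajectory, token in FiberPO-Domain) can be sketched as a recursive gate over a nested ratio tree. This is a hypothetical illustration under stated assumptions: the tree representation, the geometric-mean aggregate, and the names `hierarchy_gate` and `eps` (one trust-region budget per level) are mine, not the paper's.

```python
import math

def _leaves(node):
    """Collect the per-token ratios under a (possibly nested) node."""
    if isinstance(node, float):
        return [node]
    out = []
    for child in node:
        out.extend(_leaves(child))
    return out

def _clip(x, eps):
    return min(max(x, 1.0 - eps), 1.0 + eps)

def hierarchy_gate(node, eps, parent_agg=1.0):
    """Hypothetical sketch of a Fibration Gating Hierarchy.

    node: either a float (a per-token ratio) or a list of child nodes,
    e.g. domain -> prompt group -> trajectory -> token.
    eps: one clipping budget per remaining level, top level first.
    Each level gates the residual of its aggregate (geometric mean of
    the leaves below it) against the parent's aggregate, then recurses;
    a leaf's effective ratio is the product of gated residuals along
    its root-to-leaf path.
    """
    leaves = _leaves(node)
    agg = math.exp(sum(math.log(r) for r in leaves) / len(leaves))
    gated_residual = _clip(agg / parent_agg, eps[0])
    if isinstance(node, float):
        return [gated_residual]
    out = []
    for child in node:
        for g in hierarchy_gate(child, eps[1:], parent_agg=agg):
            out.append(g * gated_residual)
    return out

# Three levels here (batch -> trajectory -> token); at on-policy the
# hierarchy reduces to the identity at every level.
tree = [[1.0, 1.0], [1.0]]
print(hierarchy_gate(tree, [0.3, 0.2, 0.1]))  # -> [1.0, 1.0, 1.0]
```

Because the same `_clip` primitive is reused at every level with its own budget, adding a level (e.g. a domain layer above the batch) only deepens the nesting and extends `eps`, mirroring the abstract's point that the hierarchy needs no new primitives.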

Tags

LLM Policy Optimization · Reinforcement Learning · Trust Region

arXiv Categories

cs.LG cs.AI cs.CL