AI Agents relevance: 9/10

Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi
arXiv: 2603.02701v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

Graph-GRPO stabilizes multi-agent topology learning through Group Relative Policy Optimization, improving communication efficiency.

Key Contributions

  • Proposes the Graph-GRPO framework for optimizing multi-agent communication topology
  • Introduces Group Relative Policy Optimization to reduce gradient variance and address the credit assignment problem
  • Outperforms existing methods on reasoning and code generation tasks, improving training stability and identifying critical communication pathways

Methodology

For each query, multiple communication graphs are sampled; rewards are normalized across the sampled group, and each edge's advantage is computed from its relative performance within the group.
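The group-relative normalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the binary edge-mask representation of sampled graphs, and the averaging scheme for per-edge credit are all assumptions for demonstration.

```python
import numpy as np

def group_relative_edge_advantages(rewards, edge_masks, eps=1e-8):
    """Sketch of GRPO-style edge credit assignment for one query.

    rewards[i]       -- scalar reward of the i-th sampled communication graph
    edge_masks[i, e] -- 1 if edge e is present in sampled graph i, else 0
    Returns a per-edge advantage estimate.
    """
    rewards = np.asarray(rewards, dtype=float)
    # GRPO-style normalization: center by the group mean and scale by the
    # group standard deviation, so task-difficulty variance cancels out.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    edge_masks = np.asarray(edge_masks, dtype=float)
    # Credit each edge with the mean advantage of the graphs containing it
    # (assumed scheme; edges absent from every sample get zero advantage).
    counts = edge_masks.sum(axis=0)
    return (edge_masks.T @ adv) / np.maximum(counts, 1.0)
```

For example, with group rewards `[1, 0, 1, 0]` and an edge that appears only in the two successful graphs, that edge receives a positive advantage while an edge appearing only in the failed graphs receives a negative one, even when every graph in the group succeeds or fails uniformly under an absolute-reward scheme.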

Original Abstract

Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.

Tags

Multi-Agent Systems · Topology Learning · Reinforcement Learning · Group Relative Policy Optimization

arXiv Categories

cs.CL