AI Agents relevance: 9/10

Learning Partial Action Replacement in Offline MARL

Yue Jin, Giovanni Montana
arXiv: 2603.28573v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

Proposes PLCQL, a contextual-bandit-based partial action replacement method for offline MARL that improves computational efficiency and performance.

Key Contributions

  • Proposes a contextual-bandit-based partial action replacement strategy
  • Learns the replacement policy with PPO using an uncertainty-weighted reward
  • Proves that the value estimation error scales linearly with the number of deviating agents

Methodology

Models partial action replacement subset selection as a contextual bandit problem, and learns a state-dependent replacement policy with PPO using an uncertainty-weighted reward.
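The two ingredients above can be sketched in a few lines: anchoring a subset of agents to dataset actions, and a bandit reward that trades policy improvement against value uncertainty. This is a minimal illustration under my own assumptions (continuous actions, a random choice of anchored agents); the names `partial_action_replacement`, `uncertainty_weighted_reward`, and `beta` are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def partial_action_replacement(policy_actions, dataset_actions, k, rng=rng):
    """Anchor k randomly chosen agents to their dataset actions;
    the remaining n - k agents deviate to the learned policy's actions."""
    joint = np.array(policy_actions, dtype=float)
    anchored = rng.choice(len(joint), size=k, replace=False)
    joint[anchored] = np.asarray(dataset_actions, dtype=float)[anchored]
    return joint

def uncertainty_weighted_reward(q_improvement, uncertainty, beta=1.0):
    """Bandit reward: policy improvement penalised by value uncertainty
    (beta is a hypothetical trade-off coefficient)."""
    return q_improvement - beta * uncertainty

# With k = n agents anchored, the joint action equals the dataset action.
pi_a = [1.0, 1.0, 1.0, 1.0]
data_a = [0.0, 0.0, 0.0, 0.0]
print(partial_action_replacement(pi_a, data_a, k=4))  # [0. 0. 0. 0.]
```

With k = 0 no agent is anchored and the joint action is fully on-policy, so sweeping k interpolates between conservative (in-distribution) and improved (OOD-prone) joint actions.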

Original Abstract

Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but the existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.
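The "state-dependent PAR policy" in the abstract amounts to a categorical distribution over k ∈ {0, …, n}, the number of agents to anchor at the current update step. The sketch below uses a simple linear-softmax head; the parameterisation and the names `par_bandit_policy`, `W`, `b` are my assumptions (in PLCQL this head would be trained with PPO, which is omitted here).

```python
import numpy as np

def par_bandit_policy(state, W, b):
    """Map a state vector to a categorical distribution over k,
    the number of agents anchored to dataset actions."""
    logits = state @ W + b
    z = logits - logits.max()              # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return probs

n_agents, state_dim = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(state_dim, n_agents + 1))  # one logit per k in {0, ..., n}
b = np.zeros(n_agents + 1)
probs = par_bandit_policy(rng.normal(size=state_dim), W, b)
k = rng.choice(n_agents + 1, p=probs)           # sample how many agents to anchor
```

Because only the sampled k is evaluated per update, a single Q-function evaluation suffices, versus enumerating all n subset configurations as in SPaCQL.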

Tags

Offline MARL · Multi-Agent Reinforcement Learning · Partial Action Replacement · Contextual Bandit

arXiv Categories

cs.LG cs.AI cs.MA