Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
AI Summary
This work studies the robustness of offline multi-agent reinforcement learning from human feedback (MARLHF) when the dataset is adversarially corrupted.
Main Contributions
- Robust estimators under both uniform-coverage and unilateral-coverage assumptions
- A quasi-polynomial-time algorithm for computing coarse correlated equilibria under unilateral coverage
- The first systematic treatment of adversarial data corruption in offline MARLHF
Methodology
Working within the framework of linear Markov games, the authors design robust estimators and accompanying algorithms that tolerate adversarial data corruption.
Original Abstract
We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents' preferences), an $ε$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption - where every policy of interest is sufficiently represented in the clean (prior to corruption) data - we introduce a robust estimator that guarantees an $O(ε^{1 - o(1)})$ bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an $O(\sqrt{ε})$ bound on the Nash gap. Both of these procedures, however, are computationally intractable. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{ε})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
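To make the strong-contamination model concrete, the sketch below simulates the kind of dataset the abstract describes: trajectory-preference tuples with an $n$-dimensional binary label per sample, of which an $ε$-fraction is arbitrarily rewritten by an adversary. All names, dimensions, and the specific corruption (flipping labels, shifting features) are illustrative assumptions; the model permits any modification of the chosen samples.

```python
import numpy as np

rng = np.random.default_rng(0)

n_agents = 3      # n: number of agents (hypothetical)
n_samples = 1000  # |D|: dataset size (hypothetical)
eps = 0.1         # ε: fraction of samples the adversary may corrupt

# Clean dataset: each sample pairs a trajectory representation (here a
# placeholder feature vector) with an n-dimensional binary preference
# label, one bit per agent.
features = rng.normal(size=(n_samples, 8))
prefs = rng.integers(0, 2, size=(n_samples, n_agents))

# Strong contamination: the adversary may inspect the whole dataset and
# replace an eps-fraction of samples arbitrarily. Here we flip every
# preference bit and shift the features of the chosen samples, but the
# model allows any change whatsoever on those rows.
n_corrupt = int(eps * n_samples)
idx = rng.choice(n_samples, size=n_corrupt, replace=False)
prefs[idx] = 1 - prefs[idx]
features[idx] = rng.normal(loc=10.0, size=(n_corrupt, 8))

print(f"corrupted {n_corrupt} of {n_samples} samples")
```

The learner only ever sees the corrupted arrays; the robust estimators in the paper must bound the resulting equilibrium gap as a function of `eps` without knowing which rows were altered.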