OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
AI Summary
The OS-Themis framework improves the quality and scalability of rewards for GUI agents in reinforcement learning by decomposing trajectories and auditing the evidence chain.
Key Contributions
- Proposes OS-Themis, a multi-agent critic framework that improves reward quality for GUI agents
- Introduces the OmniGUIRewardBench benchmark for evaluating GUI outcome rewards
- Experiments show that OS-Themis effectively improves agent performance in both online RL training and self-training loops
Methodology
Trajectories are decomposed into verifiable milestones, and a review mechanism strictly audits the evidence chain before a reward judgment is issued.
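The decompose-then-audit loop can be sketched as a minimal stub; every name here (`Milestone`, `decompose`, `audit`, the trajectory fields) is an illustrative assumption, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Milestone:
    description: str
    evidence: str    # action-log excerpt tied to this milestone (hypothetical field)
    verified: bool

def decompose(trajectory):
    """Split a GUI trajectory into verifiable milestones (stub:
    a milestone counts as verified only if it has supporting evidence)."""
    return [Milestone(step["goal"], step["log"], step["log"] != "")
            for step in trajectory]

def audit(milestones):
    """Reviewer pass: the verdict stands only if every link in the
    evidence chain is verified; any gap voids the reward."""
    return all(m.verified for m in milestones)

def reward(trajectory):
    """Outcome reward: 1.0 only when the audited evidence chain is complete."""
    return 1.0 if audit(decompose(trajectory)) else 0.0

# Toy trajectory: two milestones, both with supporting evidence.
traj = [
    {"goal": "open settings app", "log": "tap(Settings)"},
    {"goal": "enable Wi-Fi",      "log": "toggle(Wi-Fi, on)"},
]
print(reward(traj))  # 1.0
```

The point of the sketch is the ordering: milestone verification isolates the critical evidence first, and the audit gates the final verdict on the whole chain rather than on a single judge's impression.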
Original Abstract
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.