Agent Tuning & Optimization — Relevance: 9/10

SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei
arXiv: 2602.21158v1 Published: 2026-02-24 Updated: 2026-02-24

AI Summary

SELAUR proposes an uncertainty-aware reward mechanism that improves the exploration efficiency and learning stability of LLM agents.

Key Contributions

  • Integrates the LLM's intrinsic uncertainty into the agent's reward design
  • Proposes a token-level uncertainty estimate that combines entropy-, least-confidence-, and margin-based metrics
  • Designs a failure-aware reward reshaping mechanism
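The combined token-level uncertainty estimate can be sketched as follows. The paper does not publish its exact aggregation formula, so the equal-weight average, the max-entropy normalization, and the function name `token_uncertainty` here are assumptions for illustration; only the three constituent metrics (entropy, least confidence, margin) come from the source.

```python
import math

def token_uncertainty(probs, weights=(1/3, 1/3, 1/3)):
    """Combine entropy-, least-confidence-, and margin-based uncertainty
    for one token's next-token probability distribution (hypothetical
    equal-weight aggregation; the paper's exact weighting is not given)."""
    p = sorted(probs, reverse=True)
    # Entropy, normalized to [0, 1] by its maximum over len(p) outcomes.
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    entropy /= math.log(len(p)) if len(p) > 1 else 1.0
    least_conf = 1.0 - p[0]        # low top-1 probability => high uncertainty
    margin = 1.0 - (p[0] - p[1])   # small top-2 gap => high uncertainty
    w_e, w_l, w_m = weights
    return w_e * entropy + w_l * least_conf + w_m * margin
```

For example, a peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` yields a much lower combined score than a uniform `[0.25, 0.25, 0.25, 0.25]`, which is the behavior a confidence-aligned reward signal needs.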

Methodology

SELAUR uses a reinforcement learning framework that injects uncertainty-based reward signals into both step- and trajectory-level rewards to improve agent performance.
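One way the failure-aware reshaping could work is sketched below. The abstract only states that uncertainty signals are injected into step- and trajectory-level rewards and that failed trajectories still carry learning cues; the specific rule here (a flat success reward versus a dense, confidence-scaled signal on failure) and the names `reshape_step_rewards` and `beta` are illustrative assumptions, not the paper's formulation.

```python
def reshape_step_rewards(base_reward, step_uncertainties, success, beta=0.1):
    """Hypothetical failure-aware reward reshaping.

    On a successful trajectory, each step receives the base task reward.
    On a failed trajectory, instead of a flat zero, each step gets a dense
    signal proportional to model confidence (1 - uncertainty), so even
    failures provide step-level supervision."""
    if success:
        return [base_reward for _ in step_uncertainties]
    return [beta * (1.0 - u) for u in step_uncertainties]
```

Under this sketch, a failed trajectory with per-step uncertainties `[0.2, 0.9]` would yield shaped rewards `[0.08, 0.01]` with `beta=0.1`, preserving a gradient signal that favors the confident step.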

Original Abstract

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.

Tags

AI Agent · Reinforcement Learning · Uncertainty · Reward Shaping

arXiv Categories

cs.LG cs.CL