Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
AI Summary
Proposes Reflective Test-Time Planning, which improves the decision-making of embodied LLMs through reflection and strengthens their ability to complete tasks.
Key Contributions
- Introduces two modes of reflection: reflection-in-action and reflection-on-action
- Proposes Retrospective Reflection for long-horizon credit assignment
- Designs the Long-Horizon Household and MuJoCo Cupboard Fitting benchmarks
Methodology
The agent uses test-time scaling to generate candidate actions, updates its model after execution, and combines this with retrospective reflection, allowing it to learn through trial and error.
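The loop described above can be sketched in a toy form. This is a minimal illustration, not the paper's implementation: all names (`ReflectiveAgent`, `propose_candidates`, the tabular scorer, the decayed hindsight credit) are hypothetical stand-ins for the paper's learned reflection model and policy updates.

```python
def propose_candidates(state, n=4):
    # Stand-in for test-time scaling: sample n candidate actions for a state.
    return [f"action_{i}_for_{state}" for i in range(n)]

class ReflectiveAgent:
    """Toy sketch of Reflective Test-Time Planning (illustrative only)."""

    def __init__(self):
        # Internal reflection model, reduced here to a score table per action.
        self.scores = {}

    def reflect_in_action(self, candidates):
        # Reflection-in-action: score candidates BEFORE execution,
        # then commit to the highest-scoring one.
        return max(candidates, key=lambda a: self.scores.get(a, 0.0))

    def reflect_on_action(self, action, reward, lr=0.5):
        # Reflection-on-action: update the internal model AFTER execution,
        # using external feedback (here, a scalar reward).
        old = self.scores.get(action, 0.0)
        self.scores[action] = old + lr * (reward - old)

    def retrospective_reflection(self, trajectory, final_reward, decay=0.9):
        # Re-evaluate earlier decisions with hindsight, assigning decayed
        # credit backwards for long-horizon credit assignment.
        credit = final_reward
        for action in reversed(trajectory):
            self.reflect_on_action(action, credit)
            credit *= decay

agent = ReflectiveAgent()
trajectory = []
for step in range(3):
    candidates = propose_candidates(f"s{step}")
    action = agent.reflect_in_action(candidates)
    trajectory.append(action)
    reward = 1.0  # toy environment: every executed action succeeds
    agent.reflect_on_action(action, reward)

# After the episode, re-credit the whole trajectory with hindsight.
agent.retrospective_reflection(trajectory, final_reward=1.0)
```

In the actual method the score table would be an internal reflection model over candidate plans, and the updates would be test-time training of that model and the action policy rather than a tabular rule.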
Original Abstract
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: *reflection-in-action*, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and *reflection-on-action*, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.