AI Agents Relevance: 9/10

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, Jincheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, Yuchen Elenor Jiang, Wei Wang, He Zhu, Wangchunshu Zhou
arXiv: 2602.09514v1 Published: 2026-02-10 Updated: 2026-02-10

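AI Summary

EcoGym is a generalizable benchmark for evaluating the long-horizon planning capabilities of LLMs in interactive economic environments.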

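Key Contributions

  • Introduces the EcoGym benchmark environment
  • Provides a unified decision-making process with standardized interfaces
  • Evaluates long-term strategic coherence and robustness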

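Methodology

Evaluates the planning and execution capabilities of LLMs across three distinct economic environments (Vending, Freelance, and Operation) and analyzes how they perform in each scenario.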

Original Abstract

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented in a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
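
Illustrative sketch

The abstract describes a day-loop in which an agent takes a budgeted number of actions under partial observability and stochastic dynamics, and is scored on a business outcome such as net worth. A minimal Python sketch of that loop shape follows. Everything in it is an assumption made for illustration: ToyVendingEnv, restock_policy, run, the prices, the demand range, and the budget of three actions per day are hypothetical and are not taken from EcoGym's actual interface or settings.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyVendingEnv:
    """Hypothetical stand-in for a persistent economy; not EcoGym's API."""
    seed: int = 0
    cash: float = 100.0
    stock: int = 0
    day: int = 0
    unit_cost: float = 1.0   # assumed purchase cost per unit
    price: float = 2.0       # assumed sale price per unit
    rng: random.Random = field(init=False, repr=False)

    def __post_init__(self):
        self.rng = random.Random(self.seed)

    def observe(self) -> dict:
        # Partial observability: the agent never sees upcoming demand.
        return {"day": self.day, "cash": self.cash, "stock": self.stock}

    def act(self, action: dict) -> None:
        # The only action in this toy: restock, capped by available cash.
        if action.get("type") == "restock":
            affordable = int(self.cash // self.unit_cost)
            qty = max(0, min(int(action.get("qty", 0)), affordable))
            self.cash -= qty * self.unit_cost
            self.stock += qty

    def end_day(self) -> None:
        # Stochastic demand is settled at the end of each day-loop.
        demand = self.rng.randint(0, 20)
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += sold * self.price
        self.day += 1

    def net_worth(self) -> float:
        # Cash plus inventory valued at cost.
        return self.cash + self.stock * self.unit_cost


def restock_policy(obs: dict) -> dict:
    # Placeholder for an LLM call: top stock back up to 20 units.
    return {"type": "restock", "qty": max(0, 20 - obs["stock"])}


def run(env, policy, days: int = 365, actions_per_day: int = 3) -> float:
    for _ in range(days):
        for _ in range(actions_per_day):  # per-day action budget
            env.act(policy(env.observe()))
        env.end_day()
    return env.net_worth()


if __name__ == "__main__":
    print(run(ToyVendingEnv(seed=0), restock_policy))
```

Under the assumed budget of three actions per day, 365 day-loops amount to 1,095 agent steps. Replacing restock_policy with a function that queries a model would leave the loop and the net-worth metric unchanged, which is the kind of separation a standardized interface is meant to provide.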

Tags

LLM Agent Benchmark Planning Execution

arXiv Categories

cs.CL cs.AI