AI Agents Relevance: 9/10

ProxyWar: Dynamic Assessment of LLM Code Generation in Game Arenas

Wenjun Peng, Xinyu Wang, Qi Wu
arXiv: 2602.04296v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

The ProxyWar framework dynamically assesses LLM code-generation quality through competitive game environments, exposing the limitations of conventional evaluation methods.

Key Contributions

  • Proposes ProxyWar, a framework for dynamic evaluation of LLM code generation
  • Reveals discrepancies between static benchmark scores and actual performance in game environments
  • Lays a foundation for LLM-driven algorithm discovery and adaptive problem solving

Methodology

The framework comprises a suite of competitive game environments in which LLM-generated agents undergo iterative testing, automated code repair, and multi-agent tournaments.
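The test-and-repair stage of the pipeline can be sketched as a simple loop. This is a minimal illustration, not ProxyWar's actual implementation; `run_tests` and `repair` are hypothetical stand-ins for the paper's automated test harness and LLM repair call.

```python
def run_tests(code: str) -> list[str]:
    # Hypothetical test harness: return a list of failure messages (empty = pass).
    return [] if "def act" in code else ["missing act() entry point"]

def repair(code: str, failures: list[str]) -> str:
    # Placeholder for an LLM repair call; here we simply patch in the entry point.
    return code + "\ndef act(state):\n    return 'noop'\n"

def iterate_until_valid(code: str, max_rounds: int = 3) -> tuple[str, bool]:
    """Test-and-repair loop: re-test after each repair, up to max_rounds."""
    for _ in range(max_rounds):
        failures = run_tests(code)
        if not failures:
            return code, True
        code = repair(code, failures)
    return code, not run_tests(code)
```

Agents that survive this loop would then be entered into the tournament stage.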

Original Abstract

Large language models (LLMs) have revolutionized automated code generation, yet the evaluation of their real-world effectiveness remains limited by static benchmarks and simplistic metrics. We present ProxyWar, a novel framework that systematically assesses code generation quality by embedding LLM-generated agents within diverse, competitive game environments. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs, combining automated testing, iterative code repair, and multi-agent tournaments to provide a holistic view of program behavior. Applied to a range of state-of-the-art coders and games, our approach uncovers notable discrepancies between benchmark scores and actual performance in dynamic settings, revealing overlooked limitations and opportunities for improvement. These findings highlight the need for richer, competition-based evaluation of code generation. Looking forward, ProxyWar lays a foundation for research into LLM-driven algorithm discovery, adaptive problem solving, and the study of practical efficiency and robustness, including the potential for models to outperform hand-crafted agents. The project is available at https://github.com/xinke-wang/ProxyWar.
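The multi-agent tournaments mentioned in the abstract could, in the simplest case, be a round-robin over all agent pairs. A minimal sketch under that assumption; `play_match` is a hypothetical game runner, and the scoring scheme (+1 win, -1 loss) is illustrative rather than the paper's actual protocol.

```python
from itertools import combinations
from collections import defaultdict

def play_match(agent_a, agent_b) -> int:
    # Hypothetical game runner: 1 if agent_a wins, -1 if agent_b wins, 0 for a draw.
    a, b = agent_a(None), agent_b(None)
    return (a > b) - (a < b)

def round_robin(agents: dict) -> dict:
    """Play every agent against every other; higher score = stronger agent."""
    scores = defaultdict(int)
    for name_a, name_b in combinations(agents, 2):
        result = play_match(agents[name_a], agents[name_b])
        scores[name_a] += result
        scores[name_b] -= result
    return dict(scores)
```

In practice the relative ranking from such tournaments is what ProxyWar compares against static benchmark scores to surface the discrepancies the abstract describes.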

Tags

LLM Code Generation, Evaluation, Game AI

arXiv Categories

cs.SE cs.AI