Multimodal Learning Relevance: 9/10

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue
arXiv: 2602.11103v1 Published: 2026-02-11 Updated: 2026-02-11

AI Summary

GameDevBench is a multimodal benchmark that evaluates how well agents can carry out game development tasks.

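Key Contributions

  • Proposes GameDevBench, a benchmark for evaluating agents on game development tasks.
  • Defines 132 tasks derived from web and video tutorials that require multimodal understanding and complex code implementation.
  • Introduces two image- and video-based feedback mechanisms to improve agents' multimodal capability.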

Methodology

Builds a benchmark of 132 game development tasks and introduces visual feedback mechanisms to improve agent performance.

Original Abstract

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times the amount of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
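
To make the headline percentages concrete, the following minimal sketch converts the reported success rates into approximate task counts. It assumes each of these rates is computed over all 132 tasks, which the abstract does not state explicitly. The helper name `implied_task_count` and the dictionary labels are illustrative only and do not come from the paper.

```python
TOTAL_TASKS = 132  # benchmark size stated in the abstract


def implied_task_count(rate_percent, total=TOTAL_TASKS):
    """Nearest whole number of tasks matching a reported success rate.

    Assumption: the rate was measured over all `total` tasks.
    """
    return round(rate_percent / 100 * total)


# Success rates quoted in the abstract (percent).
reported = {
    "best agent": 54.5,
    "Claude Sonnet 4.5, no feedback": 33.3,
    "Claude Sonnet 4.5, with feedback": 47.7,
}

implied = {name: implied_task_count(rate) for name, rate in reported.items()}

# Absolute gain from feedback, in percentage points and in implied tasks.
gain_points = round(
    reported["Claude Sonnet 4.5, with feedback"]
    - reported["Claude Sonnet 4.5, no feedback"],
    1,
)
gain_tasks = (
    implied["Claude Sonnet 4.5, with feedback"]
    - implied["Claude Sonnet 4.5, no feedback"]
)

for name, count in implied.items():
    print(f"{name}: about {count} of {TOTAL_TASKS} tasks")
print(f"feedback gain: {gain_points} points, about {gain_tasks} tasks")
```

Under that assumption, the 14.4-point improvement corresponds to roughly 19 additional solved tasks. The 46.9% and 31.6% figures are left out because they apply to task categories whose sizes the abstract does not give.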

Tags

game development benchmark multimodal learning AI agents

arXiv Categories

cs.AI cs.CL cs.SE