AI Agents relevance: 9/10

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
arXiv: 2603.03194v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

The paper introduces the BeyondSWE benchmark to test code agents in realistic settings such as cross-repository bug fixing, and investigates whether external knowledge retrieval improves their performance.

Key Contributions

  • Proposes the BeyondSWE benchmark to evaluate code agents in more complex, realistic scenarios
  • Develops the SearchSWE framework to measure the effect of search augmentation
  • Reveals the shortcomings of existing code agents on complex tasks and the challenges of search augmentation

Methodology

Constructs the BeyondSWE benchmark of 500 real-world instances across four settings, and uses the SearchSWE framework, which integrates deep search with coding abilities, to evaluate the role of external knowledge.

Original Abstract

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
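The abstract describes workflows that interleave search and reasoning during coding. A minimal toy sketch of such a loop is below; all names (`search`, `solve`, the toy knowledge base) are illustrative assumptions, not the paper's actual SearchSWE API.

```python
# Hypothetical sketch of a search-augmented coding loop, in the spirit of
# SearchSWE: alternate retrieval and patch-writing until enough external
# knowledge has been gathered. Names and structure are assumptions.

def search(query, knowledge_base):
    """Toy 'deep search': return snippets whose keys appear in the query."""
    return [doc for key, doc in knowledge_base.items() if key in query]

def solve(task, knowledge_base, max_steps=3):
    """Interleave search and 'coding' steps within a fixed step budget."""
    context = []
    for _ in range(max_steps):
        hits = search(task, knowledge_base)
        context.extend(h for h in hits if h not in context)
        if context:  # enough knowledge gathered: emit a (toy) patch
            return {"task": task, "patch": " + ".join(context)}
    return {"task": task, "patch": None}  # budget exhausted, no fix

kb = {"migration": "bump dependency to v2 API",
      "cross-repo": "trace the call into library B"}
result = solve("dependency migration issue", kb)
```

The paper's finding that search augmentation "yields inconsistent gains" corresponds to the failure mode where retrieval returns nothing useful (or something misleading) and the loop exhausts its budget.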

Tags

Code Agents Benchmarks External Knowledge Software Engineering

arXiv Categories

cs.CL cs.SE