BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?
AI Summary
The paper introduces the BeyondSWE benchmark to test code agents in realistic scenarios beyond single-repository bug fixing, such as cross-repository repair, and explores how much external knowledge retrieval improves performance.
Key Contributions
- Proposed the BeyondSWE benchmark to evaluate code agents in more complex, real-world settings
- Developed the SearchSWE framework to assess the effect of search augmentation
- Revealed the shortcomings of current code agents on complex tasks and the challenges of search augmentation
Methodology
The authors construct the BeyondSWE benchmark from 500 real-world instances and use the SearchSWE framework, which integrates deep search into the coding loop, to evaluate the role of external knowledge.
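To make the search-augmented setup concrete, here is a minimal, hypothetical sketch of how a framework like SearchSWE might interleave retrieval with patch generation. All function names and interfaces here are assumptions for illustration, not the paper's actual API; the retriever and coder are stubs standing in for a deep-search backend and an LLM.

```python
# Hypothetical sketch of a search-augmented repair loop (interfaces are
# assumptions, not SearchSWE's real API).

def search(query: str) -> list[str]:
    # Stub retriever: a real agent would query a deep-search backend here.
    return [f"retrieved doc about: {query}"]

def generate_patch(issue: str, context: list[str]) -> str:
    # Stub coder: a real agent would prompt an LLM with the issue plus
    # whatever external context was retrieved.
    return f"patch for '{issue}' using {len(context)} context doc(s)"

def solve(issue: str, use_search: bool = True) -> str:
    # Interleave search and coding: retrieve first, then generate a patch
    # conditioned on the retrieved context (empty when search is disabled,
    # mirroring the benchmark's search-off baseline).
    context = search(issue) if use_search else []
    return generate_patch(issue, context)

print(solve("cross-repository dependency migration"))
print(solve("cross-repository dependency migration", use_search=False))
```

Comparing the two calls (search on vs. off) mirrors the paper's finding that search augmentation yields inconsistent gains: the same issue is solved with and without retrieved context.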
Original Abstract
Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.