AI Agents Relevance: 9/10

SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, Lijie Hu
arXiv: 2603.15401v1 Published: 2026-03-16 Updated: 2026-03-16

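AI Summary

Evaluates how effective Agent Skills are on real-world software engineering tasks and finds that their benefits are limited and depend on domain and context.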

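Key Contributions

  • Proposes SWE-Skills-Bench, a benchmark for evaluating the role of Agent Skills in software engineering.
  • Builds a deterministic verification framework for evaluating the effect of skills on code generation.
  • Experiments show that most skills yield little or no performance gain (39 of 49 show zero pass-rate improvement) and that some even degrade performance.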

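Methodology

Builds a benchmark that applies Agent Skills to real GitHub repositories and uses deterministic tests to verify the effect of the skills on code quality.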
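The paired comparison described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` record, the function names, and the sample outcomes are all hypothetical, and the only idea taken from the paper is that the same task instances are scored by execution-based tests with and without the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one task instance under one condition (hypothetical schema)."""
    task_id: str
    passed: bool  # True if every acceptance test for the task passed


def pass_rate(results):
    """Fraction of task instances whose acceptance tests all passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)


def paired_gain(without_skill, with_skill):
    """Pass-rate difference (with skill minus without) over the same task set."""
    base_ids = {r.task_id for r in without_skill}
    skill_ids = {r.task_id for r in with_skill}
    if base_ids != skill_ids:
        raise ValueError("paired evaluation needs identical task sets")
    return pass_rate(with_skill) - pass_rate(without_skill)


# Made-up outcomes for ten task instances of one skill (not data from the paper).
baseline = [TaskResult(f"task-{i}", i < 6) for i in range(10)]  # 6 of 10 pass
skilled = [TaskResult(f"task-{i}", i < 9) for i in range(10)]   # 9 of 10 pass

print(f"{paired_gain(baseline, skilled):+.1%}")  # prints "+30.0%"
```

Requiring identical task sets in both conditions is what makes the difference attributable to the skill rather than to a change in task mix.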

Original Abstract

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
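The aggregate figures in the abstract (skills with zero improvement, average gain, token overhead) are simple summaries of per-skill paired results. The sketch below shows one plausible way to compute them; the function names, the skill names, and the numbers are hypothetical and are not taken from the paper's data or code.

```python
def token_overhead_pct(tokens_without, tokens_with):
    """Relative change in token usage when the skill is injected, in percent."""
    if tokens_without <= 0:
        raise ValueError("baseline token count must be positive")
    return 100.0 * (tokens_with - tokens_without) / tokens_without


def summarize_gains(gains):
    """Group skills by the sign of their pass-rate gain and average the gains.

    `gains` maps a skill name to its gain in percentage points.
    """
    if not gains:
        raise ValueError("no skills to summarize")
    return {
        "improved": sorted(s for s, g in gains.items() if g > 0),
        "unchanged": sorted(s for s, g in gains.items() if g == 0),
        "degraded": sorted(s for s, g in gains.items() if g < 0),
        "mean_gain": sum(gains.values()) / len(gains),
    }


# Illustrative values only: four invented skills, gains in percentage points.
example_gains = {"skill-a": 30.0, "skill-b": 0.0, "skill-c": 0.0, "skill-d": -10.0}
summary = summarize_gains(example_gains)
print(summary["unchanged"])  # prints "['skill-b', 'skill-c']"
print(summary["mean_gain"])  # prints "5.0"

# A run that used 1000 tokens without the skill and 5510 with it.
print(token_overhead_pct(1000, 5510))  # prints "451.0"
```

Reporting token overhead alongside the gain matters because, as the abstract notes, a skill can raise cost substantially while leaving the pass rate unchanged.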

Tags

Agent Skills, Software Engineering, Benchmark, Evaluation, LLM

arXiv Categories

cs.SE cs.AI