AI Agents Relevance: 9/10

ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

Ayush Nangia, Shikhar Mishra, Aman Gokrani, Paras Chopra
arXiv: 2602.19594v1 Published: 2026-02-23 Updated: 2026-02-23

AI Summary

ISO-Bench evaluates coding agents' ability to optimize real-world inference workloads, combining hard and soft evaluation metrics.

Key Contributions

  • Proposes ISO-Bench, a benchmark for evaluating coding agents on real-world inference optimization tasks
  • Combines hard (execution-based) and soft (LLM-based) metrics for comprehensive evaluation
  • Finds that agent performance varies across codebases, and that agent scaffolding is critical

Methodology

On real inference optimization tasks, the quality of each agent-generated patch is assessed against expert human solutions, combining execution-based and LLM-based evaluation.

Original Abstract

We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference optimization tasks. These tasks were taken from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an agent with a codebase and a bottleneck description; the agent must produce an optimization patch that is evaluated against expert human solutions. We curated 54 tasks from merged pull requests with measurable performance improvements. While existing benchmarks rely heavily on runtime-based metrics, such approaches can be gamed to pass tests without capturing the actual intent of the code changes. Therefore, we combine both hard (execution-based) and soft (LLM-based) metrics and show that both are necessary for complete evaluation. Evaluating both closed- and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly, agents often identify the correct bottlenecks but fail to execute working solutions. We also show that agents with identical underlying models differ substantially, suggesting scaffolding is as important as the model.
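To make the hard/soft metric combination concrete, here is a minimal sketch of how an execution-based signal and an LLM-judge signal could be blended into one patch score. The class, function names, weighting, and scoring formula are all illustrative assumptions for this digest, not ISO-Bench's actual API or scoring rule.

```python
# Hypothetical sketch of combined hard/soft patch scoring; names and
# weights are assumptions, not the benchmark's real implementation.
from dataclasses import dataclass

@dataclass
class PatchEvaluation:
    passed_tests: bool   # hard signal: patched code still runs correctly
    speedup: float       # hard signal: measured speedup (1.0 = no change)
    intent_score: float  # soft signal: LLM judge's 0-1 rating of whether
                         # the patch addresses the described bottleneck

def combined_score(ev: PatchEvaluation, soft_weight: float = 0.5) -> float:
    """Blend execution-based and LLM-based signals into a score in [0, 1].

    A patch that fails execution scores 0 no matter how plausible it looks,
    so a gamed patch cannot win on the soft metric alone.
    """
    if not ev.passed_tests:
        return 0.0
    # Normalize speedup to [0, 1], capping at 1.0 (i.e., 2x or better).
    hard = min(max(ev.speedup - 1.0, 0.0), 1.0)
    return (1.0 - soft_weight) * hard + soft_weight * ev.intent_score

# Example: a correct patch with a 1.5x speedup and high intent agreement.
ev = PatchEvaluation(passed_tests=True, speedup=1.5, intent_score=0.9)
print(round(combined_score(ev), 2))  # 0.7
```

Gating the soft score behind the hard pass/fail check reflects the abstract's point that runtime metrics alone can be gamed, while LLM-based judging alone cannot verify correctness.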

Tags

benchmark · coding agent · inference optimization · LLM serving · evaluation metrics

arXiv Categories

cs.LG