AI Agents 相关度: 9/10

ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz

arXiv: 2603.29399v1 发布: 2026-03-31 更新: 2026-03-31

下载 PDF arXiv 页面

AI 摘要

发现ELT-Bench基准测试质量问题，低估了AI Agent在ELT流水线构建中的能力。

主要贡献

揭示了ELT-Bench基准测试的质量问题
提出了Auditor-Corrector方法用于基准测试质量审计
构建了ELT-Bench-Verified，一个改进后的基准测试

方法论

使用Auditor-Corrector方法，结合LLM驱动的根因分析和人工验证，审计并修正ELT-Bench的错误。

原文摘要

Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.

arXiv 分类

cs.AI cs.DB

AI 摘要

主要贡献

方法论

原文摘要

标签

arXiv 分类