AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
AI Summary
AgentDS benchmarks the future of human-AI collaboration in domain-specific data science; the results show that human-AI collaboration outperforms AI-only approaches.
Main Contributions
- Proposed the AgentDS benchmark for evaluating the performance of AI agents and human-AI collaboration on domain-specific data science tasks.
- Constructed a dataset of 17 challenges spanning six industries.
- Systematically compared human-AI collaborative approaches against AI-only baselines through an open competition.
Methodology
Design a benchmark comprising cross-domain data science tasks, organize a competition, and compare the performance of human-AI collaboration against AI-only baselines.
Original Abstract
Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate the performance of both AI agents and human-AI collaboration in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website at https://agentds.org/ and the open-source datasets at https://huggingface.co/datasets/lainmn/AgentDS.
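For readers who want to explore the released data, a minimal sketch of loading it with the Hugging Face `datasets` library is shown below. The repository ID comes from the abstract; whether the repo loads without a configuration name, and how its splits are organized, are assumptions that may differ from the actual release.

```python
# Minimal sketch: load the open-source AgentDS datasets from Hugging Face.
# The repository ID ("lainmn/AgentDS") is taken from the abstract; the default
# configuration and split layout are assumptions about the release format.
from datasets import load_dataset

ds = load_dataset("lainmn/AgentDS")  # may require a config name, e.g. a specific challenge
print(ds)  # inspect the available splits and features
```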