AI Agents 相关度: 9/10

From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

Shuoling Liu, Zhiquan Tan, Kun Yi, Hui Wu, Yihan Li, Jiangpeng Yan, Liyuan Chen, Kai Chen, Qiang Yang
arXiv: 2603.25342v1 发布: 2026-03-26 更新: 2026-03-26

AI 摘要

论文提出了一种基于范畴论的深度研究智能体结构化评估方法,并构建了新的基准测试。

主要贡献

  • 提出基于范畴论的DRA行为建模方法
  • 构建了一个新的机制感知基准测试,包含296个问题
  • 揭示了现有DRA在结构化信息合成方面的不足

方法论

利用范畴论将深度研究工作流建模为结构保持映射的组合,并据此设计测试用例。

原文摘要

Although deep research agents (DRAs) have emerged as a promising paradigm for complex information synthesis, their evaluation remains constrained by ad hoc empirical benchmarks. These heuristic approaches do not rigorously model agent behavior or adequately stress-test long-horizon synthesis and ambiguity resolution. To bridge this gap, we formalize DRA behavior through the lens of category theory, modeling deep research workflow as a composition of structure-preserving maps (functors). Grounded in this theoretical framework, we introduce a novel mechanism-aware benchmark with 296 questions designed to stress-test agents along four interpretable axes: traversing sequential connectivity chains, verifying intersections within V-structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. Our rigorous evaluation of 11 leading models establishes a persistently low baseline, with the state-of-the-art achieving only a 19.9\% average accuracy, exposing the difficulty of formal structural stress-testing. Furthermore, our findings reveal a stark dichotomy in the current AI capabilities. While advanced deep research pipelines successfully redefine dynamic topological re-ordering and exhibit robust ontological verification -- matching pure reasoning models in falsifying hallucinated premises -- they almost universally collapse on multi-hop structural synthesis. Crucially, massive performance variance across tasks exposes a lingering reliance on brittle heuristics rather than a systemic understanding. Ultimately, this work demonstrates that while top-tier autonomous agents can now organically unify search and reasoning, achieving a generalized mastery over complex structural information remains a formidable open challenge.\footnote{Our implementation will be available at https://github.com/tzq1999/CDR.

标签

深度研究智能体 范畴论 基准测试 结构化评估 知识推理

arXiv 分类

cs.LG