AI Agents Relevance: 9/10

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu
arXiv: 2604.01221v1 Published: 2026-04-01 Updated: 2026-04-01

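AI Summary

HippoCamp is a benchmark that evaluates agents' multimodal file management capabilities in personal computer environments, and it exposes the shortcomings of current agents.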

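Key Contributions

  • Proposes the HippoCamp benchmark, which evaluates agents' file management capabilities in personal computer environments
  • Builds a large-scale dataset of files spanning diverse modalities to evaluate agents' search, understanding, and reasoning capabilities
  • Identifies multimodal perception and evidence grounding as the primary bottlenecks through detailed failure diagnosis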

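Methodology

The authors build a benchmark over real-world user file systems, with QA pairs and structured trajectories, and use it to evaluate MLLMs and agents on search, understanding, and reasoning tasks.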
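
To make the evaluation setup concrete, here is a minimal sketch of how QA-level accuracy and step-wise failure diagnosis could be computed over annotated trajectories. It is not the authors' code or data format: the `Step` and `QAResult` classes, the stage labels ("search", "perception", "reasoning"), and the first-failing-step rule are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One annotated trajectory step (hypothetical schema)."""
    stage: str      # assumed labels: "search", "perception", "reasoning"
    correct: bool   # whether the agent handled this step correctly


@dataclass
class QAResult:
    """An agent's outcome on one QA pair (hypothetical schema)."""
    question_id: str
    answer_correct: bool
    steps: List[Step] = field(default_factory=list)


def accuracy(results: List[QAResult]) -> float:
    """Fraction of QA pairs answered correctly; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.answer_correct for r in results) / len(results)


def first_failures(results: List[QAResult]) -> Counter:
    """Count, per stage, how often that stage is the first failing step
    among incorrectly answered QA pairs (an assumed diagnosis rule)."""
    counts: Counter = Counter()
    for r in results:
        if r.answer_correct:
            continue
        for step in r.steps:
            if not step.correct:
                counts[step.stage] += 1
                break  # attribute the failure to the earliest bad step only
    return counts
```

Attributing each wrong answer to its earliest failing step is only one possible design choice. It keeps a single upstream error from being counted again at every downstream stage, which is the kind of breakdown a bottleneck analysis needs.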

Original Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

Tags

Agent Multimodal Benchmark File Management User-centric

arXiv Categories

cs.AI cs.CV