AI Agents relevance: 8/10

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou
arXiv: 2602.03619v1 Published: 2026-02-03 Updated: 2026-02-03

AI Summary

Proposes a human-preference-based method for generating query-specific evaluation rubrics, aimed at improving the quality of DeepResearch report generation.

Key Contributions

  • Constructs a dataset of DeepResearch-style queries annotated with human preferences
  • Trains rubric generators via reinforcement learning with a hybrid reward
  • Introduces a Multi-agent Markov-state (MaMs) workflow to improve report generation
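To make the rubric-based evaluation concrete, here is a minimal sketch of how a query-specific rubric might score a pair of reports. The data structure, criteria, weights, and scores are all hypothetical illustrations, not the paper's actual format:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    # One query-specific evaluation criterion with a relative weight.
    description: str
    weight: float

def rubric_score(criterion_scores, rubric):
    """Weighted aggregate of per-criterion scores (each in [0, 1])."""
    total_weight = sum(c.weight for c in rubric)
    return sum(s * c.weight for s, c in zip(criterion_scores, rubric)) / total_weight

# Hypothetical rubric generated for one DeepResearch-style query.
rubric = [
    RubricCriterion("Covers recent benchmark results", 2.0),
    RubricCriterion("Cites primary sources", 1.0),
    RubricCriterion("Includes a quantitative comparison", 1.0),
]

score_a = rubric_score([0.9, 0.5, 0.25], rubric)  # report A
score_b = rubric_score([0.5, 0.5, 0.5], rubric)   # report B
prefer_a = score_a > score_b                       # rubric-induced preference
```

The point of a query-specific rubric is that the criteria and weights differ per query, so the induced pairwise preferences can be checked against human annotations.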

Methodology

Trains the rubric generator via reinforcement learning, combining human preference supervision with LLM-based rubric evaluation, and adopts the MaMs workflow to improve long-horizon reasoning.
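The hybrid reward described above can be sketched as a weighted mix of two signals: how often the rubric-induced winner matches the human-annotated preference, and an LLM judge's score of rubric quality. The function names, the mixing weight `alpha`, and all numbers below are illustrative assumptions, not the paper's actual reward definition:

```python
def preference_agreement(rubric_choices, human_choices):
    # Fraction of report pairs where the rubric-induced winner
    # matches the human-annotated preference label.
    matches = sum(r == h for r, h in zip(rubric_choices, human_choices))
    return matches / len(human_choices)

def hybrid_reward(pref_agreement, llm_rubric_quality, alpha=0.5):
    """Hybrid reward mixing (i) human preference supervision and
    (ii) an LLM-based evaluation of the generated rubric."""
    return alpha * pref_agreement + (1 - alpha) * llm_rubric_quality

# Toy example over four annotated report pairs.
agreement = preference_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])
reward = hybrid_reward(agreement, llm_rubric_quality=0.8, alpha=0.6)
```

A reward like this would then drive the RL update of the rubric generator, rewarding rubrics that both align with human preferences and look well-formed to an LLM judge.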

Original Abstract

Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.

Tags

DeepResearch reports · rubric generation · reinforcement learning · human preferences · multi-agent

arXiv Category

cs.CL