Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
AI Summary
This work proposes a method for generating query-specific evaluation rubrics aligned with human preferences, aimed at improving the quality of DeepResearch report generation.
Key Contributions
- Constructed a dataset of DeepResearch-style queries annotated with human preferences
- Proposed training rubric generators via reinforcement learning with a hybrid reward
- Introduced a Multi-agent Markov-state (MaMs) workflow to improve report generation
Methodology
Rubric generators are trained via reinforcement learning, combining human preference supervision with LLM-based rubric evaluation, and a MaMs workflow is adopted to better handle long-horizon reasoning during report generation.
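The hybrid reward described above can be sketched as follows. This is a minimal illustrative example, not the paper's actual implementation: the function name, the mixing weight `alpha`, and the agreement-based preference reward are all assumptions made for illustration.

```python
def hybrid_reward(scores_a, scores_b, human_prefers_a, llm_rubric_quality, alpha=0.5):
    """Illustrative hybrid reward for a generated rubric (hypothetical sketch).

    scores_a / scores_b: per-criterion scores the rubric assigns to reports A and B.
    human_prefers_a: whether the human annotator preferred report A.
    llm_rubric_quality: an LLM-judged rubric-quality score in [0, 1].
    """
    # Preference-agreement term: reward 1 if the rubric ranks the paired
    # reports the same way the human annotator did, else 0.
    rubric_prefers_a = sum(scores_a) > sum(scores_b)
    pref_reward = 1.0 if rubric_prefers_a == human_prefers_a else 0.0

    # Blend human-preference supervision with the LLM-based rubric evaluation.
    return alpha * pref_reward + (1 - alpha) * llm_rubric_quality
```

For example, a rubric that scores report A higher when the human also preferred A, with an LLM quality score of 0.6, would receive 0.5 * 1.0 + 0.5 * 0.6 = 0.8 under the default weighting.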
Original Abstract
Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.