Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of fine-grained, scalable, and human-preference-aligned evaluation metrics for DeepResearch report generation. The authors first construct a dataset of query-report pairs annotated with human preferences, then use reinforcement learning to train a query-specific rubric generator. This generator is integrated into a Multi-agent Markov-state (MaMs) workflow to jointly optimize long-horizon reasoning and evaluation. The method is the first to automatically learn query-aware rubrics directly from human preferences, substantially improving the discriminability, granularity, and scalability of evaluation. When incorporated into a DeepResearch system, it outperforms all open-source baselines on DeepResearch Bench and performs comparably to state-of-the-art closed-source models.
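The summary describes a hybrid reward for the rubric generator: the rubric should rank the human-preferred report above the rejected one, and the rubric itself is rated by an LLM judge. A minimal sketch of that shape, under stated assumptions (the names `RubricItem`, `score_report`, `hybrid_reward`, and the injected `grade` function are all hypothetical, not the authors' code; in practice `grade` and the rubric-quality score would come from LLM calls):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str   # e.g. "covers at least three primary sources"
    weight: float    # relative importance; weights sum to 1 across items

def score_report(report: str, rubric: List[RubricItem],
                 grade: Callable[[str, str], float]) -> float:
    """Weighted rubric score; `grade` returns a 0-1 score per criterion
    (an LLM call in practice; an injected function here)."""
    return sum(item.weight * grade(report, item.criterion) for item in rubric)

def hybrid_reward(rubric: List[RubricItem],
                  preferred: str, rejected: str,
                  grade: Callable[[str, str], float],
                  rubric_quality: float,
                  alpha: float = 0.5) -> float:
    """Reward for the rubric generator: 1.0 when the rubric ranks the
    human-preferred report higher, blended with an LLM-judged quality
    score for the rubric itself (both terms in [0, 1])."""
    agrees = float(score_report(preferred, rubric, grade)
                   > score_report(rejected, rubric, grade))
    return alpha * agrees + (1.0 - alpha) * rubric_quality
```

The preference-agreement term directly supervises discriminability (a rubric is only rewarded if it separates the paired reports the way humans did), while the quality term guards against degenerate rubrics that happen to agree.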

📝 Abstract
Training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals, so rubric-based evaluation has become common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned, query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, then train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our rubric generators deliver more discriminative and more human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on DeepResearch Bench and achieve performance comparable to leading closed-source models.
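The abstract names a "Multi-agent Markov-state (MaMs)" workflow for long-horizon report generation. The paper's own design is not detailed here, but the Markov-state idea can be sketched as a pipeline in which each agent conditions only on a compact current state rather than the full interaction history (all names below, `run_mams` included, are hypothetical placeholders):

```python
from typing import Callable, Dict, List

# A compact state carried between agents, e.g. query, working notes, draft.
State = Dict[str, str]

def run_mams(query: str, agents: List[Callable[[State], State]]) -> str:
    """Run a chain of agents (e.g. planner -> searcher -> writer), each
    mapping the current state to the next; the Markov property is that
    no agent sees anything beyond this state."""
    state: State = {"query": query, "notes": "", "draft": ""}
    for agent in agents:
        state = agent(state)  # next state depends only on the current state
    return state["draft"]
```

Keeping the state compact is what makes long-horizon reasoning tractable: each step stays within context limits regardless of how many steps precede it.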
Problem

Research questions and friction points this paper is trying to address.

DeepResearch
rubric-based evaluation
human preferences
report generation
reward signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

query-specific rubrics
human preference alignment
reinforcement learning
Multi-agent Markov-state (MaMs)
DeepResearch report generation
Changze Lv
Pattern Recognition Center, WeChat AI, Tencent Inc.; College of Computer Science and Artificial Intelligence, Fudan University
Jie Zhou
Tencent WeChat AI
Wentao Zhao
Pattern Recognition Center, WeChat AI, Tencent Inc.
Jingwen Xu
College of Computer Science and Artificial Intelligence, Fudan University
Zisu Huang
College of Computer Science and Artificial Intelligence, Fudan University
Muzhao Tian
College of Computer Science and Artificial Intelligence, Fudan University
Shihan Dou
Fudan University
Tao Gui
College of Computer Science and Artificial Intelligence, Fudan University
Le Tian
University of Antwerpen - imec
Xiao Zhou
M.Phil student in HKUST
Xiaoqing Zheng
Fudan University
Xuanjing Huang
College of Computer Science and Artificial Intelligence, Fudan University