Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of fine-grained, scalable, and human-preference-aligned evaluation metrics for DeepResearch report generation. The authors first construct a dataset of query-report pairs annotated with human preferences, then use reinforcement learning to train a query-specific rubric generator. This generator is integrated into a Multi-agent Markov-state (MaMs) workflow to jointly optimize long-horizon reasoning and evaluation. The method is the first to automatically learn query-aware rubrics directly from human preferences, substantially improving the discriminability, granularity, and scalability of evaluation. When incorporated into a DeepResearch system, it outperforms all open-source baselines on DeepResearch Bench and performs comparably to state-of-the-art closed-source models.
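The summary describes a hybrid reward for the rubric generator: the rubric should rank the human-preferred report above the rejected one, and the rubric itself is rated by an LLM judge. A minimal sketch of that shape, under stated assumptions (the names `RubricItem`, `score_report`, `hybrid_reward`, and the injected `grade` function are all hypothetical, not the authors' code; in practice `grade` and the rubric-quality score would come from LLM calls):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str   # e.g. "covers at least three primary sources"
    weight: float    # relative importance; weights sum to 1 across items

def score_report(report: str, rubric: List[RubricItem],
                 grade: Callable[[str, str], float]) -> float:
    """Weighted rubric score; `grade` returns a 0-1 score per criterion
    (an LLM call in practice; an injected function here)."""
    return sum(item.weight * grade(report, item.criterion) for item in rubric)

def hybrid_reward(rubric: List[RubricItem],
                  preferred: str, rejected: str,
                  grade: Callable[[str, str], float],
                  rubric_quality: float,
                  alpha: float = 0.5) -> float:
    """Reward for the rubric generator: 1.0 when the rubric ranks the
    human-preferred report higher, blended with an LLM-judged quality
    score for the rubric itself (both terms in [0, 1])."""
    agrees = float(score_report(preferred, rubric, grade)
                   > score_report(rejected, rubric, grade))
    return alpha * agrees + (1.0 - alpha) * rubric_quality
```

The preference-agreement term directly supervises discriminability (a rubric is only rewarded if it separates the paired reports the way humans did), while the quality term guards against degenerate rubrics that happen to agree.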

📝 Abstract
Training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals, so rubric-based evaluation has become common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned, query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, then train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our rubric generators deliver more discriminative and more human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on DeepResearch Bench and achieve performance comparable to leading closed-source models.
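The abstract names a "Multi-agent Markov-state (MaMs)" workflow for long-horizon report generation. The paper's own design is not detailed here, but the Markov-state idea can be sketched as a pipeline in which each agent conditions only on a compact current state rather than the full interaction history (all names below, `run_mams` included, are hypothetical placeholders):

```python
from typing import Callable, Dict, List

# A compact state carried between agents, e.g. query, working notes, draft.
State = Dict[str, str]

def run_mams(query: str, agents: List[Callable[[State], State]]) -> str:
    """Run a chain of agents (e.g. planner -> searcher -> writer), each
    mapping the current state to the next; the Markov property is that
    no agent sees anything beyond this state."""
    state: State = {"query": query, "notes": "", "draft": ""}
    for agent in agents:
        state = agent(state)  # next state depends only on the current state
    return state["draft"]
```

Keeping the state compact is what makes long-horizon reasoning tractable: each step stays within context limits regardless of how many steps precede it.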
Problem

Research questions and friction points this paper is trying to address.

DeepResearch
rubric-based evaluation
human preferences
report generation
reward signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

query-specific rubrics
human preference alignment
reinforcement learning
Multi-agent Markov-state (MaMs)
DeepResearch report generation
Changze Lv
Pattern Recognition Center, WeChat AI, Tencent Inc.; College of Computer Science and Artificial Intelligence, Fudan University
Jie Zhou
Tencent WeChat AI
Wentao Zhao
Pattern Recognition Center, WeChat AI, Tencent Inc.
Jingwen Xu
College of Computer Science and Artificial Intelligence, Fudan University
Zisu Huang
College of Computer Science and Artificial Intelligence, Fudan University
Muzhao Tian
College of Computer Science and Artificial Intelligence, Fudan University
Shihan Dou
Fudan University
Tao Gui
College of Computer Science and Artificial Intelligence, Fudan University
Le Tian
University of Antwerpen - imec
Xiao Zhou
M.Phil student in HKUST
Xiaoqing Zheng
Fudan University
Xuanjing Huang
College of Computer Science and Artificial Intelligence, Fudan University