ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

📅 2025-06-02
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing long-form generation benchmarks lack domain-specific expert requirements and interpretable evaluation protocols. Method: The paper introduces ExpertLongBench, an expert workflow-oriented benchmark of 11 tasks spanning nine specialized domains, where outputs can exceed 5,000 tokens and must adhere to strict domain-specific requirements, and proposes CLEAR, a structured, low-cost evaluation framework. CLEAR uses expert-designed or expert-validated rubrics to extract traceable, fine-grained checklists from both model outputs and references, compares corresponding checklist items to judge correctness, and can rely on lightweight open-weight models for extraction and comparison. Contribution/Results: Benchmarking 11 mainstream LLMs yields a best F1 score of only 26.8%, underscoring task difficulty, while CLEAR offers a strong trade-off between evaluation accuracy and computational cost. Code and data are publicly released.
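For readers who want the reported F1 made concrete, the sketch below computes item-level precision, recall, and F1 from binary checklist judgments. The exact aggregation used in the paper is not spelled out in this summary, so the function name, inputs, and per-example scoring here are assumptions for illustration, not the authors' implementation.

```python
def checklist_f1(output_item_correct, reference_item_covered):
    """Item-level F1 for one example (assumed aggregation).

    output_item_correct   : one bool per checklist item extracted from the
                            model output (True = judged correct).
    reference_item_covered: one bool per checklist item extracted from the
                            reference (True = correctly covered by the output).
    """
    precision = sum(output_item_correct) / len(output_item_correct) if output_item_correct else 0.0
    recall = sum(reference_item_covered) / len(reference_item_covered) if reference_item_covered else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: 3 of 5 output items judged correct, 3 of 6 reference items covered.
print(checklist_f1([True, True, True, False, False],
                   [True, True, True, False, False, False]))  # ≈ 0.545
```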

📝 Abstract
This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage.
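As a rough illustration of the CLEAR protocol described in the abstract, the sketch below extracts rubric-aligned checklists from a model output and a reference, then compares corresponding items. The `Judge` callable, the prompt wording, and the JSON response format are hypothetical placeholders (any capable LLM, including the open-weight models the paper mentions, could fill this role); the returned booleans feed the `checklist_f1` sketch shown earlier.

```python
import json
from typing import Callable, Dict, List, Tuple

Judge = Callable[[str], str]  # placeholder for any LLM call (e.g., an open-weight model)

def extract_checklist(text: str, rubric_items: List[str], judge: Judge) -> Dict[str, str]:
    """Fill each rubric item with the corresponding content found in `text`
    ('' if absent). Assumes the judge returns valid JSON; prompt is illustrative."""
    prompt = (
        "For each rubric item below, extract the relevant content from the document.\n"
        "Return a JSON object mapping each item to the extracted content ('' if missing).\n\n"
        f"Rubric items: {json.dumps(rubric_items)}\n\nDocument:\n{text}"
    )
    return json.loads(judge(prompt))

def items_match(candidate: str, reference: str, judge: Judge) -> bool:
    """Ask the judge whether the candidate content conveys the same
    information as the reference content for one checklist item."""
    prompt = (
        "Does the candidate convey the same information as the reference? Answer yes or no.\n\n"
        f"Reference: {reference}\nCandidate: {candidate}"
    )
    return judge(prompt).strip().lower().startswith("yes")

def judge_example(model_output: str, reference_output: str,
                  rubric_items: List[str], judge: Judge) -> Tuple[List[bool], List[bool]]:
    """Return (output_item_correct, reference_item_covered), the inputs
    expected by the checklist_f1 sketch shown earlier."""
    out_items = extract_checklist(model_output, rubric_items, judge)
    ref_items = extract_checklist(reference_output, rubric_items, judge)
    # Precision side: checklist items the model output actually addressed.
    output_item_correct = [items_match(content, ref_items.get(item, ""), judge)
                           for item, content in out_items.items() if content]
    # Recall side: checklist items present in the reference.
    reference_item_covered = [items_match(out_items.get(item, ""), content, judge)
                              for item, content in ref_items.items() if content]
    return output_item_correct, reference_item_covered
```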
Problem

Research questions and friction points this paper is trying to address.

Evaluating expert-level long-form generation tasks with structured checklists
Assessing LLMs' adherence to domain-specific requirements in outputs
Developing a scalable framework for accurate evaluation of long-form outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

ExpertLongBench benchmarks expert-level long-form tasks
CLEAR framework enables fine-grained checklist-based evaluation
Open-weight models achieve accurate checklist extraction and comparison
🔎 Similar Papers
No similar papers found.
👥 Authors
Jie Ruan
Computer Science and Engineering, University of Michigan
Inderjeet Nair
University of Michigan
Natural language processing, Natural language understanding
Shuyang Cao
University of Michigan
Computational Linguistics
Amy Liu
University of Michigan
Natural Language Processing, Artificial Intelligence, Machine Learning
Sheza Munir
Graduate Student, University of Michigan
Machine Learning, NLP, Misinformation, Factuality
Micah Pollens-Dempsey
University of Michigan Law School
Tiffany Chiang
University of Michigan Law School
Lucy Kates
University of Michigan Law School
Nicholas David
Materials Science & Engineering, University of Michigan
Sihan Chen
Department of Chemistry, Carnegie Mellon University
Ruxin Yang
Biomedical Engineering, University of Michigan
Yuqian Yang
Biomedical Engineering, University of Michigan
Jasmine Gump
University of Michigan Law School
Tessa Bialek
University of Michigan Law School
Vivek Sankaran
University of Michigan Law School
Margo Schlanger
University of Michigan Law School
Lu Wang
Computer Science and Engineering, University of Michigan