ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

📅 2025-10-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing LLM evaluations predominantly focus on narrow, easily verified tasks such as mathematical reasoning and code generation, and fail to assess the capabilities required in realistic professional settings: domain-specific document understanding, cross-source information integration, and structured report generation. To address this gap, the authors introduce ProfBench, a rigorous benchmark of over 7,000 expert-constructed response-criterion pairs spanning physics, chemistry, finance, and management consulting, accompanied by a human-informed, multi-dimensional scoring framework. They further build LLM-Judges, low-cost and highly discriminative automated evaluators that incorporate human annotations, bias-mitigation strategies, and extended chain-of-thought reasoning to improve fairness and accessibility. Empirical results reveal that even the state-of-the-art GPT-5-high achieves only 65.9% overall performance on ProfBench, underscoring fundamental limitations in professional-domain competence. The benchmark also exposes substantial performance disparities between open-weight and proprietary models and validates the efficacy of extended thinking on complex professional tasks.

๐Ÿ“ Abstract
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on professional tasks requiring expert knowledge verification
Developing affordable automated judges for multi-domain expert assessments
Addressing performance gaps in complex professional document processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ProfBench with expert-evaluated response-criterion pairs
Builds affordable LLM-Judges, reducing evaluation cost by 2-3 orders of magnitude
Mitigates self-enhancement bias for fair and accessible assessment
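The rubric setup above can be sketched in a few lines: each model response is checked against its expert-written criteria, and the overall score is the fraction of criteria the judge marks as satisfied. This is a minimal illustration, not the paper's implementation; the `judge` predicate here is a hypothetical stand-in for the LLM-Judge's per-criterion yes/no call.

```python
# Illustrative sketch of rubric-fraction scoring (assumed, simplified form):
# an LLM-Judge would produce one boolean verdict per response-criterion pair,
# and the benchmark score aggregates those verdicts.

def score_response(criteria_verdicts):
    """Return the fraction of rubric criteria judged satisfied (0.0 if empty)."""
    if not criteria_verdicts:
        return 0.0
    return sum(criteria_verdicts) / len(criteria_verdicts)

# Hypothetical rubric for one report: the judge satisfied 2 of 3 criteria.
verdicts = [True, True, False]
print(round(score_response(verdicts), 3))  # 0.667
```

In practice the per-criterion verdict, not the aggregation, is the hard part; the paper's contribution is making that judging step cheap and robust to self-enhancement bias.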