PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

📅 2025-11-14
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Existing academic benchmarks inadequately evaluate large language models' open-ended reasoning in high-stakes professional domains such as law and finance. To address this, we introduce PRBench, a benchmark for professional reasoning in finance and law comprising 1,100 expert-authored real-world tasks and 19,356 fine-grained, rubric-based evaluation criteria spanning 114 countries and 47 US jurisdictions; to the authors' knowledge, it is the largest public rubric-based benchmark for both domains. Its key contributions include: (i) a publicly released, large-scale, fine-grained evaluation framework grounded in expert-designed rubrics; and (ii) rigorous quality assurance via collaborative task design, independent expert validation, and human-annotated capability categorization. Comprehensive evaluation of 20 state-of-the-art models reveals that even the best-performing models achieve only 0.39 (finance) and 0.37 (legal) on the high-difficulty subsets, exposing systematic failure modes, including inaccurate judgments, incomplete reasoning, and a lack of process transparency, and highlighting a critical capability gap for real-world deployment.

📝 Abstract
Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.
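The reported scores (e.g., 0.39 on the Finance Hard subset) are rubric-based, but the summary above does not spell out the aggregation. Below is a minimal sketch of one plausible scheme, assuming each task is scored as the (optionally weighted) fraction of its expert criteria a grader marks as satisfied, with task scores averaged over the benchmark or a subset; the function names, the uniform-weight default, and the averaging are illustrative assumptions, not the paper's confirmed implementation.

```python
from statistics import mean

def task_score(criteria_results, weights=None):
    """Fraction of rubric criteria satisfied for one task.

    criteria_results: list of bools, one grader verdict per criterion.
    weights: optional per-criterion weights; uniform if omitted.
    """
    if weights is None:
        weights = [1.0] * len(criteria_results)
    total = sum(weights)
    earned = sum(w for ok, w in zip(criteria_results, weights) if ok)
    return earned / total

def benchmark_score(per_task_results):
    """Mean task score over the benchmark (or a Hard subset)."""
    return mean(task_score(r) for r in per_task_results)

# Toy example: two tasks with 3 and 4 criteria respectively.
print(benchmark_score([[True, False, True], [True, True, False, False]]))  # ~0.58
```

Under this scheme a top score of 0.39 would mean that, on average, fewer than two in five expert criteria per hard task are satisfied.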
Problem

Research questions and friction points this paper is trying to address.

Evaluating frontier models' real-world professional reasoning in high-stakes domains
Assessing open-ended tasks in Finance and Law with expert-authored rubrics
Identifying critical gaps in model reliability for professional adoption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced a large-scale, expert-curated rubric benchmark (1,100 tasks, 19,356 criteria)
Recruited 182 qualified professionals to author tasks inspired by their real workflows
Evaluated 20 leading models against independently validated, expert-authored criteria
🔎 Similar Papers
No similar papers found.
👥 Authors

Afra Feyza Akyurek (Scale AI)
Advait Gosai (Scale AI)
Chen Bo Calvin Zhang (Scale AI)
Vipul Gupta (Scale AI)
Jaehwan Jeong (Samsung Electronics)
Anisha Gunjal (Scale AI)
Tahseen Rabbani (Postdoctoral Scholar, University of Chicago)
Maria Mazzone (Scale AI)
David Randolph (Scale AI)
Mohammad Mahmoudi Meymand (Scale AI)
Gurshaan Chattha (Scale AI)
Paula Rodriguez (Scale AI)
Diego Mares (Scale AI)
Pavit Singh (Scale AI)
Michael Liu (Scale AI)
Subodh Chawla (Scale AI)
Pete Cline (Scale AI)
Lucy Ogaz (Scale AI)
Ernesto Hernandez (Scale AI)
Zihao Wang (Scale AI)
Pavi Bhatter (Scale AI)
Marcos Ayestaran (Scale AI)
Bing Liu (Scale AI)
Yunzhong He (University of California, Los Angeles)