XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) exhibit near-saturation performance on traditional benchmarks, which inadequately capture their capabilities in complex, open-ended expert-level tasks. To address this limitation, this work introduces XpertBench, a high-fidelity benchmark comprising 1,346 real-world tasks contributed by over 1,000 domain experts across 80 specialized fields. The study further proposes ShotJudge, a novel evaluation paradigm that integrates fine-grained, multidimensional weighted rubrics—featuring 15–40 checkpoints per task—with a few-shot-calibrated LLM-based judge to mitigate self-evaluation bias and enhance ecological validity. Experimental results reveal that even state-of-the-art models achieve only around 55% average performance on this benchmark, with the highest success rate reaching approximately 66%, thereby exposing a substantial “expertise gap” in current LLM capabilities.
📝 Abstract
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in the complex, open-ended tasks that characterize genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts, including researchers from elite institutions and practitioners with extensive clinical or industrial experience, ensuring superior ecological validity. Each task uses detailed rubrics with typically 15-40 weighted checkpoints to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
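The abstract describes scoring each task against a rubric of 15-40 weighted checkpoints, with verdicts produced by an LLM judge calibrated on expert few-shot exemplars (ShotJudge). The paper's actual scoring interface is not given here; the sketch below is only an illustration of how such weighted rubric aggregation could work. The Checkpoint schema, the judge callable, and the 0-1 verdict scale are assumptions, and the toy_judge stub merely stands in for a few-shot-calibrated LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    """One rubric item: a criterion and its weight (assumed schema, not the paper's)."""
    criterion: str
    weight: float


def score_response(
    response: str,
    rubric: List[Checkpoint],
    judge: Callable[[str, str], float],
) -> float:
    """Weighted rubric score in [0, 100].

    `judge(response, criterion)` is assumed to return a verdict in [0, 1],
    e.g. from an LLM judge prompted with expert few-shot exemplars.
    """
    total_weight = sum(cp.weight for cp in rubric)
    earned = sum(cp.weight * judge(response, cp.criterion) for cp in rubric)
    return 100.0 * earned / total_weight


# Illustrative stub standing in for the LLM judge: a simple substring check.
def toy_judge(response: str, criterion: str) -> float:
    return 1.0 if criterion.lower() in response.lower() else 0.0


rubric = [
    Checkpoint("cites the applicable statute", 3.0),
    Checkpoint("quantifies the financial exposure", 2.0),
    Checkpoint("flags missing clinical evidence", 1.0),
]
print(score_response("The answer cites the applicable statute.", rubric, toy_judge))
# -> 50.0 (3 of 6 total weight earned)
```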
Problem

Research questions and friction points this paper is trying to address.

expert-level evaluation
large language models
benchmarking
open-ended tasks
ecological validity
Innovation

Methods, ideas, or system contributions that make the work stand out.

XpertBench
rubrics-based evaluation
expert-level tasks
ShotJudge
ecological validity
Authors

Xue Liu
Xin Ma
Yuxin Ma
Yongchang Peng
Duo Wang
Zhoufutu Wen (ByteDance SEED)
Ge Zhang
Kaiyuan Zhang
Xinyu Chen
Tianci He
Jiani Hou
Liang Hu
Ziyun Huang
Yongzhe Hui
Jianpeng Jiao
Chennan Ju
Yingru Kong
Yiran Li
Mengyun Liu
Luyao Ma
Fei Ni (Imperial College London)
Yiqing Ni
Yueyan Qiu
Yanle Ren
Zilin Shi