Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

πŸ“… 2025-05-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
LLM benchmarking suffers from inconsistent cross-benchmark rankings and insufficient discriminability among top-performing models, hindering accurate assessment of true model capabilities. To address this, we propose PSN-IRTβ€”a novel framework that extends Item Response Theory (IRT) by incorporating rich item parameters (e.g., difficulty, discrimination, guessing) and jointly modeling item features and model responses via a pseudo-siamese neural network (PSN). This framework systematically uncovers measurement biases in mainstream benchmarks and enables the construction of compact, high-fidelity evaluation suites. Experiments demonstrate that PSN-IRT reduces ability estimation error by 32% even when test length is halved, while achieving significantly higher alignment with human preferences (+18.7% Kendall’s Ο„) compared to original benchmarks. Moreover, it enhances interpretability of item parameters without compromising psychometric rigor, thereby improving both assessment accuracy and theoretical grounding of LLM evaluation.
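The item parameters named above (difficulty, discrimination, guessing) are those of the classic three-parameter logistic (3PL) IRT model. As a minimal sketch of that underlying formula — not the PSN-IRT network itself, and with illustrative parameter names:

```python
import math

def p_correct(theta, a, b, c):
    """Standard 3PL item response function: probability that a model with
    ability `theta` answers an item correctly, given the item's
    discrimination `a`, difficulty `b`, and guessing floor `c`."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An item of average difficulty (b=0) with a 25% guessing floor:
# a model of average ability (theta=0) succeeds with probability
# 0.25 + 0.75 * 0.5 = 0.625.
```

PSN-IRT, per the summary, estimates such parameters jointly from item features and model responses rather than fitting each item in isolation.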

πŸ“ Abstract
The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose a new framework for accurate and reliable estimation of item characteristics and model abilities. Specifically, we propose the Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. Based on PSN-IRT, we conduct extensive analysis which reveals significant and varied shortcomings in the measurement quality of current benchmarks. Furthermore, we demonstrate that PSN-IRT can be leveraged to construct smaller benchmarks while maintaining stronger alignment with human preference.
Problem

Research questions and friction points this paper is trying to address.

Inconsistent LLM benchmark evaluations across leaderboards
Poor separability among top-performing language models
Current benchmarks suffer from poor measurement quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Pseudo-Siamese Network for Item Response Theory
Enhances IRT with rich item parameters
Constructs smaller benchmarks aligned with human preference
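A classical psychometric route to the last point is maximum-information item selection: keep only the items that are most informative about ability at the region of interest. The paper's actual selection criterion may differ; the sketch below uses the standard 3PL Fisher information formula, with a hypothetical `(item_id, a, b, c)` tuple format.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability level theta."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (a ** 2) * ((p - c) ** 2 / (1.0 - c) ** 2) * (q / p)

def shrink_benchmark(items, theta, k):
    """Keep the k items carrying the most information at ability theta.
    `items` is a list of (item_id, a, b, c) tuples -- hypothetical format."""
    ranked = sorted(items, key=lambda it: item_information(theta, *it[1:]),
                    reverse=True)
    return [it[0] for it in ranked[:k]]

# Highly discriminating items near the target ability dominate:
items = [("q1", 2.0, 0.0, 0.2),   # sharp, well-targeted item
         ("q2", 0.5, 0.0, 0.2),   # weakly discriminating item
         ("q3", 1.0, 3.0, 0.2)]   # far too difficult for theta = 0
```

Under this criterion, `shrink_benchmark(items, 0.0, 2)` retains `q1` and `q2`, discarding the mistargeted `q3` — mirroring the paper's finding that smaller, better-targeted suites can measure ability at least as well as the full benchmark.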
πŸ”Ž Similar Papers
No similar papers found.
Hongli Zhou
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Hui Huang
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Ziqing Zhao
Master student at Technical University of Munich
computer vision · variational inference
Lvyuan Han
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Huicheng Wang
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Kehai Chen
Harbin Institute of Technology (Shenzhen)
LLM · Natural Language Processing · Agent · Multi-model Generation
Muyun Yang
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Wei Bao
The University of Sydney
Computer Networks · Mobile Computing · Wireless Communications
Jian Dong
Shopee
Computer Vision · Machine Learning
Bing Xu
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Conghui Zhu
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Hailong Cao
Harbin Institute of Technology
Tiejun Zhao
Faculty of Computing, Harbin Institute of Technology, Harbin, China