Aligning Language Model Benchmarks with Pairwise Preferences

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing language model benchmarks often fail to reflect human preferences in real-world scenarios. To address this limitation, this work proposes BenchAlign, a framework that learns per-question weights from limited pairwise model-preference data together with question-level performance, producing a static, interpretable benchmark aligned with those preferences. Notably, BenchAlign requires no additional human annotation: it aligns the benchmark using only ranked model pairs collected during deployment. Experiments show that the resulting benchmark accurately predicts human preference rankings for unseen models, demonstrating strong generalization and interpretability.

📝 Abstract
Language model benchmarks are pervasive and computationally efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weightings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
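The abstract's core idea (learn per-question weights so that weighted benchmark scores reproduce observed pairwise model preferences) can be sketched with a Bradley-Terry-style logistic loss. The paper does not spell out BenchAlign's exact objective, so everything below (the non-negativity constraint, the loss, all variable names) is an illustrative assumption, not the authors' method:

```python
import numpy as np

# Hypothetical sketch of preference-aligned weighting.
# perf[i, j] = 1 if model i answered benchmark question j correctly, else 0.
rng = np.random.default_rng(0)
n_models, n_questions = 6, 20
perf = rng.integers(0, 2, size=(n_models, n_questions)).astype(float)

# Pairwise preferences "collected during deployment": (winner, loser) indices.
pairs = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (1, 5)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Learn weights w so the weighted score s_i = perf[i] @ w respects the
# observed orderings, via gradient descent on a logistic (Bradley-Terry) loss.
w = np.ones(n_questions) / n_questions
lr = 0.5
for _ in range(500):
    grad = np.zeros_like(w)
    for win, lose in pairs:
        diff = perf[win] - perf[lose]
        p = sigmoid(diff @ w)          # modeled P(win preferred over lose)
        grad += (p - 1.0) * diff       # gradient of -log p w.r.t. w
    w -= lr * grad / len(pairs)
    w = np.clip(w, 0.0, None)          # non-negative weights stay interpretable

scores = perf @ w                      # aligned benchmark scores per model
agree = sum(scores[a] > scores[b] for a, b in pairs)
print(f"preference pairs respected: {agree}/{len(pairs)}")
```

The resulting static weight vector can then score previously unseen models with a single dot product, which matches the abstract's claim of producing a reusable offline benchmark rather than requiring new preference collection per model.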
Problem

Research questions and friction points this paper is trying to address.

benchmark alignment
language models
pairwise preferences
human preferences
model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

benchmark alignment
pairwise preferences
BenchAlign
language model evaluation
preference-aligned weighting