Models Know Models Best: Evaluation via Model-Preferred Formats

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inconsistency in large language models' performance on multiple-choice tasks caused by variations in evaluation format, specifically symbol-based versus cloze-style (fill-in-the-blank) presentations, which often obscures their true capabilities. To resolve this, the authors propose a dynamic format-alignment strategy that uses a lightweight classifier to adaptively select the evaluation format for each question based on the model's internal preference signals, such as zero-shot likelihood scores. This replaces brittle, human-designed heuristics with an automated mechanism driven by model-intrinsic signals. For the first time, format selection is grounded automatically in the model's own preferences, consistently and significantly improving zero-shot accuracy across multiple reasoning and knowledge benchmarks and offering a more faithful assessment of latent abilities.
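
As a concrete illustration of the two styles the summary contrasts, the sketch below scores one multiple-choice item both ways. It is a minimal sketch, assuming a Hugging Face-style causal LM; the model choice (gpt2), the prompts, and the `continuation_logprob` helper are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of symbol-based vs. cloze-style scoring, assuming a
# Hugging Face-style causal LM; model, prompts, and helper are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities of the continuation tokens given the prompt.
    Tokenizing prompt and continuation jointly is a common approximation."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logp = torch.log_softmax(model(ids).logits, dim=-1)
    # The token at position p is predicted by the logits at position p - 1.
    return sum(logp[0, p - 1, ids[0, p]].item()
               for p in range(prompt_len, ids.shape[1]))

stem = "The capital of France is"
choices = ["Paris", "London", "Berlin", "Madrid"]

# Cloze-style: score each answer text as a natural-language continuation.
cloze_scores = [continuation_logprob(stem, " " + c) for c in choices]

# Symbol-based: list labeled options and score only the label token.
labels = ["A", "B", "C", "D"]
options = "\n".join(f"{l}. {c}" for l, c in zip(labels, choices))
symbol_scores = [continuation_logprob(stem + "\n" + options + "\nAnswer:",
                                      " " + l) for l in labels]

def pick(scores):
    return choices[max(range(len(scores)), key=scores.__getitem__)]

print("cloze pick: ", pick(cloze_scores))
print("symbol pick:", pick(symbol_scores))
```

When the two picks disagree on an item, that disagreement is exactly the format sensitivity the paper sets out to measure and correct.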

📝 Abstract
Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
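
The format-alignment step can be pictured as a small router over such scores: summary features of each question's per-format likelihoods feed a lightweight classifier that predicts which format will answer correctly. Below is a minimal sketch assuming scikit-learn; the feature set, the synthetic stand-in data, and the logistic-regression choice are all assumptions, since the abstract specifies only a lightweight classifier trained on latent preference signals.

```python
# Hedged sketch of the format-alignment router; features, stand-in data,
# and logistic regression are assumptions, not the paper's classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def preference_features(cloze_scores, symbol_scores):
    """Summary statistics of one question's per-format likelihood scores."""
    feats = []
    for v in (np.asarray(cloze_scores), np.asarray(symbol_scores)):
        top2 = np.sort(v)[-2:]                          # two highest scores
        feats += [v.max(), top2[1] - top2[0], v.std()]  # peak, margin, spread
    return np.array(feats)

# Stand-in training set: random scores for 200 four-option questions,
# labeled 1 if the symbol-based format answered correctly, else 0.
rng = np.random.default_rng(0)
train = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(200)]
y = rng.integers(0, 2, size=200)

X = np.stack([preference_features(c, s) for c, s in train])
clf = LogisticRegression().fit(X, y)

def select_format(cloze_scores, symbol_scores) -> str:
    """Route one question to the format the classifier expects to win."""
    x = preference_features(cloze_scores, symbol_scores).reshape(1, -1)
    return "symbol" if clf.predict(x)[0] == 1 else "cloze"
```

At evaluation time each question is scored once under both formats, routed by `select_format`, and the chosen format's top-scoring option is taken as the model's answer.
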
Problem

Research questions and friction points this paper is trying to address.

evaluation format
large language models
performance inconsistency
multiple-choice tasks
model capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

model-preferred formats
dynamic format alignment
latent preference signals
zero-shot evaluation
LLM evaluation
Joonhak Lee
Graduate School of Data Science, Seoul National University
Sungmok Jung
Graduate School of Data Science, Seoul National University
Jongyeon Park
Graduate School of Data Science, Seoul National University
Jaejin Lee
Dept. of Computer Science and Engineering, Seoul National University
Parallel processing · Compilers · Computer architectures · Operating systems · Heterogeneous computing