🤖 AI Summary
This work addresses the inconsistency in large language models' performance on multiple-choice tasks caused by variations in evaluation format, specifically symbol-based versus cloze (fill-in-the-blank) styles, which often obscure a model's true capabilities. To resolve this, the authors propose a dynamic format-alignment strategy: a lightweight classifier adaptively selects the better evaluation format for each question based on the model's internal preference signals, such as zero-shot likelihood scores. This replaces brittle human-designed heuristics with an automated mechanism driven by model-intrinsic signals. The approach enables, for the first time, automatic format selection grounded in the model's own preferences, and it consistently and significantly improves zero-shot accuracy across multiple reasoning and knowledge benchmarks, thereby offering a more faithful assessment of the model's latent abilities.
📝 Abstract
Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
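The abstract's core mechanism (a lightweight classifier over the model's zero-shot likelihood signals that picks a format per question) can be illustrated with a minimal sketch. The feature choice (top-1 probability margin per format) and the logistic weights are hypothetical stand-ins, not the paper's actual design; in practice the per-option log-likelihoods would come from an LLM and the weights would be learned from held-out data.

```python
import math

def softmax(scores):
    """Convert per-option log-likelihoods into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def preference_features(cloze_logliks, symbol_logliks):
    """Hypothetical preference signal: the model's top-1 vs top-2
    probability margin under each evaluation format."""
    def margin(logliks):
        p = sorted(softmax(logliks), reverse=True)
        return p[0] - p[1]
    return margin(cloze_logliks), margin(symbol_logliks)

def select_format(cloze_logliks, symbol_logliks, w=(1.0, -1.0), b=0.0):
    """Lightweight logistic classifier over the preference features.
    Weights w and bias b are illustrative; they would be trained."""
    x_cloze, x_symbol = preference_features(cloze_logliks, symbol_logliks)
    z = w[0] * x_cloze + w[1] * x_symbol + b
    p_cloze = 1.0 / (1.0 + math.exp(-z))
    return "cloze" if p_cloze >= 0.5 else "symbol"

# The model is far more decisive under the cloze format here,
# so the classifier routes this question to cloze scoring.
choice = select_format(
    cloze_logliks=[-1.0, -5.0, -5.0, -5.0],   # confident
    symbol_logliks=[-2.0, -2.1, -2.2, -2.3],  # nearly flat
)
```

With the roles reversed (flat cloze likelihoods, decisive symbol likelihoods), the same classifier routes the question to symbol-based selection instead.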