🤖 AI Summary
Current large language model evaluation frameworks commonly rely on uniform, static prompt templates, neglecting model-specific prompt optimization and thereby introducing performance distortion and ranking bias. This work presents the first systematic investigation into the impact of prompt optimization on model evaluation and introduces a novel “optimize-then-evaluate” paradigm: prompts are individually optimized for each model prior to performance assessment. Through comprehensive experiments employing diverse prompt optimization techniques across established academic and industrial benchmarks, the study demonstrates that prompt optimization substantially alters model rankings. These findings underscore the critical role of customized prompting in achieving accurate evaluations and informed model selection, effectively bridging the gap between academic assessment protocols and real-world industrial practices.
📝 Abstract
Current Large Language Model (LLM) evaluation frameworks use the same static prompt template for every model under evaluation. This differs from the common industry practice of applying prompt optimization (PO) techniques to tailor the prompt to each model and maximize application performance. In this paper, we investigate the effect of PO on LLM evaluation. Our results on public academic and internal industry benchmarks show that PO greatly affects the final ranking of models. This highlights the importance of performing PO per model when practitioners run evaluations to choose the best model for a given task.
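The contrast between static-prompt evaluation and per-model PO can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two toy "models" are hypothetical deterministic functions, the data are made-up addition questions, and the PO step is simply "pick the candidate prompt with the best dev-set accuracy", standing in for whatever PO technique is actually used.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str, str], str]   # (prompt, question) -> answer
Example = Tuple[str, str]           # (question, gold answer)


def _solve(question: str) -> int:
    """Add the two integers in a question such as '2+3'."""
    a, b = question.split("+")
    return int(a) + int(b)


def model_a(prompt: str, question: str) -> str:
    # Hypothetical prompt-sensitive model: always right when told to think
    # step by step, otherwise right only when the sum is even.
    total = _solve(question)
    if "step by step" in prompt or total % 2 == 0:
        return str(total)
    return "?"


def model_b(prompt: str, question: str) -> str:
    # Hypothetical prompt-insensitive model: fails only when the sum is 9.
    total = _solve(question)
    return str(total) if total != 9 else "?"


def accuracy(model: Model, prompt: str, data: List[Example]) -> float:
    return sum(model(prompt, q) == gold for q, gold in data) / len(data)


def optimize_prompt(model: Model, candidates: List[str], dev: List[Example]) -> str:
    # Stand-in for a real PO method: keep the candidate with the best dev accuracy.
    return max(candidates, key=lambda p: accuracy(model, p, dev))


def rank(scores: Dict[str, float]) -> List[str]:
    return sorted(scores, key=scores.get, reverse=True)


def compare_rankings(models: Dict[str, Model], static_prompt: str,
                     candidates: List[str], dev: List[Example],
                     test: List[Example]) -> Tuple[List[str], List[str]]:
    # Ranking with one shared prompt vs. ranking after per-model PO.
    static = {n: accuracy(m, static_prompt, test) for n, m in models.items()}
    optimized = {n: accuracy(m, optimize_prompt(m, candidates, dev), test)
                 for n, m in models.items()}
    return rank(static), rank(optimized)


MODELS = {"model_a": model_a, "model_b": model_b}
STATIC_PROMPT = "Answer the question."
CANDIDATES = [STATIC_PROMPT, "Think step by step.", "Be concise."]
DEV = [("1+1", "2"), ("2+3", "5"), ("4+4", "8"), ("3+4", "7")]
TEST = [("1+2", "3"), ("2+2", "4"), ("3+6", "9"), ("5+2", "7")]

static_rank, optimized_rank = compare_rankings(
    MODELS, STATIC_PROMPT, CANDIDATES, DEV, TEST)
print("static prompt:", static_rank)
print("per-model PO: ", optimized_rank)
```

In this toy setup the shared prompt ranks `model_b` first, while per-model optimization ranks `model_a` first, because only `model_a` benefits from a better prompt. Selecting prompts on a dev split and scoring on a separate test split is a design choice of the sketch, made so that the optimized score is not inflated by tuning on the evaluation data.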