🤖 AI Summary
Large language model (LLM) performance is highly sensitive to prompt template design, and manual prompt engineering incurs substantial computational and human effort.
Method: We observe strong cross-model consistency in prompt preferences across LLMs of varying scales on diverse tasks. Leveraging this insight, we propose a novel "small-model-driven automatic prompt selection for large models" paradigm: training a lightweight prompt evaluator to predict high-performing prompts for target LLMs, with built-in support for cross-scale generalization. Our approach integrates multi-LLM comparative prompt preference analysis, zero- and few-shot prompt evaluation, and cross-model transfer strategies.
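The selection step described above can be sketched in a few lines: score each candidate template with a small, cheap model, then hand only the top-ranked template to the large model. This is a minimal illustration of the paradigm, not the paper's implementation; the scorer here is a hypothetical stand-in for the small model's dev-set accuracy or log-likelihood, and the template strings are invented examples.

```python
# Sketch of "small model selects prompts for the large model", assuming
# the cross-model prompt-preference consistency observed in the paper.
# `score_with_small_model` is a hypothetical stand-in for a real scorer
# (e.g. the small LLM's accuracy or log-likelihood on a small dev set).

def select_prompt(templates, score_with_small_model):
    """Rank candidate templates by the small model's score; return the best."""
    scored = [(score_with_small_model(t), t) for t in templates]
    scored.sort(reverse=True)  # highest-scoring template first
    return scored[0][1]

# Hypothetical candidate templates for a QA task.
templates = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "Answer the following question.\n{q}\n",
]

# Toy scores standing in for the small model's evaluation of each template.
toy_scores = {templates[0]: 0.71, templates[1]: 0.64, templates[2]: 0.58}

best = select_prompt(templates, toy_scores.get)
print(best)  # the template the small model ranks highest
```

Because only the lightweight evaluator scores every candidate, the expensive large model is queried with a single template, which is where the cost saving comes from.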
Contribution/Results: Evaluated across 14 LLMs on question answering, natural language inference, and other tasks, our selected prompts achieve state-of-the-art performance while drastically reducing the computational overhead and manual labor traditionally required for prompt engineering.
📄 Abstract
The performance of pre-trained Large Language Models (LLMs) is often sensitive to nuances in prompt templates, requiring careful prompt engineering and adding costs in terms of compute and human effort. In this study, we present experiments encompassing multiple LLM variants of varying sizes aimed at probing their preferences among different prompts. Through experiments on Question Answering, we show prompt preference consistency across LLMs of different sizes. We also show that this consistency extends to other tasks, such as Natural Language Inference. Leveraging this consistency, we propose a method that uses a smaller model to select effective prompt templates for a larger model. We show that our method substantially reduces the cost of prompt engineering while consistently matching the performance of the optimal prompt among the candidates. More importantly, our experiments show the efficacy of our strategy across fourteen LLMs and its applicability to a broad range of NLP tasks, highlighting its robustness.