One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluations of instruction-based embedding models commonly rely on a single prompt per task, overlooking the models' sensitivity to instruction wording. This work systematically examines 15 prompt variants per dataset across six embedding models and eleven datasets (6 × 11 × 15 = 990 experimental configurations) and reveals, for the first time at scale, that single-prompt evaluation can misrepresent model performance. The default prompt may systematically over- or underestimate a model's capability, and any model in the study can be moved to first place on the leaderboard through favorable prompt selection. To address this, the study recommends that benchmarks either evaluate over multiple prompts or report prompt sensitivity alongside point estimates.

📝 Abstract

Instruction-based embedding models have become common among state-of-the-art models, yet they are evaluated using a single prompt per task. This single-point evaluation ignores a central problem of the instruction-based approach: sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, for a total of 990 configurations. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can systematically understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
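
To make the recommendation concrete, here is a minimal Python sketch of what reporting sensitivity alongside a point estimate could look like. It is not the paper's code. The model names, the scores, and the helpers `sensitivity_report` and `can_be_ranked_first` are hypothetical, and the ranking check assumes one particular reading of "favorable prompt selection": the target model gets its best prompt while each competitor gets its worst.

```python
from statistics import mean, pstdev

# Hypothetical toy scores (NOT results from the paper): scores[model][i] is
# the score of `model` on one task under the i-th prompt variant.
# Index 0 plays the role of the default prompt.
scores = {
    "model_a": [0.62, 0.58, 0.66, 0.60],
    "model_b": [0.61, 0.64, 0.57, 0.63],
    "model_c": [0.59, 0.60, 0.61, 0.65],
}


def sensitivity_report(scores, default_index=0):
    """Summarise each model's score distribution over its prompt variants."""
    report = {}
    for model, values in scores.items():
        report[model] = {
            "default": values[default_index],  # the usual single-prompt score
            "mean": mean(values),              # multi-prompt point estimate
            "std": pstdev(values),             # spread across prompts
            "min": min(values),
            "max": max(values),
        }
    return report


def can_be_ranked_first(scores, model):
    """Assumed criterion: the model's best prompt beats every rival's worst prompt."""
    best = max(scores[model])
    return all(best > min(v) for m, v in scores.items() if m != model)


if __name__ == "__main__":
    for model, stats in sensitivity_report(scores).items():
        print(model, {k: round(v, 4) for k, v in stats.items()})
    for model in scores:
        print(model, "can be ranked first:", can_be_ranked_first(scores, model))
```

In this toy data the default prompt places `model_a` first, yet every model satisfies the best-versus-worst criterion, which is the kind of ranking instability the abstract describes.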

Problem

Research questions and friction points this paper is trying to address.

instruction sensitivity

embedding models

prompt robustness

model evaluation

leaderboard ranking

Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction sensitivity

embedding models

prompt robustness