🤖 AI Summary
Large multimodal models (LMMs) exhibit high prompt sensitivity in multiple-choice question answering (MCQA): minor lexical or structural prompt variations can cause accuracy fluctuations of up to 15%, severely compromising evaluation fairness and cross-model comparability.
Method: We propose Promptception, a systematic framework that constructs 61 prompt templates spanning six broad supercategories and fifteen fine-grained categories. Using it, we conduct a large-scale prompt sensitivity analysis of ten state-of-the-art LMMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models, on the MMStar, MMMU-Pro, and MVBench benchmarks.
Contribution/Results: Our study is the first to reveal a critical trade-off: closed-source LMMs achieve higher absolute performance but show significantly lower prompt robustness, whereas open-source models are more stable under prompt variation. Based on these empirical findings, we derive prompting principles tailored to each model type that improve the transparency, reproducibility, and reliability of evaluation.
📝 Abstract
Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs, ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, and MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.
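The sensitivity metric at the heart of this kind of analysis can be sketched as the accuracy spread across prompt templates for a single model. The sketch below is illustrative only: the function names and the example accuracy values are assumptions, not taken from the paper.

```python
# Hypothetical sketch of prompt-sensitivity measurement: score one model
# under several prompt templates, then report the max-min accuracy spread.
# All names and numbers here are illustrative, not from the paper.

def accuracy(predictions, gold_answers):
    """Fraction of MCQA predictions that match the gold answers."""
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

def prompt_sensitivity(per_template_accuracies):
    """Spread (max - min) of accuracy across prompt templates.

    A spread of 0.15 corresponds to the 'up to 15%' deviation the
    abstract reports for some prompt/model combinations.
    """
    return max(per_template_accuracies) - min(per_template_accuracies)

# Illustrative per-template accuracies for one model.
accs = [0.62, 0.55, 0.68, 0.59, 0.53]
print(f"spread = {prompt_sensitivity(accs):.2f}")  # prints "spread = 0.15"
```

In a full study, `per_template_accuracies` would hold one benchmark accuracy per prompt template; reporting only the best entry (rather than the spread) is exactly the best-case-prompt practice the abstract criticizes.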