Promptception: How Sensitive Are Large Multimodal Models to Prompts?

📅 2025-09-04
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
Large multimodal models (LMMs) exhibit high prompt sensitivity in multiple-choice question answering (MCQA): minor lexical or structural variations in a prompt can shift accuracy by up to 15%, undermining evaluation fairness and cross-model comparability. Method: We propose Promptception, a systematic framework of 61 prompt templates spanning 6 supercategories and 15 fine-grained categories. Using it, we conduct a large-scale prompt-sensitivity analysis of 10 state-of-the-art LMMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models, on the MMStar, MMMU-Pro, and MVBench benchmarks. Contribution/Results: Our study reveals a trade-off: closed-source LMMs achieve higher absolute performance but are markedly less robust to prompt variation, whereas open-source models are more stable. From these findings we derive model-type-aware Prompting Principles that make LMM evaluation more transparent, reproducible, and reliable.

📝 Abstract
Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.
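To make the evaluation protocol concrete, here is a minimal sketch of a prompt-sensitivity sweep in the spirit of Promptception. Everything in it is an illustrative assumption: the three templates stand in for the paper's 61, and `model_fn` is a placeholder for whatever LMM call is actually used.

```python
# Minimal prompt-sensitivity sweep (illustrative sketch, not the paper's code).
# TEMPLATES stands in for Promptception's 61 templates; model_fn is a
# placeholder callable that maps a prompt string to a predicted option letter.

def format_options(options):
    """Render options as 'A. ...', 'B. ...', and so on."""
    return "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))

TEMPLATES = {
    "plain":      "{question}\nOptions:\n{options}\nAnswer:",
    "lettered":   "Question: {question}\nChoices:\n{options}\nReply with the option letter only.",
    "instructed": "Read the question carefully and pick the single best choice.\n{question}\n{options}",
}

def evaluate(model_fn, dataset):
    """Score one model under every template.

    Dataset items are dicts with 'question', 'options' (list of strings),
    and the gold 'answer' letter, e.g. 'A'.
    """
    scores = {}
    for name, template in TEMPLATES.items():
        correct = 0
        for item in dataset:
            prompt = template.format(question=item["question"],
                                     options=format_options(item["options"]))
            pred = model_fn(prompt)  # placeholder LMM call
            correct += pred.strip().upper().startswith(item["answer"])
        scores[name] = correct / len(dataset)
    spread = max(scores.values()) - min(scores.values())  # accuracy fluctuation
    return scores, spread

# Usage with a stub model that always answers 'A':
# scores, spread = evaluate(lambda p: "A",
#                           [{"question": "2+2?", "options": ["4", "5"], "answer": "A"}])
```

The `spread` value is one way to read the paper's headline number: a spread of 0.15 corresponds to the reported 15% accuracy deviation across prompts.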
Problem

Research questions and friction points this paper is trying to address.

Evaluating prompt sensitivity in Large Multimodal Models
Assessing accuracy variations due to prompt phrasing changes (one way to formalize this is sketched after this list)
Establishing fair evaluation frameworks for LMM performance
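
One illustrative formalization of the accuracy variation above (the paper may use a different statistic):

$$\Delta_m = \max_{t \in \mathcal{T}} \mathrm{Acc}(m, t) - \min_{t \in \mathcal{T}} \mathrm{Acc}(m, t)$$

where \(\mathcal{T}\) is the set of prompt templates (61 in Promptception) and \(\mathrm{Acc}(m, t)\) is model \(m\)'s MCQA accuracy under template \(t\); the reported deviations of up to 15% correspond to \(\Delta_m \approx 0.15\).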
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic framework for evaluating prompt sensitivity
61 prompt types across 15 categories and 6 supercategories
Prompting Principles tailored to proprietary and open-source LMMs (illustrated in the sketch below)
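
As a sketch of how such model-type-aware principles could be operationalized: the guideline texts below loosely paraphrase the paper's findings, and the model names are illustrative assumptions, not an exhaustive list.

```python
# Illustrative only: maps a model to a prompting guideline by family.
# The guideline wording paraphrases the paper's findings; it is not quoted.
PRINCIPLES = {
    "proprietary": "Be precise and explicit; these models track instruction wording closely.",
    "open_source": "Keep phrasing simple and direct; avoid nuanced or layered instructions.",
}

def pick_guideline(model_name, proprietary=("GPT-4o", "Gemini 1.5 Pro")):
    """Return the guideline for a model, treating the listed names as proprietary."""
    kind = "proprietary" if model_name in proprietary else "open_source"
    return PRINCIPLES[kind]

# pick_guideline("GPT-4o")  -> precise/explicit guideline
# pick_guideline("LLaVA")   -> simple/direct guideline (hypothetical open-source example)
```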
👥 Authors
Mohamed Insaf Ismithdeen (Mohamed Bin Zayed University of Artificial Intelligence)
Muhammad Uzair Khattak (EPFL) · Computer Vision, Multi-modal Learning, Video Processing
Salman Khan (Mohamed Bin Zayed University of Artificial Intelligence; Australian National University)