🤖 AI Summary
Existing multimodal large language models typically adopt stochastic decoding strategies inherited from text-only language models for visual question answering (VQA), overlooking two facts: VQA answers follow a concentrated distribution, and uncertainty stems primarily from insufficient visual evidence. This work re-examines the suitability of greedy decoding from a calibration perspective, establishing for the first time sufficient conditions under which greedy decoding is optimal. Building on model calibration theory, the authors propose an improved greedy decoding method tailored for multimodal reasoning. The approach consistently outperforms standard greedy decoding and stochastic sampling across multiple VQA benchmarks, challenging the prevailing practice of defaulting to stochastic decoding in multimodal generation and providing both theoretical justification and empirical evidence for the efficacy of deterministic decoding in VQA.
📝 Abstract
Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited by Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than from multiple plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive sufficient conditions for the optimality of greedy decoding. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
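
To illustrate the intuition behind the link between calibration and accuracy, the sketch below compares the expected accuracy of greedy decoding and temperature sampling on a single closed-ended question. It is not the paper's formalization or its proposed method. It assumes a perfectly calibrated model (answer i is correct with exactly the probability p_i the model assigns to it), and the function names and the example distribution are hypothetical. Under that assumption, greedy decoding is correct with probability max(p), while sampling is correct with a weighted average of the p_i, which cannot exceed max(p).

```python
# Illustrative sketch only: assumes a perfectly calibrated model, i.e. answer i
# is correct with probability exactly p[i]. All names here are hypothetical.

def greedy_accuracy(p):
    """Expected accuracy when always emitting the most probable answer."""
    return max(p)

def sampling_accuracy(p, temperature=1.0):
    """Expected accuracy when sampling from q_i proportional to p_i ** (1 / temperature)."""
    weights = [pi ** (1.0 / temperature) for pi in p]
    total = sum(weights)
    q = [w / total for w in weights]
    # Answer i is drawn with probability q_i and is correct with probability p_i.
    return sum(qi * pi for qi, pi in zip(q, p))

# A head-heavy answer distribution, as is typical for closed-ended VQA.
answer_probs = [0.7, 0.2, 0.1]

print("greedy:          ", greedy_accuracy(answer_probs))
print("sampling (T=1.0):", sampling_accuracy(answer_probs, temperature=1.0))
print("sampling (T=0.5):", sampling_accuracy(answer_probs, temperature=0.5))
```

Lowering the temperature moves the sampling accuracy toward the greedy value without exceeding it in this idealized setting. Miscalibrated models can behave differently, which is why the sufficient conditions derived in the paper matter.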