🤖 AI Summary
This work proposes VOILA, a novel framework that introduces the concept of Value of Information (VoI) into multimodal visual question answering to enable cost-aware adaptive inference. Unlike existing systems that rely on fixed visual fidelity, VOILA dynamically balances computational cost and answer accuracy through a two-stage process: first, a gradient-boosted regressor predicts answer accuracy across varying levels of visual fidelity; second, isotonic calibration refines these probability estimates, which are then combined with retrieval costs to select the fidelity level maximizing expected utility. Extensive experiments across five datasets and six large language models demonstrate that VOILA reduces computational costs by 50–60% while preserving 90–95% of the accuracy achieved at full resolution.
📝 Abstract
Despite the significant cost of retrieving and processing high-fidelity visual inputs, most multimodal vision-language systems operate at fixed fidelity levels. We introduce VOILA, a framework for Value-Of-Information-driven adaptive fidelity selection in Visual Question Answering (VQA) that decides what information to retrieve before model execution. Given a query, VOILA runs a two-stage pipeline: a gradient-boosted regressor estimates correctness likelihood at each fidelity level from question features alone, then an isotonic calibrator refines these probabilities for reliable decision-making. The system selects the minimum-cost fidelity level that maximizes expected utility given predicted accuracy and retrieval costs. We evaluate VOILA across three deployment scenarios using five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) and six Vision-Language Models (VLMs) with 7B–235B parameters. VOILA consistently achieves 50–60% cost reductions while retaining 90–95% of full-resolution accuracy across diverse query types and model architectures, demonstrating that pre-retrieval fidelity selection is vital for optimizing multimodal inference under resource constraints.
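The two-stage pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the fidelity levels, retrieval costs, cost weight, and synthetic features are all assumptions made up for the example; the paper's actual feature set and utility function are not specified here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

FIDELITIES = ["low", "medium", "high"]
COSTS = {"low": 1.0, "medium": 3.0, "high": 10.0}  # hypothetical retrieval costs
LAMBDA = 0.05  # hypothetical weight trading accuracy against cost

rng = np.random.default_rng(0)
# Synthetic training data: question features -> observed correctness per fidelity.
X = rng.normal(size=(500, 8))
y = {f: np.clip(X[:, 0] * w + b + rng.normal(0, 0.2, 500), 0, 1)
     for f, (w, b) in zip(FIDELITIES, [(0.10, 0.40), (0.15, 0.60), (0.20, 0.75)])}

# Stage 1: one gradient-boosted regressor per fidelity predicts correctness
# likelihood from question features alone (no image needed at decision time).
regressors = {f: GradientBoostingRegressor().fit(X, y[f]) for f in FIDELITIES}

# Stage 2: isotonic calibration maps raw regressor scores onto reliable
# probabilities, preserving their ranking.
calibrators = {
    f: IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        .fit(regressors[f].predict(X), y[f])
    for f in FIDELITIES
}

def select_fidelity(x):
    """Pick the fidelity maximizing expected utility = p(correct) - lambda * cost."""
    utilities = {}
    for f in FIDELITIES:
        raw = regressors[f].predict(x.reshape(1, -1))
        p = calibrators[f].predict(raw)[0]  # calibrated correctness probability
        utilities[f] = p - LAMBDA * COSTS[f]
    return max(utilities, key=utilities.get)

choice = select_fidelity(rng.normal(size=8))
print(choice)
```

The additive utility `p - LAMBDA * COSTS[f]` is one simple way to combine calibrated accuracy with retrieval cost; the framework's actual utility formulation may differ.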