FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VQA robustness evaluations primarily focus on unimodal or localized out-of-distribution (OOD) scenarios, failing to capture the complexity of real-world multimodal distribution shifts. To address this gap, we introduce the first benchmark for evaluating fine-tuning robustness of VQA models under multimodal distribution shifts, covering in-distribution (ID), near-OOD, and far-OOD settings across 10 diverse VQA datasets. Our framework systematically models unimodal, multimodal, and adversarial shifts. We propose a novel multimodal joint shift classification framework, leveraging Mahalanobis distance to quantify cross-modal embedding shifts and measure modality interaction effects and relative importance. A unified evaluation protocol enables comprehensive assessment of mainstream robust fine-tuning methods, revealing critical modality-dependency patterns. All code and toolkits are publicly released.

📝 Abstract
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or restricted to particular types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark, FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA), for evaluating robust fine-tuning on VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA, and others, and categorize them into ID, near-OOD, and far-OOD datasets covering uni-modal, multi-modal, and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VQA robustness to multi-modal data shifts
Benchmarking fine-tuning methods for ID and OOD scenarios
Analyzing modality interactions in distribution shifts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes FRAMES-VQA benchmark for multi-modal shifts
Utilizes Mahalanobis distance to quantify distribution shifts
Analyzes uni-modal and multi-modal shifts interactions
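The Mahalanobis-distance shift quantification mentioned above can be sketched in a few lines. This is a minimal illustration over generic embedding matrices, not the paper's released code; the function name, the regularization constant, and the choice to average per-sample distances are assumptions for illustration:

```python
import numpy as np

def mahalanobis_shift(id_embeddings: np.ndarray, test_embeddings: np.ndarray) -> float:
    """Mean Mahalanobis distance of test embeddings from the ID embedding distribution.

    id_embeddings:   (N, D) embeddings from the in-distribution (fine-tuning) set.
    test_embeddings: (M, D) embeddings from a candidate near-/far-OOD set.
    A larger return value indicates a larger distribution shift.
    """
    mu = id_embeddings.mean(axis=0)
    cov = np.cov(id_embeddings, rowvar=False)
    # Small ridge term so the covariance is invertible even when N is modest.
    cov += 1e-6 * np.eye(cov.shape[0])
    inv_cov = np.linalg.inv(cov)
    diffs = test_embeddings - mu
    # Squared Mahalanobis distance per test sample: d_i = x_i^T S^{-1} x_i.
    sq_dists = np.einsum("ij,jk,ik->i", diffs, inv_cov, diffs)
    return float(np.sqrt(sq_dists).mean())
```

Applied to uni-modal (image-only or question-only) versus joint multi-modal embeddings, a score like this lets one compare how far each candidate dataset sits from the ID set along each modality.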