🤖 AI Summary
Existing VQA models exhibit significant performance degradation on cross-domain tasks such as remote sensing, medical imaging, and mathematical chart understanding, primarily due to large distribution shifts and the absence of effective domain adaptation mechanisms. To address this, we propose a plug-and-play, decoupled adaptive framework: a lightweight domain classifier first identifies the input image's domain; then domain-specific visual adapters (modulating visual features) and language adapters (modulating textual prompts) are dynamically injected via a unified hook interface, without fine-tuning the backbone network. The approach is modular, extensible, and incurs no backbone retraining. Evaluated on MathVQA, MedVQA-RAD, ChartQA, and a remote sensing VQA benchmark, it achieves consistent improvements, including +2.3 BLEU on MathVQA, +2.6 VQA accuracy on MedVQA-RAD, and +3.1 ROUGE on ChartQA, demonstrating robust cross-domain generalization.
📝 Abstract
Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance on natural images, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier that identifies the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experiments on four domain-specific VQA benchmarks show consistent performance gains, including +2.3 BLEU on MathVQA, +2.6 VQA accuracy on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH offers a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
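The classify-then-inject mechanism described above can be illustrated with a minimal, framework-agnostic sketch. All names here (`FrozenBackbone`, `register_hook`, `ADAPTERS`, the toy domain lookup) are hypothetical stand-ins, not the authors' implementation: a frozen backbone exposes hook points, a lightweight domain classifier picks a domain, and the matching Visual/Prompt adapters are attached before inference and detached afterward, so the backbone itself is never modified.

```python
# Hypothetical sketch of hook-based adapter injection; not the CATCH codebase.

class FrozenBackbone:
    """Stand-in for a pretrained VQA backbone whose weights stay fixed."""
    def __init__(self):
        self._hooks = {"visual": [], "prompt": []}

    def register_hook(self, site, fn):
        # Unified hook interface: adapters attach here without touching
        # backbone code or weights. Returns a handle that detaches the hook.
        self._hooks[site].append(fn)
        return lambda: self._hooks[site].remove(fn)

    def answer(self, visual_feats, prompt):
        # Hooks modulate features/prompts before the frozen "reasoning" step.
        for fn in self._hooks["visual"]:
            visual_feats = fn(visual_feats)
        for fn in self._hooks["prompt"]:
            prompt = fn(prompt)
        return f"{prompt} | feats={visual_feats}"

def classify_domain(image_tag):
    # Lightweight domain classifier (a trivial lookup here, purely for illustration).
    return {"xray": "medical", "chart": "chart"}.get(image_tag, "natural")

# Domain-specific (Visual Adapter, Prompt Adapter) pairs; toy transformations.
ADAPTERS = {
    "medical": (lambda f: [x * 2.0 for x in f], lambda p: "[medical] " + p),
    "chart":   (lambda f: [x + 1.0 for x in f], lambda p: "[chart] " + p),
}

def adapted_answer(backbone, image_tag, visual_feats, prompt):
    domain = classify_domain(image_tag)
    detach = []
    if domain in ADAPTERS:
        vis_fn, txt_fn = ADAPTERS[domain]
        detach.append(backbone.register_hook("visual", vis_fn))
        detach.append(backbone.register_hook("prompt", txt_fn))
    try:
        return backbone.answer(visual_feats, prompt)
    finally:
        for d in detach:  # detach adapters so the backbone stays clean
            d()

backbone = FrozenBackbone()
print(adapted_answer(backbone, "xray", [1.0, 2.0], "What is shown?"))
# -> [medical] What is shown? | feats=[2.0, 4.0]
```

In a real system the adapters would be trained modules (e.g. low-rank projections on vision features and learned prompt prefixes) and the hooks would be registered on intermediate layers, but the control flow is the same: only the classifier and adapters are domain-specific, and the backbone remains untouched.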