CATCH: A Modular Cross-domain Adaptive Template with Hook

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VQA models exhibit significant performance degradation on cross-domain tasks—such as remote sensing, medical imaging, and mathematical chart understanding—primarily due to large distribution shifts and insufficient domain adaptation mechanisms. To address this, we propose a plug-and-play, decoupled adaptive framework: a lightweight domain classifier first identifies the input image's domain; then, domain-specific visual adapters (modulating visual features) and language adapters (modulating textual prompts) are dynamically injected via a unified hook interface—without fine-tuning the backbone network. Our approach ensures strong modularity, high extensibility, and zero retraining overhead. Evaluated on MathVQA, MedVQA-RAD, ChartQA, and a remote sensing VQA benchmark, it achieves consistent improvements—+2.3 BLEU on MathVQA, +2.6 VQA accuracy on MedVQA-RAD, and +3.1 ROUGE on ChartQA—demonstrating robust cross-domain generalization.

📝 Abstract
Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
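The abstract's core mechanism—a lightweight domain classifier routing inputs to domain-specific Prompt and Visual Adapters injected through a hook interface, leaving the backbone untouched—can be illustrated with a minimal sketch. All names below (`AdapterHub`, `classify_domain`, the toy adapters) are illustrative assumptions, not code from the paper:

```python
from typing import Callable, Dict, List, Tuple

class AdapterHub:
    """Sketch of CATCH-style hook injection: adapters are registered per
    domain and applied to intermediate features and prompts at inference
    time, without modifying or retraining the backbone."""

    def __init__(self) -> None:
        self._visual: Dict[str, Callable] = {}
        self._prompt: Dict[str, Callable] = {}

    def register(self, domain: str,
                 visual_adapter: Callable, prompt_adapter: Callable) -> None:
        self._visual[domain] = visual_adapter
        self._prompt[domain] = prompt_adapter

    def hook(self, domain: str, visual_feats: List[float],
             prompt: str) -> Tuple[List[float], str]:
        # Identity fall-through for unknown domains preserves the
        # backbone's original behavior.
        v = self._visual.get(domain, lambda x: x)
        p = self._prompt.get(domain, lambda x: x)
        return v(visual_feats), p(prompt)

def classify_domain(image_tag: str) -> str:
    # Stand-in for the lightweight domain classifier.
    known = {"xray": "medical", "chart": "chart", "satellite": "remote_sensing"}
    return known.get(image_tag, "natural")

hub = AdapterHub()
hub.register(
    "medical",
    visual_adapter=lambda f: [x * 1.1 for x in f],  # toy feature modulation
    prompt_adapter=lambda q: "[medical] " + q,      # toy prompt modulation
)

domain = classify_domain("xray")
feats, prompt = hub.hook(domain, [1.0, 2.0], "What does the scan show?")
```

In a real VQA backbone the `hook` call would be attached to intermediate layers (e.g. via PyTorch forward hooks), which is what lets new domains be added by registering adapters rather than fine-tuning the model.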
Problem

Research questions and friction points this paper is trying to address.

Improves VQA generalization across diverse domains
Enables cross-domain adaptation without retraining backbone
Addresses performance degradation in specialized image domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular plug-and-play framework for cross-domain adaptation
Lightweight domain classifier and dual adapter mechanism
Dynamic injection via unified hook interface without retraining
Xinjin Li
Columbia University, United States
Yulie Lu
Shanghai Jiao Tong University, China
Jinghan Cao
San Francisco State University
Deep Learning, Large Language Model, Cloud Software Computing
Yu Ma
Indiana University
Computer Science
Zhenglin Li
Texas A&M University, College Station, United States
Yeyang Zhou
University of California, San Diego, United States