CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses catastrophic forgetting and poor generalization in multi-domain task-incremental learning when task identity is unavailable at inference. The authors build on a frozen CLIP model and exploit its text embedding space, which existing parameter-efficient methods leave unused. The method has three components: text-space task routing, which adds no parameters; a multi-prototype visual-textual confidence mechanism; and symmetric cross-modal gating. Together these handle task identification, confidence estimation, and encoder adaptation. The text-space routing is described as order-independent and robust to data scarcity. On the MTIL benchmark under Order-I, the approach achieves 74.2% Transfer, 80.5% Average, and 88.7% Last accuracy, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points respectively, with only 2.5M trainable parameters and no external data.
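Text-space task routing can be pictured as nearest-prototype matching by cosine similarity. The sketch below is an illustration only, not the authors' implementation: the function names `cosine` and `route_task`, the three-dimensional toy vectors, and the idea of one text prototype per task are all assumptions made for the example. In practice the vectors would come from the frozen CLIP image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def route_task(image_feature, task_text_prototypes):
    """Pick the task whose text prototype is most similar to the image feature.

    Hypothetical helper: nothing is trained here, so routing adds no
    parameters, and the result does not depend on the order in which
    tasks were registered.
    """
    scores = {
        task: cosine(image_feature, prototype)
        for task, prototype in task_text_prototypes.items()
    }
    best_task = max(scores, key=scores.get)
    return best_task, scores

# Toy stand-ins for frozen text prototypes (assumed values, one per task).
task_text_prototypes = {
    "aircraft": [1.0, 0.0, 0.0],
    "flowers": [0.0, 1.0, 0.0],
    "food": [0.0, 0.0, 1.0],
}

# Toy stand-in for a frozen image feature of a test input.
routed_task, routing_scores = route_task([0.1, 0.9, 0.2], task_text_prototypes)
print(routed_task)
```

Because each prototype is computed from text alone, adding a new task only adds one dictionary entry and leaves the existing entries untouched.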

📝 Abstract

Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
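The multi-prototype confidence idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the abstract does not say how the visual-prototype score and the cross-modal alignment score are combined, so the weighted average, the `weight` parameter, and the names `confidence` and `is_confident` are all hypothetical. The visual prototypes are assumed to have been computed beforehand (for example as K-means centroids of a class's image features).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def confidence(image_feature, visual_prototypes, text_embedding, weight=0.5):
    """Blend the best visual-prototype match with the text alignment score.

    Assumption: a simple weighted average; the actual combination rule
    is not given in the abstract.
    """
    # Several prototypes per class: keep the closest one rather than
    # forcing the class into a single mode.
    visual_score = max(cosine(image_feature, p) for p in visual_prototypes)
    # Cross-modal alignment between the image and the class text embedding.
    text_score = cosine(image_feature, text_embedding)
    return weight * visual_score + (1.0 - weight) * text_score

def is_confident(score, task_threshold):
    """Compare a score against a threshold calibrated for one task."""
    return score >= task_threshold

# Toy 2-d example with assumed values.
score = confidence(
    image_feature=[1.0, 0.0],
    visual_prototypes=[[1.0, 0.0], [0.0, 1.0]],
    text_embedding=[0.6, 0.8],
)
print(round(score, 3))
```

Keeping a separate threshold per task lets a domain with naturally lower similarity scores be judged on its own scale instead of a single global cutoff.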

Problem

Research questions and friction points this paper is trying to address.

multi-domain task-incremental learning

cross-modal prompting

vision-language models

task identity agnostic

catastrophic forgetting

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal prompting

task-incremental learning

CLIP text embedding