🤖 AI Summary
Existing methods for fusing features from multiple foundation models rely heavily on downstream fine-tuning or labor-intensive hyperparameter optimization. Method: We propose ComBo, a probing-based adapter that keeps all backbone parameters frozen and requires no backpropagation through the backbones. ComBo employs a lightweight Transformer to integrate token-level compressed features from multiple models and layers, and introduces a joint multi-backbone probing mechanism for cross-model feature fusion and efficient assessment of each backbone's task relevance. It requires no dataset-specific hyperparameter tuning and leaves every foundation model unchanged. Contribution/Results: Across all 19 VTAB-1k tasks, ComBo outperforms existing probing baselines, matches or surpasses costlier distillation-based model-merging approaches, and also supports efficient probing of fine-tuned models. ComBo thus offers a general-purpose, efficient, plug-and-play framework for multi-model feature composition.
📝 Abstract
Foundation models (FMs) trained with different objectives and data learn diverse representations, making some more effective than others for specific downstream tasks. Existing adaptation strategies, such as parameter-efficient fine-tuning, focus on individual models and do not exploit the complementary strengths across models. Probing methods offer a promising alternative by extracting information from frozen models, but current techniques do not scale well with large feature sets and often rely on dataset-specific hyperparameter tuning. We propose Combined backBones (ComBo), a simple and scalable probing-based adapter that effectively integrates features from multiple models and layers. ComBo compresses activations from layers of one or more FMs into compact token-wise representations and processes them with a lightweight transformer for task-specific prediction. Crucially, ComBo does not require dataset-specific tuning or backpropagation through the backbone models. However, not all models are equally relevant for all tasks. To address this, we introduce a mechanism that leverages ComBo's joint multi-backbone probing to efficiently evaluate each backbone's task-relevance, enabling both practical model comparison and improved performance through selective adaptation. On the 19 tasks of the VTAB-1k benchmark, ComBo outperforms previous probing methods, matches or surpasses more expensive alternatives, such as distillation-based model merging, and enables efficient probing of tuned models. Our results demonstrate that ComBo offers a practical and general-purpose framework for combining diverse representations from multiple FMs.
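The pipeline described in the abstract (compress token activations from frozen backbones, fuse them with a lightweight transformer, then predict) can be sketched in plain numpy. This is an illustrative toy forward pass under assumed shapes, not the paper's implementation: the layer counts, widths, single-head attention, and mean-pooling head are all placeholders standing in for the actual ComBo architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax for the attention weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(feats, proj):
    """Project one layer's token features (n_tokens, d_in) to a compact width."""
    return feats @ proj  # (n_tokens, d_small)

def self_attention(x, Wq, Wk, Wv):
    """Minimal single-head self-attention over the fused token sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

# Stand-in for frozen backbone activations: 2 models x 2 layers, 8 tokens each.
d_in, d_small, n_tokens, n_classes = 32, 8, 8, 10
layer_feats = [rng.standard_normal((n_tokens, d_in)) for _ in range(4)]

# Per-layer compression projections (these would be learned; random here).
projs = [rng.standard_normal((d_in, d_small)) * 0.1 for _ in layer_feats]
tokens = np.concatenate(
    [compress_tokens(f, p) for f, p in zip(layer_feats, projs)]
)  # (4 * n_tokens, d_small) — all models/layers share one token sequence

# Lightweight transformer stand-in: one attention block, then a linear head.
Wq, Wk, Wv = (rng.standard_normal((d_small, d_small)) * 0.1 for _ in range(3))
fused = self_attention(tokens, Wq, Wk, Wv)

W_head = rng.standard_normal((d_small, n_classes)) * 0.1
logits = fused.mean(axis=0) @ W_head  # mean-pool tokens, then classify
print(logits.shape)  # (10,)
```

Only the projections, attention weights, and head would receive gradients during adaptation; the backbone activations are treated as fixed inputs, which is what makes this style of probing cheap.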