🤖 AI Summary
This work investigates whether heterogeneous vision foundation models (VFMs) can be efficiently integrated through model stitching, and how to mitigate the performance degradation commonly caused by shallow stitch points. To this end, the authors propose a unified stitching protocol that covers the choice of stitch points, the design of lightweight stitch layers, a feature-matching loss defined at the target model's penultimate layer, and an end-to-end fine-tuning strategy. They further introduce the VFM Stitch Tree architecture, which lets multiple VFMs share early layers while preserving their distinct later layers, yielding a controllable trade-off between accuracy and inference latency. Experiments show that the proposed feature-matching loss makes heterogeneous VFMs reliably stitchable across diverse vision tasks, and that at deep stitch points the stitched model can surpass either constituent model while incurring only a small inference overhead.
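The VFM Stitch Tree idea above can be illustrated with a toy NumPy sketch (this is not the authors' code; all names, layer counts, and the random-MLP stand-ins for transformer blocks are illustrative): several VFMs share one early-layer trunk, which is computed once per input, and each model keeps its own later layers as a branch.

```python
import numpy as np

# Toy sketch of a "stitch tree": a shared early trunk plus per-VFM tails,
# so the trunk runs once per image instead of once per model.
rng = np.random.default_rng(0)
D = 16  # toy feature width

def make_mlp(depth, d=D):
    """A stack of random linear+ReLU layers standing in for transformer blocks."""
    Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]
    def forward(x):
        for W in Ws:
            x = np.maximum(x @ W, 0.0)
        return x
    return forward

shared_trunk = make_mlp(depth=4)                          # early layers shared by all VFMs
branches = {"clip": make_mlp(2), "dinov2": make_mlp(2)}   # model-specific later layers

x = rng.standard_normal((1, D))   # one toy "image" embedding
h = shared_trunk(x)               # trunk computed once
features = {name: tail(h) for name, tail in branches.items()}
```

Moving the shared/branch split deeper or shallower in the tree is what gives the accuracy-latency knob: a deeper shared trunk saves more compute but leaves each VFM less of its own specialized layers.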
📝 Abstract
Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning stitch points, stitch-layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model with only a small inference overhead (from the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
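The stitching setup and the penultimate-layer feature-matching loss can be sketched in toy NumPy form (a minimal illustration under assumed toy models, not the paper's implementation; real code would backpropagate through the actual VFMs rather than use the finite-difference gradient shown here): the source's early layers feed through a lightweight linear stitch layer into the target's later layers, and the stitch layer is trained so the stitched model's penultimate features match the target's own.

```python
import numpy as np

# Toy stitching sketch: source front half -> linear stitch layer -> target
# back half. Train the stitch layer S with a feature-matching loss at the
# target's penultimate layer (illustrative, not the authors' code).
rng = np.random.default_rng(1)
D = 8

def block(W):
    return lambda x: np.maximum(x @ W, 0.0)  # linear + ReLU as a toy block

src_early = block(rng.standard_normal((D, D)) / np.sqrt(D))  # source early layers
tgt_early = block(rng.standard_normal((D, D)) / np.sqrt(D))  # target early layers
tgt_late = block(rng.standard_normal((D, D)) / np.sqrt(D))   # target later layers

S = np.eye(D)  # lightweight stitch layer, initialized to identity

x = rng.standard_normal((32, D))
target_feats = tgt_late(tgt_early(x))  # target's own penultimate features

def loss(S):
    """Feature-matching loss at the target's penultimate layer."""
    stitched_feats = tgt_late(src_early(x) @ S)
    return np.mean((stitched_feats - target_feats) ** 2)

# One gradient step, with the gradient estimated by central finite
# differences purely for self-containedness.
before = loss(S)
eps, lr = 1e-4, 1e-3
grad = np.zeros_like(S)
for i in range(D):
    for j in range(D):
        E = np.zeros_like(S)
        E[i, j] = eps
        grad[i, j] = (loss(S + E) - loss(S - E)) / (2 * eps)
S -= lr * grad
after = loss(S)
```

Note the contrast with matching features at the stitch point itself: here the loss only cares that the features the target's head ultimately consumes look right, which is the property the paper identifies as making shallow stitch points workable.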