🤖 AI Summary
This work addresses the challenge of efficiently distilling region-level multimodal semantic knowledge from large vision-language teacher models (e.g., LLaVA) into lightweight, text-free pure-vision detectors (e.g., YOLO), without modifying the teacher architecture or requiring textual input during inference. The proposed cross-architecture knowledge distillation method introduces a learnable translation module that maps student visual features into the teacher's joint embedding space, enabling object-level multimodal-to-unimodal knowledge transfer, which the authors present as the first such approach. A dual-objective loss jointly enforces local region alignment and global relational consistency between teacher and student representations. Evaluated on four personalized detection benchmarks under few-shot regimes, the method achieves an average +10.1 score improvement over baseline detectors, matching or surpassing significantly larger multimodal models while remaining compact enough for real-world deployment.
📝 Abstract
We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVA) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
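To make the dual-objective idea concrete, here is a minimal numpy sketch of what a loss of this kind could look like. This is not the paper's actual formulation: the translator is reduced to a hypothetical linear map, the local term is a per-object cosine distance between paired teacher and translated-student region embeddings, and the global term penalizes disagreement between the two pairwise similarity matrices (relational consistency). All function names and the weighting `lam` are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Row-wise L2 normalization of a (num_objects, dim) matrix."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def translate(student_feats, W, b):
    """Hypothetical linear translation module mapping student
    features into the teacher's embedding space (illustrative only)."""
    return student_feats @ W + b

def dual_objective_loss(teacher_feats, student_feats, W, b, lam=1.0):
    """Sketch of a dual-objective distillation loss:
    - local term: mean cosine distance between matched teacher and
      translated-student region embeddings (local alignment);
    - global term: mean squared difference between the teacher and
      student pairwise cosine-similarity matrices (global relational
      consistency across objects in the scene)."""
    s = l2_normalize(translate(student_feats, W, b))
    t = l2_normalize(teacher_feats)
    local = np.mean(1.0 - np.sum(t * s, axis=1))    # per-object alignment
    global_rel = np.mean((t @ t.T - s @ s.T) ** 2)  # relational consistency
    return local + lam * global_rel
```

With an identity translator and identical teacher/student features, both terms vanish, which is the intended fixed point: the student's translated region embeddings reproduce the teacher's, individually and relationally.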