MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

📅 2025-09-17
🤖 AI Summary
This work addresses the challenge of efficiently distilling region-level multimodal semantic knowledge from large vision-language teacher models (e.g., LLaVA) into lightweight, text-free, vision-only detectors (e.g., YOLO), without modifying the teacher architecture or requiring textual input at inference. The proposed cross-architecture knowledge distillation method introduces a learnable translation module that maps student visual features into the teacher's joint embedding space, enabling object-level multimodal-to-unimodal knowledge transfer; the authors present this as the first such approach. A dual-objective loss jointly enforces local region alignment and global relational consistency between teacher and student representations. Evaluated on four few-shot detection benchmarks, the method achieves a +10.1 average score improvement over baseline detectors, matching or surpassing significantly larger multimodal models while remaining practical to deploy.

📝 Abstract
We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVA) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
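
The core mechanism is the translation step described above. Below is a minimal PyTorch-style sketch of such a module; the class name, feature dimensions, and two-layer MLP design are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a translation module: dimensions and architecture
# are assumptions for illustration only.
import torch
import torch.nn as nn

class Translator(nn.Module):
    """Maps pooled student object features into the teacher's joint embedding space."""
    def __init__(self, student_dim: int = 256, teacher_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.GELU(),
            nn.Linear(teacher_dim, teacher_dim),
        )

    def forward(self, student_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (num_objects, student_dim) region features from the detector
        return self.proj(student_feats)

# Usage: translated features can then be compared against frozen teacher
# region embeddings of the same objects.
translator = Translator()
student_regions = torch.randn(8, 256)   # 8 detected objects (dummy data)
aligned = translator(student_regions)   # (8, 4096), lives in teacher space
```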
Problem

Research questions and friction points this paper is trying to address.

How to transfer region-level multimodal semantics from a large vision-language teacher to a compact vision-only student across mismatched architectures
How to achieve object-level alignment without textual input or changes to the teacher at inference
How to improve few-shot detection performance without the deployment cost of large multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-architecture knowledge distillation that transfers object-level multimodal semantics to a unimodal detector
A learnable translation module that maps student visual features into the teacher's joint embedding space
A dual-objective loss that enforces local region alignment and global relational consistency (see the sketch below)
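
As a concrete illustration of the dual objective, here is a hedged sketch: a local term aligning each translated student object feature with its matching teacher region embedding, plus a global term matching the pairwise object-to-object similarity structure. The distance choices, weighting, and normalization below are assumptions for illustration, not the paper's exact formulation.

```python
# Assumed formulation of a dual-objective loss in the spirit described above;
# the actual MOCHA loss may differ.
import torch
import torch.nn.functional as F

def mocha_style_loss(student_aligned: torch.Tensor,
                     teacher_embed: torch.Tensor,
                     global_weight: float = 1.0) -> torch.Tensor:
    # Local region alignment: mean cosine distance per matched object pair.
    s = F.normalize(student_aligned, dim=-1)
    t = F.normalize(teacher_embed, dim=-1)
    local = (1.0 - (s * t).sum(dim=-1)).mean()

    # Global relational consistency: match object-to-object similarity matrices.
    rel_s = s @ s.t()          # (N, N) student relations
    rel_t = t @ t.t()          # (N, N) teacher relations
    global_rel = F.mse_loss(rel_s, rel_t)

    return local + global_weight * global_rel
```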
👥 Authors
Elena Camuffo, University of Padova (Scene Understanding, Representation Learning, Neural Rendering, Computer Graphics, Extended Reality)
Francesco Barbato, Samsung R&D Institute UK, United Kingdom
Mete Ozay, Samsung R&D Institute UK, United Kingdom
Simone Milani, Associate Professor, University of Padova, Padova, Italy (computer vision, signal processing, source coding, multimedia forensics)
Umberto Michieli, Samsung R&D Institute UK, United Kingdom