MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

📅 2025-09-17
🤖 AI Summary
This work addresses the challenge of efficiently distilling region-level multimodal semantic knowledge from large vision-language teacher models (e.g., LLaVA) into lightweight, text-free, vision-only detectors (e.g., YOLO), without modifying the teacher architecture or requiring textual input at inference. The proposed cross-architecture knowledge distillation method introduces a learnable translation module that maps student visual features into the teacher's joint embedding space, enabling object-level multimodal-to-unimodal knowledge transfer; the authors present this as the first such approach. A dual-objective loss jointly enforces local region alignment and global relational consistency between teacher and student representations. Evaluated on four few-shot detection benchmarks, the method achieves a +10.1 average score improvement over baseline detectors, matching or surpassing significantly larger multimodal models while remaining practical to deploy.

📝 Abstract
We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVA) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
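
The core mechanism is the translation step described above. Below is a minimal PyTorch-style sketch of such a module; the class name, feature dimensions, and two-layer MLP design are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a translation module: dimensions and architecture
# are assumptions for illustration only.
import torch
import torch.nn as nn

class Translator(nn.Module):
    """Maps pooled student object features into the teacher's joint embedding space."""
    def __init__(self, student_dim: int = 256, teacher_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.GELU(),
            nn.Linear(teacher_dim, teacher_dim),
        )

    def forward(self, student_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (num_objects, student_dim) region features from the detector
        return self.proj(student_feats)

# Usage: translated features can then be compared against frozen teacher
# region embeddings of the same objects.
translator = Translator()
student_regions = torch.randn(8, 256)   # 8 detected objects (dummy data)
aligned = translator(student_regions)   # (8, 4096), lives in teacher space
```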
Problem

Research questions and friction points this paper is trying to address.

How to transfer region-level multimodal semantics from a large vision-language teacher to a compact vision-only student across mismatched architectures
How to achieve object-level alignment without textual input or changes to the teacher at inference
How to improve few-shot detection performance without the deployment cost of large multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-architecture knowledge distillation that transfers object-level multimodal semantics to a unimodal detector
A learnable translation module that maps student visual features into the teacher's joint embedding space
A dual-objective loss that enforces local region alignment and global relational consistency (see the sketch below)
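
As a concrete illustration of the dual objective, here is a hedged sketch: a local term aligning each translated student object feature with its matching teacher region embedding, plus a global term matching the pairwise object-to-object similarity structure. The distance choices, weighting, and normalization below are assumptions for illustration, not the paper's exact formulation.

```python
# Assumed formulation of a dual-objective loss in the spirit described above;
# the actual MOCHA loss may differ.
import torch
import torch.nn.functional as F

def mocha_style_loss(student_aligned: torch.Tensor,
                     teacher_embed: torch.Tensor,
                     global_weight: float = 1.0) -> torch.Tensor:
    # Local region alignment: mean cosine distance per matched object pair.
    s = F.normalize(student_aligned, dim=-1)
    t = F.normalize(teacher_embed, dim=-1)
    local = (1.0 - (s * t).sum(dim=-1)).mean()

    # Global relational consistency: match object-to-object similarity matrices.
    rel_s = s @ s.t()          # (N, N) student relations
    rel_t = t @ t.t()          # (N, N) teacher relations
    global_rel = F.mse_loss(rel_s, rel_t)

    return local + global_weight * global_rel
```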
👥 Authors
Elena Camuffo, University of Padova (Scene Understanding, Representation Learning, Neural Rendering, Computer Graphics, Extended Reality)
Francesco Barbato, Samsung R&D Institute UK, United Kingdom
Mete Ozay, Samsung R&D Institute UK, United Kingdom
Simone Milani, Associate Professor, University of Padova, Padova, Italy (computer vision, signal processing, source coding, multimedia forensics)
Umberto Michieli, Samsung R&D Institute UK, United Kingdom