Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of deploying large vision-language models like CLIP on resource-constrained edge devices—e.g., GPU-less retrofit automotive cameras—this paper proposes the first cross-architecture knowledge distillation framework tailored for edge deployment, enabling transfer of CLIP’s zero-shot capabilities to lightweight models. Methodologically, it employs an EfficientNet-B3 backbone coupled with a multi-layer MLP projection head, jointly optimizing a contrastive learning objective and a cross-modal feature alignment loss to compress text-image joint embeddings. Evaluated on an ARM Cortex-A72 processor with 2 GB RAM, the distilled model achieves real-time inference at 23 FPS while consuming under 180 MB memory. It attains zero-shot classification accuracy exceeding 92% of the original CLIP’s performance, marking the first demonstration of practical zero-shot image annotation on low-cost automotive edge hardware.
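The summary describes a joint objective: a contrastive term pulling each student embedding toward CLIP's embedding of the same image, plus a direct cross-modal feature alignment term. A minimal numpy sketch of what such a combined loss could look like is below; the function name, `temperature`, and the `alpha` weighting are illustrative assumptions, not values from the paper.

```python
import numpy as np

def distillation_loss(student_emb, teacher_emb, temperature=0.07, alpha=0.5):
    """Hypothetical combined distillation objective (sketch):
    - contrastive term: within a batch, the student's embedding of an
      image should be most similar to the teacher's (CLIP's) embedding
      of that same image (the diagonal of the similarity matrix)
    - alignment term: direct L2 distance between normalized features
    """
    # L2-normalize both embedding sets, as CLIP does before similarity.
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)

    # Contrastive term: cross-entropy over similarity logits, where the
    # matching teacher embedding (diagonal entry) is the positive.
    logits = s @ t.T / temperature                 # shape (N, N)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = -np.mean(np.diag(log_probs))

    # Cross-modal feature alignment term: mean squared error.
    alignment = np.mean((s - t) ** 2)

    return alpha * contrastive + (1 - alpha) * alignment
```

When student and teacher embeddings agree, both terms vanish toward their minima; mismatched pairs drive the loss up, which is the gradient signal the student trains on.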

📝 Abstract
Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device retrofitted into thousands of vehicles, despite strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads to preserve cross-modal alignment while significantly reducing computational requirements. We demonstrate that our distilled model achieves a balance between efficiency and performance, making it ideal for deployment in real-world scenarios. Experimental results show that Clip4Retrofit can perform real-time image labeling and object identification on edge devices with limited resources, offering a practical solution for applications such as autonomous driving and retrofitting existing systems. This work bridges the gap between state-of-the-art vision-language models and their deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing.
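The zero-shot labeling the abstract describes works because the student inherits CLIP's joint embedding space: label-text embeddings can be precomputed offline with CLIP's text encoder, so only the lightweight image encoder runs on-device. A minimal sketch of that inference step follows; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def zero_shot_label(image_emb, text_embs, labels):
    """Hypothetical sketch of on-device zero-shot labeling: compare the
    student's image embedding against precomputed CLIP text embeddings
    for each candidate label, and return the closest label by cosine
    similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per label
    return labels[int(np.argmax(sims))]
```

Because the text side is frozen and precomputed, adding a new label class requires no retraining on the edge device, only one new text embedding.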
Problem

Research questions and friction points this paper is trying to address.

Reducing CLIP model complexity for edge devices
Enabling real-time image labeling on resource-limited hardware
Distilling CLIP into lightweight EfficientNet-B3 with MLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-architecture CLIP distillation for edge devices
Lightweight EfficientNet-B3 with MLP projection heads
Real-time image labeling on resource-constrained hardware
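The innovations above center on pairing an EfficientNet-B3 backbone with MLP projection heads that map backbone features into CLIP's joint embedding space. A minimal numpy sketch of such a head is shown below; the layer dimensions (1536-d pooled EfficientNet-B3 features, 512-d CLIP space) and the single-hidden-layer shape are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

class MLPProjectionHead:
    """Hypothetical sketch of the student's projection head: maps
    backbone features (e.g. EfficientNet-B3's pooled output) into
    CLIP's joint embedding space. Dimensions are assumptions."""

    def __init__(self, in_dim=1536, hidden_dim=1024, out_dim=512, seed=0):
        rng = np.random.default_rng(seed)
        # He-style initialization for the ReLU hidden layer.
        self.w1 = rng.normal(0, np.sqrt(2 / in_dim), (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0, np.sqrt(2 / hidden_dim), (hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)            # ReLU
        z = h @ self.w2 + self.b2
        # Unit-normalize so outputs live on the same hypersphere as
        # CLIP embeddings, ready for cosine-similarity comparison.
        return z / np.linalg.norm(z, axis=-1, keepdims=True)
```

Keeping the head small (a few dense layers) is what lets the whole student fit the reported sub-180 MB memory budget while still producing embeddings comparable against CLIP's text encoder.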
Li Zhong
High-Performance Computing Center Stuttgart (HLRS)
Big Data · Machine Learning · Deep Learning · HPC
Ahmed Ghazal
Robert Bosch GmbH
Jun-Jun Wan
Robert Bosch GmbH
Frederik Zilly
Robert Bosch GmbH
Patrick Mackens
Robert Bosch GmbH
Joachim E. Vollrath
Robert Bosch GmbH