🤖 AI Summary
To meet the dual requirements of high accuracy (especially for rare defects) and low latency in industrial robot vision inspection for smart manufacturing, this paper proposes a novel cloud-edge collaborative paradigm. Large vision models (e.g., SAM) are deployed in the cloud and lightweight semantic segmentation models on edge devices, enabling adaptive collaborative inference and online model distillation guided by hard-sample mining. We introduce the first plug-and-play architecture for integrating cooperating large and small models, supporting supervised edge-model updates and policy evolution under dynamic data drift. Evaluated on a real-world robotic semantic segmentation system, our approach achieves an 8.2% accuracy gain over state-of-the-art methods, reduces end-to-end latency by 37%, and cuts communication overhead by 51%, while adapting well to changing environments; a theoretical analysis establishes the framework's feasibility.
📝 Abstract
Recent large vision models (e.g., SAM) hold great potential to facilitate intelligent perception with high accuracy. Yet, resource constraints in IoT environments prevent such large vision models from being deployed locally, and offloading inference to the cloud incurs considerable latency, making it difficult to support real-time applications such as autonomous driving and robotics. Edge-cloud collaboration with large-small model co-inference offers a promising approach to achieving both high inference accuracy and low latency. However, existing edge-cloud collaboration methods are tightly coupled with the model architecture and cannot adapt to the dynamic data drifts of heterogeneous IoT environments. To address these issues, we propose LAECIPS, a new edge-cloud collaboration framework in which both the large vision model on the cloud and the lightweight model on the edge are plug-and-play. We design an edge-cloud collaboration strategy based on hard input mining, optimized for both high accuracy and low latency. The edge model and its collaboration strategy with the cloud are updated under the supervision of the large vision model, so as to adapt to dynamic IoT data streams. A theoretical analysis of LAECIPS proves its feasibility. Experiments conducted on a robotic semantic segmentation system with real-world datasets show that LAECIPS outperforms its state-of-the-art competitors in accuracy, latency, and communication overhead, while adapting better to dynamic environments.
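The hard-input-mining collaboration strategy described above can be sketched as a confidence-based router: the edge model's prediction is kept when it looks reliable, and the input is offloaded to the cloud model otherwise. This is a minimal illustrative sketch only; the names (`edge_confidence`, `co_infer`, `CONF_THRESHOLD`), the mean-max-probability confidence proxy, and the fixed threshold are assumptions for exposition, not the actual LAECIPS algorithm or API.

```python
# Illustrative sketch of confidence-based edge-cloud co-inference routing.
# All identifiers and the confidence heuristic are assumptions, not the
# paper's actual method.
import numpy as np

CONF_THRESHOLD = 0.7  # assumed tunable threshold separating easy/hard inputs


def edge_confidence(softmax_map: np.ndarray) -> float:
    """Mean max-class probability over pixels of an (H, W, C) softmax map.

    A common confidence proxy for segmentation outputs; the paper may use
    a different hard-input criterion.
    """
    return float(softmax_map.max(axis=-1).mean())


def co_infer(softmax_map: np.ndarray, offload_to_cloud):
    """Keep the edge prediction for easy inputs; offload hard ones.

    `offload_to_cloud` stands in for a call to the large cloud model and
    returns a per-pixel label map.
    """
    if edge_confidence(softmax_map) >= CONF_THRESHOLD:
        return softmax_map.argmax(axis=-1), "edge"
    return offload_to_cloud(softmax_map), "cloud"
```

In such a scheme, the offloaded (hard) samples can also serve as the distillation set for updating the edge model under the cloud model's supervision, which is how the framework couples co-inference with online adaptation.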