PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To meet the demand for low-latency, privacy-preserving image segmentation in edge vision applications (e.g., smart glasses, IoT devices), this work proposes the first promptable segmentation framework lightweight enough to run on the sensor itself. Methodologically, it combines a depthwise-separable U-Net backbone, fixed-point prompt encoding, and joint optimization via knowledge distillation and quantization-aware compression. Within a stringent resource budget of 1.3M parameters (1.22 MB memory footprint), the model achieves real-time inference: 14.3 ms latency on the Sony IMX500 intelligent vision sensor at an efficiency of 86 MACs/cycle. It reaches 51.9% and 44.9% mIoU on the COCO and LVIS benchmarks, respectively, with the distilled variant showing significant gains on LVIS. This is the first work to embed promptable segmentation directly in the camera sensing layer, balancing real-time operation, on-device privacy preservation, and practical deployment feasibility.
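The depthwise-separable backbone is the main reason the MAC count stays near 336M. A back-of-envelope sketch of the savings for a single convolution layer (the layer shape below is illustrative, not taken from the paper):

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

def ds_conv_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise-separable conv: depthwise k x k + pointwise 1 x 1."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Illustrative layer shape (not from the paper): 64x64 map, 64 -> 128 channels, 3x3 kernel
std = conv_macs(64, 64, 64, 128, 3)
ds = ds_conv_macs(64, 64, 64, 128, 3)
print(std / ds)  # roughly 8.4x fewer MACs
```

For a 3×3 kernel the reduction approaches 9× as the output channel count grows, which is how a U-Net-shaped backbone can fit an in-sensor compute budget.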

📝 Abstract
Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net, with knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22 MB) runs at 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
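The distillation step trains the small student against SAM2's soft mask predictions in addition to the ground-truth mask. A minimal sketch of one plausible per-pixel loss (the blend weight `alpha` and the BCE form are assumptions for illustration; the paper's exact loss may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distillation_loss(student_logits, teacher_logits, gt_mask, alpha=0.5):
    """Blend supervision from the ground-truth mask with the teacher's
    (SAM2's) soft mask. alpha=0.5 is an illustrative assumption."""
    eps = 1e-7
    p = np.clip(sigmoid(student_logits), eps, 1 - eps)
    t = sigmoid(teacher_logits)
    # Binary cross-entropy against the hard ground truth...
    bce_gt = -(gt_mask * np.log(p) + (1 - gt_mask) * np.log(1 - p))
    # ...and against the teacher's soft probabilities.
    bce_kd = -(t * np.log(p) + (1 - t) * np.log(1 - p))
    return float(np.mean(alpha * bce_gt + (1 - alpha) * bce_kd))
```

The soft teacher term is what carries SAM2's knowledge of rare categories into the student, consistent with the reported LVIS gains from distillation.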
Problem

Research questions and friction points this paper is trying to address.

Enabling real-time segmentation for edge vision applications
Optimizing lightweight model for in-sensor execution constraints
Achieving privacy-preserving vision without cloud processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight depthwise separable U-Net model
Knowledge distillation from SAM2
Fixed-point prompt encoding for efficiency
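The fixed-point prompt encoding lets a click prompt travel through the quantized pipeline without floating-point math. A minimal sketch of one plausible scheme (the bit width and normalization here are assumptions, not the paper's published format):

```python
def quantize_prompt(x, y, img_w, img_h, bits=8):
    """Quantize a click prompt to unsigned fixed-point grid coordinates.
    bits=8 and per-axis normalization are illustrative assumptions."""
    scale = (1 << bits) - 1
    qx = round(x / (img_w - 1) * scale)
    qy = round(y / (img_h - 1) * scale)
    return qx, qy

def dequantize_prompt(qx, qy, img_w, img_h, bits=8):
    """Recover approximate pixel coordinates from the fixed-point codes."""
    scale = (1 << bits) - 1
    return qx / scale * (img_w - 1), qy / scale * (img_h - 1)

# A click at (120, 200) in a 640x480 frame round-trips to within a pixel or two.
qx, qy = quantize_prompt(120, 200, 640, 480)
```

Keeping the prompt as small integers matches the rest of the quantized model, so the whole inference path stays integer-only on the sensor.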