🤖 AI Summary
This work proposes a lightweight, promptable visual segmentation model tailored for edge devices such as the Sony IMX500 image sensor, addressing the demand for low-latency, privacy-preserving inference in smart glasses and IoT applications. The approach combines a dense CNN backbone, region-of-interest prompt encoding, and an Efficient Channel Attention mechanism, enabling high-quality, spatially flexible promptable segmentation directly on the image sensor for the first time. By distilling knowledge from large models such as SAM2 and SAM3, the compact model gains up to 14.5% mIoU over supervised training alone, reaching 65.45% mIoU on COCO and 64.01% on LVIS. With INT8 quantization, it delivers real-time inference at 11.82 ms on the IMX500 with negligible accuracy degradation.
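The summary does not include code, but the channel-attention component it names can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the kernel size and the fixed averaging weights below are stand-ins (ECA learns the 1-D convolution weights during training).

```python
import numpy as np

def efficient_channel_attention(x, k=3):
    """Minimal ECA sketch for an NCHW tensor.

    ECA: global-average-pool each channel, run a 1-D convolution of
    kernel size k across the channel descriptors, squash with a sigmoid,
    and rescale the input channels by the resulting gates.
    The averaging kernel below is a stand-in for learned weights.
    """
    n, c, h, w = x.shape
    y = x.mean(axis=(2, 3))                       # (n, c) per-channel descriptors
    pad = k // 2
    yp = np.pad(y, ((0, 0), (pad, pad)), mode="edge")
    kernel = np.full(k, 1.0 / k)                  # illustrative fixed weights
    conv = np.stack([np.convolve(row, kernel, mode="valid") for row in yp])
    gate = 1.0 / (1.0 + np.exp(-conv))            # sigmoid gates, shape (n, c)
    return x * gate[:, :, None, None]
```

Because attention here is a single cheap 1-D convolution over pooled descriptors rather than a full squeeze-and-excitation MLP, the parameter and compute overhead stays negligible, which is what makes this style of attention attractive for a 1.3M-parameter in-sensor model.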
📝 Abstract
Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3M parameters and combines a dense CNN architecture with region-of-interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8-quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to a +14.5% mIoU improvement over supervised training alone and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
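The abstract does not specify the distillation objective. A common setup for mask distillation, sketched below under that assumption, blends a supervised loss against ground-truth masks with a loss against the teacher's soft mask probabilities; the blend weight `alpha` and the use of plain binary cross-entropy are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def _bce(logits, targets, eps=1e-7):
    """Pixelwise binary cross-entropy on raw logits."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(targets * np.log(p + eps)
                    + (1 - targets) * np.log(1 - p + eps))

def mask_distillation_loss(student_logits, teacher_probs, gt_mask, alpha=0.5):
    """Hypothetical mask-distillation objective: a weighted sum of a
    supervised term (vs. ground truth) and a distillation term
    (vs. the large teacher's soft masks, e.g. from SAM2/SAM3)."""
    supervised = _bce(student_logits, gt_mask)
    distill = _bce(student_logits, teacher_probs)
    return alpha * supervised + (1 - alpha) * distill
```

The distillation term gives the small student dense, soft supervision on every pixel, including images without human labels, which is the usual explanation for why distilling from a large teacher can outperform supervised training alone.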