🤖 AI Summary
This work addresses the limitations of existing Segment Anything Models (SAMs), which rely heavily on large-scale RGB datasets, incur high computational costs, and lack geometric awareness. To overcome these challenges, we propose a lightweight RGB-D fusion framework that, for the first time, integrates monocular depth priors into EfficientViT-SAM. Specifically, depth maps generated by a pretrained monocular depth estimator are encoded through a dedicated depth encoder and fused with RGB features at the intermediate representation level. Our approach achieves substantial improvements in segmentation accuracy using only 11.2k training samples (less than 0.1% of the SA-1B dataset), dramatically reducing dependence on massive annotated data while maintaining efficient inference.
📝 Abstract
Segment Anything Models (SAMs) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator, encoded by a dedicated depth encoder, and fused with RGB features at the intermediate representation level. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
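To make the fusion idea concrete, the following is a minimal sketch of mid-level RGB-D feature fusion under stated assumptions: the paper does not publish its architecture here, so the encoder (a random linear projection standing in for a learned depth encoder), the element-wise additive fusion, and all names (`encode_depth`, `fuse`, the 256-channel feature size) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def encode_depth(depth_map: np.ndarray, out_channels: int = 256) -> np.ndarray:
    """Toy stand-in for a learned depth encoder (assumption, not the paper's
    module): project an (H, W) monocular depth map to (out_channels, H, W)
    features via a fixed random per-channel scaling."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((out_channels, 1, 1)) * 0.01
    # Broadcast the (C, 1, 1) weights over the (1, H, W) depth map.
    return w * depth_map[None, :, :]

def fuse(rgb_feats: np.ndarray, depth_feats: np.ndarray) -> np.ndarray:
    """Mid-level fusion sketched as element-wise addition of same-shaped
    RGB and depth feature maps (one common choice; the paper's exact
    fusion operator may differ)."""
    assert rgb_feats.shape == depth_feats.shape
    return rgb_feats + depth_feats

# Example: fuse dummy mid-level RGB features with encoded depth.
H, W, C = 16, 16, 256
rgb_feats = np.zeros((C, H, W))
depth_map = np.ones((H, W))          # e.g. output of a monocular estimator
fused = fuse(rgb_feats, encode_depth(depth_map, C))
print(fused.shape)  # (256, 16, 16)
```

The key point the sketch illustrates is that depth is injected at the feature level, after both modalities have been encoded, rather than by concatenating raw depth as a fourth input channel.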