🤖 AI Summary
This work addresses the challenge of real-time online 3D scene perception on edge devices, where limited computational resources and the absence of GPU acceleration hinder deployment. Existing approaches such as ESAM rely on a computationally intensive 3D sparse UNet, making them impractical for edge environments. To overcome this, the authors propose ESAM++, a lightweight and scalable online 3D perception framework centered on a 3D Sparse Feature Pyramid Network (SFPN) that efficiently extracts multi-scale geometric features from streaming point clouds. By integrating a 3D-adapted Segment Anything Model with edge-computing optimizations, the method achieves segmentation accuracy comparable to ESAM on ScanNet, ScanNet200, SceneNN, and 3RScan, while halving the model size and running inference up to three times faster, which enables efficient GPU-free 3D instance segmentation on edge devices.
📝 Abstract
Real-time online 3D scene perception is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods such as EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time and hinders its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, and show that our model achieves competitive accuracy with up to 3 times faster inference and half the model size of ESAM, enabling practical deployment on edge devices.
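To make the idea of multi-scale features over a sparse point cloud concrete, the following is a minimal sketch of a generic sparse voxel pyramid with top-down fusion. It is not the SFPN architecture from the paper, whose details are not given here. All function names, the scalar per-point feature, the voxel size, and the additive fusion rule are assumptions chosen for illustration. Only occupied voxels are stored, which is what keeps the representation sparse.

```python
import math
from collections import defaultdict

def base_keys(points, voxel_size):
    """Map each (x, y, z) point to an integer voxel index at the finest level."""
    return [tuple(math.floor(c / voxel_size) for c in p) for p in points]

def sparse_pyramid(points, feats, voxel_size=0.1, levels=3):
    """Build `levels` sparse grids (dicts); each level is twice as coarse.

    Hypothetical helper: a voxel's feature is the mean of the scalar
    features of the points falling inside it.
    """
    keys = base_keys(points, voxel_size)
    pyramid = []
    for i in range(levels):
        acc = defaultdict(lambda: [0.0, 0])  # voxel -> [feature sum, count]
        for key, f in zip(keys, feats):
            coarse = tuple(c // 2 ** i for c in key)  # integer floor division
            acc[coarse][0] += f
            acc[coarse][1] += 1
        pyramid.append({k: s / n for k, (s, n) in acc.items()})
    return pyramid

def top_down_fuse(pyramid):
    """Add each voxel's parent feature to it, starting from the coarsest level."""
    fused = [dict(level) for level in pyramid]
    for i in range(len(fused) - 2, -1, -1):
        fused[i] = {
            k: f + fused[i + 1].get(tuple(c // 2 for c in k), 0.0)
            for k, f in fused[i].items()
        }
    return fused
```

A learned network would replace the mean pooling and the addition with sparse convolutions and learned lateral connections; the sketch only shows how coarse context can be propagated back to fine voxels without ever materializing a dense grid.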