🤖 AI Summary
To address the high computational cost and low representation efficiency of monocular 3D scene understanding for collective perception in autonomous driving, this paper proposes a lightweight monocular 3D scene representation method. The approach combines fine-grained 3D Stixel units with a learnable clustering mechanism, enabling semantic-aware adaptive clustering that compresses the scene representation while improving object segmentation accuracy. A lightweight neural network takes a single RGB image as input and jointly leverages depth estimation and automatically generated, LiDAR-based ground truth to efficiently produce Stixel representations, which natively support multiple output modalities including point clouds and bird's-eye-view (BEV) maps. Evaluated on the Waymo Open Dataset within a 30-meter range, the method achieves competitive performance with inference times as low as 10 ms per frame, balancing real-time efficiency, accuracy, and compatibility with collective perception systems.
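The sketch below illustrates the general idea of a 3D Stixel record and how such a representation can be converted into point-cloud and BEV outputs. The field names, the pinhole back-projection, and the grid parameters are illustrative assumptions for exposition, not the paper's actual data structures or interfaces.

```python
# Hypothetical sketch: a 3D Stixel record and its conversion to point-cloud
# and BEV outputs. All names and parameters are assumptions, not the paper's API.
from dataclasses import dataclass
import numpy as np


@dataclass
class Stixel3D:
    u: int          # image column of the Stixel (pixels)
    v_top: int      # top row of the Stixel (pixels)
    v_bottom: int   # bottom row of the Stixel (pixels)
    depth: float    # estimated distance along the camera z-axis (meters)
    label: int      # semantic / instance id assigned by clustering


def stixels_to_points(stixels, fx, fy, cx, cy, rows_per_point=4):
    """Back-project each Stixel into a sparse column of 3D points (pinhole model)."""
    points = []
    for s in stixels:
        for v in range(s.v_top, s.v_bottom, rows_per_point):
            x = (s.u - cx) * s.depth / fx
            y = (v - cy) * s.depth / fy
            points.append((x, y, s.depth, s.label))
    return np.array(points, dtype=np.float32)


def stixels_to_bev(stixels, fx, cx, grid_size=60.0, resolution=0.2):
    """Rasterize Stixel footprints into a square BEV occupancy grid (camera frame)."""
    cells = int(grid_size / resolution)
    bev = np.zeros((cells, cells), dtype=np.uint8)
    for s in stixels:
        x = (s.u - cx) * s.depth / fx          # lateral offset in meters
        col = int(x / resolution + cells / 2)  # center the grid laterally on the camera
        row = int(s.depth / resolution)        # forward distance bin
        if 0 <= row < cells and 0 <= col < cells:
            bev[row, col] = 1
    return bev


if __name__ == "__main__":
    demo = [Stixel3D(u=640, v_top=300, v_bottom=420, depth=12.5, label=1)]
    pts = stixels_to_points(demo, fx=2000.0, fy=2000.0, cx=960.0, cy=640.0)
    print(pts.shape, stixels_to_bev(demo, fx=2000.0, cx=960.0).sum())
```

Because each Stixel stores only a column, a vertical extent, and a depth, the full scene compresses to a few hundred such records while still expanding into either output modality on demand.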
📝 Abstract
This paper presents StixelNExT++, a novel approach to scene representation for monocular perception systems. Building on the established Stixel representation, our method infers 3D Stixels and enhances object segmentation by clustering smaller 3D Stixel units. The approach achieves high compression of scene information while remaining adaptable to point cloud and bird's-eye-view representations. Our lightweight neural network, trained on automatically generated LiDAR-based ground truth, achieves real-time performance with computation times as low as 10 ms per frame. Experimental results on the Waymo dataset demonstrate competitive performance within a 30-meter range, highlighting the potential of StixelNExT++ for collective perception in autonomous systems.
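To make the "clustering smaller 3D Stixel units" step concrete, here is a minimal stand-in that groups neighboring Stixel columns into object segments based on column adjacency and depth continuity. The paper describes a learnable clustering mechanism; this greedy distance-based sweep is only an illustrative approximation of what the grouping produces, with thresholds chosen arbitrarily.

```python
# Illustrative stand-in for the object grouping step (not the paper's learned
# clustering): merge adjacent Stixel columns whose depths are continuous.
import numpy as np


def cluster_stixels(columns, depths, max_gap=1, max_depth_jump=1.0):
    """Assign a cluster id to each Stixel from column adjacency and depth continuity.

    columns: 1D array of image-column indices, assumed sorted ascending.
    depths:  1D array of per-Stixel depths in meters.
    """
    labels = np.zeros(len(columns), dtype=np.int32)
    current = 0
    for i in range(1, len(columns)):
        col_gap = columns[i] - columns[i - 1]
        depth_jump = abs(depths[i] - depths[i - 1])
        if col_gap > max_gap or depth_jump > max_depth_jump:
            current += 1          # start a new object segment
        labels[i] = current
    return labels


if __name__ == "__main__":
    cols = np.array([10, 11, 12, 40, 41, 42])
    deps = np.array([8.0, 8.1, 8.0, 20.0, 20.2, 20.1])
    print(cluster_stixels(cols, deps))   # -> [0 0 0 1 1 1]
```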