🤖 AI Summary
To address the dual challenges of inaccurate geometric modeling and the absence of dense 3D supervision in pure-RGB multi-view 3D object detection, this paper proposes a Gaussian-voxel collaborative representation framework. We pioneer the adaptation of generalizable Gaussian splatting to detection tasks and jointly leverage discrete voxel encoding to establish a continuous–discrete complementary geometric representation. A learnable cross-representation enhancement module deeply fuses fine-grained geometric details with global spatial structure. Furthermore, unsupervised geometric consistency constraints and multi-view feature alignment eliminate the need for per-scene optimization and task-specific depth regularization. Our method achieves state-of-the-art performance on ScanNetV2 and ARKitScenes, operating entirely without point clouds, TSDFs, or any other form of depth or dense 3D supervision.
📝 Abstract
Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for the expensive depth sensors required by point-cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details, while voxels provide structured spatial context. We introduce a dual-representation architecture that 1) adapts generalizable Gaussian splatting to extract complementary geometric features for detection tasks, and 2) incorporates a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or use Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both the ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDFs).
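To make the cross-representation idea concrete, here is a minimal NumPy sketch of one plausible form of the enhancement step: per-Gaussian features are scattered into the voxel grid at their centers, and a learnable sigmoid gate decides how much of the splatted detail to inject into each voxel. This is purely illustrative; the function name, the nearest-voxel scattering, and the single gating matrix `gate_w` are assumptions, not the paper's actual module (which the abstract does not specify at this level of detail).

```python
import numpy as np

def fuse_gaussian_voxel(voxel_feat, gauss_feat, gauss_xyz, grid_min, voxel_size, gate_w):
    """Hypothetical cross-representation enhancement (illustrative only).

    voxel_feat : (D, H, W, C) voxel features
    gauss_feat : (N, C) per-Gaussian features
    gauss_xyz  : (N, 3) Gaussian centers in world coordinates
    grid_min   : (3,) world coordinate of the grid origin
    voxel_size : scalar edge length of a voxel
    gate_w     : (2C, C) learnable gating weights (assumed single linear layer)
    """
    D, H, W, C = voxel_feat.shape
    splat = np.zeros_like(voxel_feat)
    count = np.zeros((D, H, W, 1))

    # Scatter each Gaussian's feature into its nearest voxel cell.
    idx = np.floor((gauss_xyz - grid_min) / voxel_size).astype(int)
    idx = np.clip(idx, 0, [D - 1, H - 1, W - 1])
    for (d, h, w), f in zip(idx, gauss_feat):
        splat[d, h, w] += f
        count[d, h, w] += 1
    splat = splat / np.maximum(count, 1)  # average features per cell

    # Learnable gate: how much Gaussian detail to inject into each voxel.
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([voxel_feat, splat], -1) @ gate_w)))
    return voxel_feat + gate * splat
```

Note that voxels receiving no Gaussians are left unchanged (the splatted feature there is zero), so the voxel branch's global spatial context is preserved while occupied regions gain fine-grained geometric detail.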