🤖 AI Summary
This work addresses the challenge of cross-modal RGB-X sensor data alignment, which typically relies on expensive hardware calibration. The authors propose a novel cross-modal view synthesis method that requires no depth or calibration information from the X modality. By leveraging only low-cost COLMAP processing on RGB images, the approach achieves 3D-consistent novel view synthesis through a pipeline comprising RGB-X image matching, confidence-aware guided point cloud densification, self-matching filtering, and integration with 3D Gaussian Splatting. This study presents the first demonstration of high-quality cross-modal alignment in the absence of any 3D priors from the X modality, substantially lowering the barrier to multimodal data acquisition and removing a key bottleneck in scaling real-world RGB-X dataset collection.
📝 Abstract
We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: obtaining aligned RGB-X data. Most prior RGB-X work assumes such pairs already exist and focuses on modality fusion, yet producing them in practice demands substantial engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis, and we then consolidate the results in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for the X sensor and assumes only nearly no-cost COLMAP processing on the RGB images. We aim to remove cumbersome calibration for diverse RGB-X sensors and advance cross-sensor learning with a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection.
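The match-densify-consolidate pipeline described above can be sketched as a toy 1-D skeleton. This is an illustrative sketch only: every function name, the confidence heuristic, and the thresholds are assumptions for exposition, not the authors' implementation, and the final 3DGS consolidation stage is omitted.

```python
# Toy sketch of a match-densify-consolidate pipeline (hypothetical;
# not the paper's actual method). Features and points are 1-D scalars
# for simplicity.

def match_rgb_x(rgb_feats, x_feats):
    # Cross-modal matching stand-in: pair each RGB feature with its
    # nearest X feature and derive a confidence in (0, 1] from the
    # descriptor distance (closer match -> higher confidence).
    matches = []
    for i, f in enumerate(rgb_feats):
        j, d = min(((j, abs(f - g)) for j, g in enumerate(x_feats)),
                   key=lambda t: t[1])
        matches.append((i, j, 1.0 / (1.0 + d)))  # (rgb_idx, x_idx, conf)
    return matches

def densify(sparse_points, matches, conf_thresh=0.5):
    # Confidence-aware guided densification: only matches above the
    # threshold spawn new points (here, midpoints of the matched pair).
    dense = list(sparse_points)
    for i, j, conf in matches:
        if conf >= conf_thresh:
            dense.append((sparse_points[i] + sparse_points[j]) / 2.0)
    return dense

def self_match_filter(points, tol=1e-6):
    # Self-matching filter stand-in: drop near-duplicate points before
    # the cloud would be consolidated in 3DGS.
    kept = []
    for p in points:
        if all(abs(p - q) > tol for q in kept):
            kept.append(p)
    return kept

sparse = [0.0, 1.0, 4.0]
matches = match_rgb_x([0.0, 1.0], [0.1, 3.0])
cloud = self_match_filter(densify(sparse, matches))
```

The confidence threshold is the knob that trades point-cloud density against the risk of propagating bad cross-modal matches; the real method's confidence model is learned from matching scores rather than this distance heuristic.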