🤖 AI Summary
Existing methods assume strictly aligned and complete multimodal 3D data, which makes them ill-suited for real-world scenarios with missing modalities and weak inter-modal alignment. To address this, we propose CrossOver, a scene-level, modality-agnostic framework for flexible alignment. Unlike prior approaches that require per-object annotations, CrossOver uses dimensionality-specific encoders and multi-stage contrastive learning to achieve relaxed alignment of heterogeneous modalities (RGB images, point clouds, CAD models, floor plans, and textual descriptions) within a unified embedding space. It enables robust cross-modal retrieval and zero-shot object localization even when arbitrary modalities are missing. Evaluated on ScanNet and 3RScan, CrossOver improves cross-modal retrieval and localization accuracy by an average of 12.7% over state-of-the-art methods, demonstrating strong generalization and practical deployability in realistic settings.
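To make the core idea concrete, below is a minimal sketch (not the authors' code) of how per-modality encoders can map heterogeneous inputs into a shared embedding space and be aligned with a symmetric InfoNCE-style contrastive loss. The encoder architecture, embedding dimension, and temperature are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not the authors' code): dimensionality-specific
# encoders project different modalities into one shared embedding space and are
# aligned with a symmetric InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Stand-in for a dimensionality-specific encoder (e.g. 1D text, 2D image, 3D point cloud)."""

    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm embeddings so that dot products equal cosine similarities.
        return F.normalize(self.net(x), dim=-1)


def contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two modalities of the same scenes (row i of z_a matches row i of z_b)."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy features standing in for pooled scene-level descriptors of two modalities.
    rgb_feats, pcd_feats = torch.randn(8, 1024), torch.randn(8, 768)
    rgb_enc, pcd_enc = ModalityEncoder(1024), ModalityEncoder(768)
    loss = contrastive_loss(rgb_enc(rgb_feats), pcd_enc(pcd_feats))
    print(f"alignment loss: {loss.item():.3f}")
```

In this sketch the pairing supervision comes only from scenes appearing in both modality batches, which mirrors the relaxed, annotation-free alignment described above.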
📝 Abstract
Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities (RGB images, point clouds, CAD models, floorplans, and text descriptions) with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on the ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting its adaptability for real-world applications in 3D scene understanding.
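The abstract's claim of retrieval under missing modalities can be illustrated with the hedged sketch below: whatever modality embeddings are available for a query scene are fused into one descriptor and matched against a gallery embedded from a different modality. The fusion rule (mean of available embeddings), the function names, and the dimensions are assumptions for illustration, not the paper's pipeline.

```python
# Minimal sketch (assumptions, not the paper's pipeline): cross-modal scene
# retrieval when some query modalities are missing. Available modality embeddings
# are averaged into one scene descriptor and matched by cosine similarity.
from typing import Dict, Optional

import torch
import torch.nn.functional as F


def fuse_available(embeddings: Dict[str, Optional[torch.Tensor]]) -> torch.Tensor:
    """Average the unit-norm embeddings of whichever modalities are present (the fusion rule is an assumption)."""
    present = [F.normalize(e, dim=-1) for e in embeddings.values() if e is not None]
    if not present:
        raise ValueError("at least one modality embedding is required")
    return F.normalize(torch.stack(present).mean(dim=0), dim=-1)


def retrieve(query: Dict[str, Optional[torch.Tensor]], gallery: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    """Return indices of the top-k gallery scenes by cosine similarity to the fused query descriptor."""
    q = fuse_available(query)                # (D,)
    sims = F.normalize(gallery, dim=-1) @ q  # (N,)
    return sims.topk(top_k).indices


if __name__ == "__main__":
    dim, num_scenes = 256, 100
    gallery = torch.randn(num_scenes, dim)   # e.g. point-cloud embeddings of known scenes
    # Query scene observed only through RGB and text; the floorplan modality is missing.
    query = {"rgb": torch.randn(dim), "text": torch.randn(dim), "floorplan": None}
    print(retrieve(query, gallery).tolist())
```

Because all modalities live in the same embedding space, the gallery modality need not match any of the query modalities, which is what makes retrieval with arbitrary missing inputs possible.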