🤖 AI Summary
To address degraded model robustness caused by noisy labels in 2D-3D cross-modal retrieval, this paper proposes the Multi-level cross-modal adaptive Correction and Alignment (MCA) framework. MCA introduces a multimodal joint label correction mechanism that models cross-modal prediction consistency via historical self-predictions, thereby mitigating overfitting to noisy labels. It also designs a hierarchical feature alignment strategy that enables adaptive cross-modal matching at the pixel, region, and semantic levels. By integrating contrastive learning with a self-training paradigm, MCA improves generalization under label noise. Evaluated on standard and noisy 3D benchmarks, including ScanNet and ModelNet, MCA achieves state-of-the-art performance, significantly outperforming existing robust cross-modal retrieval methods.
📝 Abstract
With the increasing availability of 2D and 3D data, significant advancements have been made in the field of cross-modal retrieval. Nevertheless, imperfect annotations present considerable challenges, demanding robust solutions for 2D-3D cross-modal retrieval under noisy labels. Existing methods generally address noise by dividing samples independently within each modality, which makes them susceptible to overfitting on corrupted labels. To address these issues, we propose a robust 2D-3D Multi-level cross-modal adaptive Correction and Alignment framework (MCA). Specifically, we introduce a Multimodal Joint label Correction (MJC) mechanism that leverages multimodal historical self-predictions to jointly model the modality prediction consistency, enabling reliable label refinement. Additionally, we propose a Multi-level Adaptive Alignment (MAA) strategy to effectively enhance cross-modal feature semantics and discrimination across different levels. Extensive experiments demonstrate the superiority of our method, MCA, which achieves state-of-the-art performance on both conventional and realistic noisy 3D benchmarks, highlighting its generality and effectiveness.
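The joint label correction idea described above can be illustrated with a small sketch. This is not the paper's MJC formulation, which the abstract does not specify. It assumes that each modality keeps an exponential moving average of its past softmax outputs per sample, and that a label is replaced only when the 2D and 3D histories agree on a class with high joint confidence. The names `ema_update` and `joint_correct`, the momentum value, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from typing import List


def ema_update(history: List[float], probs: List[float],
               momentum: float = 0.9) -> List[float]:
    """Blend a sample's running prediction history with its current
    class probabilities (assumed update rule, not taken from the paper)."""
    return [momentum * h + (1.0 - momentum) * p
            for h, p in zip(history, probs)]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first index on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def joint_correct(label: int, hist_2d: List[float], hist_3d: List[float],
                  threshold: float = 0.8) -> int:
    """Return a refined label for one sample.

    The given label is replaced only if both modality histories pick the
    same class and their averaged confidence reaches `threshold`
    (a hypothetical criterion); otherwise the original label is kept.
    """
    top_2d = argmax(hist_2d)
    top_3d = argmax(hist_3d)
    if top_2d != top_3d:
        return label  # modalities disagree: do not trust either one
    joint_conf = (hist_2d[top_2d] + hist_3d[top_3d]) / 2.0
    return top_2d if joint_conf >= threshold else label


# Both modalities confidently agree on class 1, so the noisy label 0 is replaced.
corrected = joint_correct(0, [0.05, 0.90, 0.05], [0.10, 0.85, 0.05])
# The modalities disagree, so the given label 2 is kept.
kept = joint_correct(2, [0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

The point of the sketch is the contrast with per-modality sample division: the correction decision uses evidence from both modalities at once, so a label that only one modality has memorized does not pass the agreement check.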