BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation

πŸ“… 2025-06-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address insufficient cross-modal synergy and poor robustness to missing modalities in multi-modal semantic segmentation, this work reformulates the task as mask-level classification and proposes BiXFormer, an RGB/X dual-path architecture that decouples the RGB modality from non-RGB modalities (e.g., depth). It introduces two mechanisms, Unified Modality Matching (UMM) and Cross Modality Alignment (CMA), which enable modality-agnostic label assignment, complementary label redistribution, and dynamic enhancement of weakly matched queries. The method builds on a Transformer-based framework integrating modality-decoupled encoding, mask-level contrastive matching, cross-modal query alignment, and a two-stage label assignment strategy. Evaluated on synthetic and real-world benchmarks, the approach achieves absolute mIoU improvements of 2.75% and 22.74%, respectively, significantly outperforming state-of-the-art methods.

πŸ“ Abstract
Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a mask-level classification task and propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. Specifically, BiXFormer first categorizes multi-modal inputs into RGB and X, where X represents any non-RGB modality, e.g., depth, allowing separate processing for each. This design leverages the well-established pretraining for RGB while addressing the relative lack of attention to X modalities. Then, we propose UMM, which includes Modality Agnostic Matching (MAM) and Complementary Matching (CM). MAM assigns labels to features from all modalities without considering modality differences, leveraging each modality's strengths. CM then reassigns unmatched labels to remaining unassigned features within their respective modalities, ensuring that each available modality contributes to the final prediction and mitigating the impact of missing modalities. Moreover, to further facilitate UMM, we introduce CMA, which enhances the weaker queries assigned in CM by aligning them with optimally matched queries from MAM. Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method, achieving significant mIoU improvements of +2.75% and +22.74% over prior art.
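The two-stage assignment in UMM can be sketched as follows. This is a minimal illustration, not the paper's released code: it assumes mask-level queries from each path are scored against ground-truth labels by a cost matrix, and that matching is solved with the Hungarian algorithm (a common choice in mask-classification frameworks). All function and variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def unified_modality_matching(cost_rgb, cost_x):
    """Hypothetical sketch of UMM's two-stage label assignment.

    cost_rgb, cost_x: (num_queries, num_labels) matching costs for the
    RGB-path and X-path queries against ground-truth masks.
    Returns per-modality {query_index: label_index} assignments.
    """
    num_labels = cost_rgb.shape[1]

    # Stage 1: Modality Agnostic Matching (MAM) -- pool queries from both
    # modalities and let each label go to its globally cheapest query,
    # regardless of which modality that query came from.
    pooled = np.concatenate([cost_rgb, cost_x], axis=0)
    q_idx, l_idx = linear_sum_assignment(pooled)
    rgb_assign, x_assign = {}, {}
    for q, l in zip(q_idx, l_idx):
        if q < cost_rgb.shape[0]:
            rgb_assign[q] = l
        else:
            x_assign[q - cost_rgb.shape[0]] = l

    # Stage 2: Complementary Matching (CM) -- labels a modality did not win
    # in stage 1 are reassigned to its remaining unmatched queries, so every
    # available modality contributes a prediction for every label.
    for cost, assign in ((cost_rgb, rgb_assign), (cost_x, x_assign)):
        missing = [l for l in range(num_labels) if l not in assign.values()]
        free = [q for q in range(cost.shape[0]) if q not in assign]
        if missing and free:
            qi, li = linear_sum_assignment(cost[np.ix_(free, missing)])
            for a, b in zip(qi, li):
                assign[free[a]] = missing[b]
    return rgb_assign, x_assign
```

Because stage 2 operates within each modality independently, a missing modality simply drops out of the pooling in stage 1 while the other modality still ends up covering all labels, which is the robustness property the abstract describes.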
Problem

Research questions and friction points this paper is trying to address.

Maximizing modality effectiveness in multi-modal semantic segmentation
Handling missing modalities in multi-modal data fusion
Improving robustness by leveraging complementary modality strengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

BiXFormer integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA)
Decouples RGB and non-RGB (X) modalities into separate processing paths
Uses MAM and CM as a two-stage strategy for robust label assignment
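The CMA mechanism described above (pulling the weaker, CM-assigned queries toward the optimally matched MAM queries for the same label) can be illustrated with a small sketch. This is an assumption-laden reading of the abstract, not the paper's implementation: it measures alignment as cosine distance between query embeddings, and in actual training this quantity would be a differentiable loss on the query features rather than a plain numpy computation. All names are hypothetical.

```python
import numpy as np


def cross_modality_alignment(queries, mam_assign, cm_assign):
    """Hypothetical sketch of CMA as a cosine-distance alignment term.

    queries: (num_queries_total, dim) query embeddings from both paths.
    mam_assign / cm_assign: {query_index: label_index} dicts from the
    two assignment stages.
    Returns the mean cosine distance between each CM-assigned (weak)
    query and the MAM winner for the same label.
    """
    # Best-matched query per label from the modality-agnostic stage.
    strong = {label: q for q, label in mam_assign.items()}
    distances = []
    for q, label in cm_assign.items():
        if label in strong and strong[label] != q:
            weak = queries[q] / np.linalg.norm(queries[q])
            anchor = queries[strong[label]] / np.linalg.norm(queries[strong[label]])
            distances.append(1.0 - float(weak @ anchor))
    return float(np.mean(distances)) if distances else 0.0
```

Driving this term toward zero pushes each complementary query toward the representation that won the modality-agnostic match, which is one plausible way to realize the "dynamic enhancement of weakly matched queries" the summary mentions.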