🤖 AI Summary
This work addresses the efficiency and generalization bottlenecks in semantic segmentation across arbitrary sensor modalities, which stem from significant modality discrepancies and the repeated development of modality-specific methods. To overcome these challenges, the authors propose a unified multimodal semantic segmentation framework that leverages a Modality-aware CLIP (MA-CLIP) architecture to achieve cross-modal semantic alignment. The framework incorporates Modality-aligned Embeddings to extract fine-grained features and a Domain-specific Refinement Module (DSRM) that dynamically adapts representations. Built on a LoRA-finetuned CLIP backbone, the model enables end-to-end joint training across diverse complementary modalities, including event, thermal, depth, polarization, and light-field data alongside RGB. Evaluated on five heterogeneous modality datasets, the approach achieves a state-of-the-art mean Intersection-over-Union (mIoU) of 65.03%, substantially outperforming existing specialized methods.
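The LoRA fine-tuning mentioned above adapts a frozen pretrained backbone by learning only a low-rank additive update to each targeted weight matrix. A minimal sketch of the idea, with illustrative dimensions and scaling not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight of one linear layer in the backbone
# (dimensions are illustrative, not from the paper).
d_out, d_in = 8, 8
W = rng.standard_normal((d_out, d_in))  # stays frozen during fine-tuning

# LoRA: trainable low-rank update B @ A with rank r << d, scaled by alpha / r.
r, alpha = 2, 4
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init => no change at start

def lora_forward(x):
    """Adapted forward pass: frozen W plus the low-rank trainable update."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.standard_normal((1, d_in))
# With B initialized to zero, the adapted layer exactly matches the frozen one,
# so fine-tuning starts from the pretrained behavior.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only `A` and `B` (2 × r × d parameters per layer instead of d²) receive gradients, which is what makes per-modality adaptation of a large CLIP backbone cheap.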
📝 Abstract
Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and traditional modality-specific configurations result in redundant development effort. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with an mIoU of 65.03%. The code will be released upon acceptance.