OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of a flexible pretrain-and-finetune paradigm for multi-modal visual semantic segmentation, this paper proposes a general-purpose multi-modal semantic segmentation learning framework. Methodologically, it introduces (1) ImageNeXt, a large-scale pretraining dataset that extends ImageNet to five modalities: RGB, depth, event, thermal infrared, and polarization; (2) a unified cross-modal encoding mechanism that supports arbitrary modality combinations and end-to-end joint modeling; and (3) a multi-modal pretraining and task-adaptive fine-tuning pipeline that fuses cross-modal features deeply and optimizes them for segmentation. Evaluated on six benchmarks (NYU Depth v2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360), the framework achieves state-of-the-art performance across all of them, demonstrating improved generalization and robustness to varying or missing modalities.
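Neither the summary nor the abstract includes pseudocode, so the following is a minimal PyTorch sketch of how a unified encoder might accept an arbitrary subset of the five modalities: per-modality patch stems project every input into a shared token space, a learned modality embedding tags each token's source, and one shared backbone processes whatever tokens arrive. The class name `UnifiedCrossModalEncoder`, the stem design, and all hyperparameters are illustrative assumptions, not OmniSegmentor's actual architecture.

```python
# Hypothetical sketch of a unified cross-modal encoder that accepts an
# arbitrary subset of modalities (NOT the paper's actual architecture).
import torch
import torch.nn as nn

MODALITIES = ["rgb", "depth", "event", "thermal", "polarization"]

class UnifiedCrossModalEncoder(nn.Module):
    def __init__(self, dim=256, num_layers=4, num_heads=8, patch=16):
        super().__init__()
        # One lightweight patch-embedding stem per modality; every stem
        # projects into the same token space, so the backbone is modality-agnostic.
        self.stems = nn.ModuleDict({
            m: nn.Conv2d(3 if m == "rgb" else 1, dim, kernel_size=patch, stride=patch)
            for m in MODALITIES
        })
        # A learnable per-modality embedding tells the backbone which
        # sensor each token came from.
        self.modality_embed = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(1, 1, dim)) for m in MODALITIES
        })
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)

    def forward(self, inputs: dict) -> torch.Tensor:
        # `inputs` maps modality name -> (B, C, H, W); any subset works.
        tokens = []
        for name, x in inputs.items():
            t = self.stems[name](x).flatten(2).transpose(1, 2)  # (B, N, dim)
            tokens.append(t + self.modality_embed[name])
        # Concatenate tokens from all present modalities and model them jointly.
        return self.backbone(torch.cat(tokens, dim=1))

# Any combination of modalities can be fed at train or test time:
enc = UnifiedCrossModalEncoder()
feats = enc({
    "rgb": torch.randn(2, 3, 64, 64),
    "depth": torch.randn(2, 1, 64, 64),
})
print(feats.shape)  # torch.Size([2, 32, 256]) -- 16 tokens per modality
```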

📝 Abstract
Recent research on representation learning has proved the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt, which contains five popular visual modalities. 2) We provide an efficient pretraining manner to endow the model with the capacity to encode different modality information in the ImageNeXt. For the first time, we introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of the arbitrary combination of the involved modalities. Remarkably, our OmniSegmentor achieves new state-of-the-art records on a wide range of multi-modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360.
Problem

Research questions and friction points this paper is trying to address.

A flexible pretrain-and-finetune pipeline covering multiple visual modalities remains unexplored
Existing pretraining datasets such as ImageNet cover only RGB, with no large-scale counterpart for depth, event, thermal, or polarization data
Prior models are tied to fixed modality pairs and cannot handle arbitrary modality combinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

ImageNeXt: a large-scale multi-modal pretraining dataset built on ImageNet, spanning RGB, depth, event, thermal infrared, and polarization
An efficient pretraining scheme that teaches a single encoder to represent each modality in ImageNeXt
A universal framework that consistently improves perception under arbitrary modality combinations
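To make the pretrain-and-finetune split concrete, the sketch below (reusing `UnifiedCrossModalEncoder` from the sketch above) pretrains on randomly sampled modality subsets, one plausible way to obtain robustness to arbitrary combinations, then swaps the classification head for a segmentation head. The subset-sampling strategy, loss, and head shapes are assumptions for illustration, not the paper's published recipe.

```python
# Assumed pretrain-then-finetune flow; reuses UnifiedCrossModalEncoder
# from the sketch above. Losses and heads are placeholders.
import random
import torch
import torch.nn as nn

encoder = UnifiedCrossModalEncoder()
cls_head = nn.Linear(256, 1000)  # hypothetical ImageNeXt class count
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(cls_head.parameters()), lr=1e-4
)

def pretrain_step(inputs: dict, labels: torch.Tensor) -> float:
    # Train on a random modality subset each step so the encoder learns
    # to tolerate any combination (an assumed strategy, not the paper's).
    names = random.sample(list(inputs), k=random.randint(1, len(inputs)))
    feats = encoder({n: inputs[n] for n in names}).mean(dim=1)  # pooled tokens
    loss = nn.functional.cross_entropy(cls_head(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with toy tensors:
loss = pretrain_step(
    {"rgb": torch.randn(2, 3, 64, 64), "thermal": torch.randn(2, 1, 64, 64)},
    torch.randint(0, 1000, (2,)),
)

# Fine-tuning: keep the pretrained encoder, swap in a segmentation head
# (per-token logits would be reshaped/upsampled to the pixel grid downstream).
seg_head = nn.Linear(256, 40)  # e.g. 40 classes for NYU Depth v2
```

Because the encoder never assumes a fixed input set, the same pretrained weights can in principle be fine-tuned on RGB-D, RGB-thermal, or RGB-event benchmarks without architectural changes, which is the flexibility the abstract claims.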