StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation

📅 2024-08-02
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
🤖 AI Summary
To address rigid modality coupling and parameter redundancy in multimodal semantic segmentation, this paper proposes a lightweight, general-purpose encoder-level fusion framework. Methodologically, it (1) eliminates dedicated fusion modules by repurposing large-scale vision foundation models (e.g., ViT, DINOv2) as shared “encode-and-fuse” backbones; (2) introduces MultiAdapter—a novel multi-directional adapter enabling cross-modal and cross-scale information propagation among pretrained encoders; and (3) supports arbitrary combinations of visual modalities (e.g., RGB, infrared, depth) as input. With minimal additional parameters, the approach achieves state-of-the-art performance on four mainstream multimodal segmentation benchmarks. Moreover, it exhibits strong generalizability—seamlessly integrating with existing fusion modules without architectural modification. The framework thus advances efficient, modular, and scalable multimodal representation learning for semantic segmentation.
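
The core mechanism described above is an adapter that lets each modality's otherwise frozen pretrained encoder stream exchange information with the others at every encoding stage. The sketch below illustrates that idea in PyTorch; the class name, bottleneck width, and the residual-sum mixing rule are illustrative assumptions, not the paper's exact MultiAdapter design.

```python
# A minimal sketch of the cross-modal adapter idea described above, not the
# authors' released MultiAdapter. The class name, bottleneck width, and the
# residual-sum mixing rule are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAdapter(nn.Module):
    """Bottleneck adapter that lets each modality stream see the others.

    Each modality's stage features are down-projected, the other modalities'
    hidden features are summed in, and the result is projected back and added
    as a residual to the original features.
    """

    def __init__(self, dim: int, bottleneck: int = 64, num_modalities: int = 2):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(dim, bottleneck) for _ in range(num_modalities))
        self.up = nn.ModuleList(nn.Linear(bottleneck, dim) for _ in range(num_modalities))
        self.act = nn.GELU()

    def forward(self, feats: list[torch.Tensor]) -> list[torch.Tensor]:
        # feats[m]: (B, N, dim) tokens from modality m at the current encoder stage
        hidden = [self.act(down(f)) for down, f in zip(self.down, feats)]
        out = []
        for m, f in enumerate(feats):
            # multi-directional flow: every other modality contributes to modality m
            others = torch.stack([h for k, h in enumerate(hidden) if k != m]).sum(dim=0)
            out.append(f + self.up[m](others))
        return out
```

In a StitchFusion-style setup, only these adapter parameters would be trained while the shared pretrained encoder weights stay frozen, which is consistent with the paper's claim of minimal additional parameters.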

📝 Abstract
Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion and accommodates any combination of visual modality inputs. Specifically, our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional adapter module (MultiAdapter) that enables cross-modal information transfer during encoding; by propagating multi-scale information across the pre-trained encoders, MultiAdapter allows StitchFusion to integrate multi-modal visual information as it encodes. Extensive comparative experiments demonstrate that our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters. Furthermore, the experimental integration of MultiAdapter with existing Feature Fusion Modules (FFMs) highlights their complementary nature. Our code is available at StitchFusion_repo.
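
As a rough picture of how such adapters could be stitched between the stages of a shared pretrained encoder so that any combination of modalities can be fed in, consider the sketch below. The stage split, freezing scheme, and the missing segmentation head are simplifications under stated assumptions, not the released implementation.

```python
# Illustrative wiring only: adapters of the kind sketched above, interleaved
# between the stages of a frozen, shared pretrained encoder, one stream per
# input modality. Stage split and freezing scheme are assumptions.
import torch
import torch.nn as nn


class StitchedEncoder(nn.Module):
    def __init__(self, stages: nn.ModuleList, adapters: nn.ModuleList):
        super().__init__()
        self.stages = stages      # shared pretrained encoder stages (frozen)
        self.adapters = adapters  # one trainable cross-modal adapter per stage
        for p in self.stages.parameters():
            p.requires_grad = False

    def forward(self, modal_feats: list[torch.Tensor]) -> list[list[torch.Tensor]]:
        # modal_feats: token features for an arbitrary set of input modalities
        pyramid = []
        feats = modal_feats
        for stage, adapter in zip(self.stages, self.adapters):
            feats = [stage(f) for f in feats]  # encode each modality with shared weights
            feats = adapter(feats)             # exchange information across modalities
            pyramid.append(feats)              # keep per-stage features for a multi-scale head
        return pyramid
```

Because the forward pass only assumes a list of per-modality feature tensors, the same wiring accepts RGB-depth, RGB-infrared, or three or more modalities without architectural changes, mirroring the "any visual modalities" claim. Existing FFMs could additionally be applied to the returned per-stage features, in line with the complementary use reported in the paper.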
Problem

Research questions and friction points this paper is trying to address.

Enhance multimodal semantic segmentation accuracy
Remove the input restrictions and parameter overhead of modality-specific fusion modules
Enable flexible multi-modal feature integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large-scale pre-trained models as encoders
Introduces MultiAdapter for cross-modal information transfer
Enables flexible multi-modal and multi-scale feature fusion
Bingyu Li
Department of Electronic Engineering and Information Science, University of Science and Technology of China
Da Zhang
School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University
Zhiyuan Zhao
Institute of Artificial Intelligence (TeleAI), China
Junyu Gao
School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China