Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient robustness of multi-sensor fusion in autonomous driving under challenging environmental conditions (e.g., rain, fog, low illumination), this paper proposes a condition-aware dynamic fusion framework. The method uses RGB images as cues for estimating the environmental condition and introduces a condition-token-driven dynamic fusion mechanism. It further incorporates modality-specific feature adapters that align heterogeneous sensors (e.g., LiDAR, camera) into a shared latent space for adaptive, weighted fusion. The framework combines a condition classification network, a pre-trained vision backbone, and a cross-modal alignment module. It ranks first on the MUSES benchmark with 59.7 PQ and 78.2 mIoU, and sets a new state of the art on DeLiVER. The source code is publicly available.
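The summary above describes the fusion mechanism in prose; below is a minimal PyTorch sketch of what such condition-token-driven fusion could look like. This is a hypothetical illustration, not the authors' released code: the module name, the number of condition classes, and the soft (probability-weighted) token lookup are all assumptions.

```python
# Minimal sketch of a condition-token-driven dynamic fusion head
# (hypothetical, not the authors' implementation). Assumes each modality's
# features were already aligned to a shared latent space of size `dim`.
import torch
import torch.nn as nn


class ConditionAwareFusion(nn.Module):
    def __init__(self, dim: int, num_modalities: int, num_conditions: int = 4):
        super().__init__()
        # Estimate the environmental condition (e.g. clear/rain/fog/night)
        # from a pooled RGB descriptor.
        self.condition_head = nn.Linear(dim, num_conditions)
        # One learnable embedding per condition serves as the Condition Token.
        self.condition_tokens = nn.Embedding(num_conditions, dim)
        # The token is mapped to one fusion weight per modality.
        self.weight_head = nn.Linear(dim, num_modalities)

    def forward(self, rgb_feat: torch.Tensor, modality_feats: list[torch.Tensor]):
        # rgb_feat: (B, dim); modality_feats: list of M tensors, each (B, dim).
        cond_probs = self.condition_head(rgb_feat).softmax(dim=-1)   # (B, C)
        token = cond_probs @ self.condition_tokens.weight            # (B, dim)
        weights = self.weight_head(token).softmax(dim=-1)            # (B, M)
        stacked = torch.stack(modality_feats, dim=1)                 # (B, M, dim)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)         # (B, dim)
        return fused, cond_probs
```

The point the sketch captures is that the fusion weights are a function of the estimated condition rather than fixed, so, for example, LiDAR can be up-weighted in fog, where the camera degrades.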

📝 Abstract
Leveraging multiple sensors is crucial for robust semantic perception in autonomous driving, as each sensor type has complementary strengths and weaknesses. However, existing sensor fusion methods often treat sensors uniformly across all conditions, leading to suboptimal performance. By contrast, we propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities. We further introduce modality-specific feature adapters to align diverse sensor inputs into a shared latent space, enabling efficient integration with a single, shared pre-trained backbone. By dynamically adapting sensor fusion based on the actual condition, our model significantly improves robustness and accuracy, especially in adverse-condition scenarios. CAFuser ranks first on the public MUSES benchmarks, achieving 59.7 PQ for multimodal panoptic segmentation and 78.2 mIoU for semantic segmentation, and also sets the new state of the art on DeLiVER. The source code is publicly available at: https://github.com/timbroed/CAFuser.
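The abstract's "modality-specific feature adapters" suggest a lightweight per-sensor projection ahead of the single shared backbone. Below is an illustrative sketch under that reading; the channel counts, module names, and chosen layers (a 1x1 projection plus a 3x3 refinement) are assumptions, not the paper's architecture.

```python
# Illustrative sketch of modality-specific feature adapters (hypothetical,
# not the paper's architecture). Each sensor gets its own lightweight
# projection into the shared latent space consumed by one shared backbone.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Per-modality projection into a shared `shared_dim`-channel space."""

    def __init__(self, in_channels: int, shared_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, shared_dim, kernel_size=1),            # channel alignment
            nn.GELU(),
            nn.Conv2d(shared_dim, shared_dim, kernel_size=3, padding=1),  # local refinement
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


# One adapter per sensor; input channel counts are illustrative only.
adapters = nn.ModuleDict({
    "rgb": ModalityAdapter(3, 256),
    "lidar": ModalityAdapter(1, 256),   # e.g. a projected range image
    "radar": ModalityAdapter(2, 256),   # e.g. intensity + Doppler channels
})

x_lidar = torch.randn(2, 1, 128, 256)  # dummy batch of LiDAR range images
z_lidar = adapters["lidar"](x_lidar)   # (2, 256, 128, 256), ready for the shared backbone
```

Once every modality lives in the same latent space, a single pre-trained backbone can process all sensors, which is what makes the integration efficient in the abstract's sense: one encoder is shared rather than one trained per sensor.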
Problem

Research questions and friction points this paper is trying to address.

Autonomous Vehicles
Sensor Fusion
Environmental Conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Sensor Fusion
Adaptive Feature Compatibility
Autonomous Driving Performance Enhancement
👥 Authors
Tim Brödermann
Computer Vision Laboratory, ETH Zurich, 8057 Zurich, Switzerland
Christos Sakaridis
Lecturer / Principal Investigator, ETH Zurich
Computer Vision, Artificial Intelligence, Machine Learning, Autonomous Cars
Yuqian Fu
Computer Vision Laboratory, ETH Zurich, 8057 Zurich, Switzerland; INSAIT, Sofia University St. Kliment Ohridski, Bulgaria
Luc Van Gool
Professor of Computer Vision, INSAIT, Sofia University; em. KU Leuven; em. ETH Zurich; Toyota Lab TRACE
computer vision, machine learning, AI, autonomous cars, cultural heritage