SWAN: World-Aware Adaptive Multimodal Networks for Runtime Variations

📅 2026-04-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses how deployed multimodal deep neural networks can use computational resources efficiently when modality quality varies, input complexity fluctuates, and resource constraints are strict. The paper proposes SWAN, a world-aware adaptive multimodal inference framework that it describes as the first to jointly account for modality quality, input complexity, and computational budget. A quality-aware controller allocates resources among modalities under a user-specified maximum budget, an adaptive gating module scales layer utilization to sample complexity, and a token dropping module masks semantically irrelevant multimodal features before detection. Evaluated on multi-object 3D detection for autonomous driving, the method reduces FLOPs by up to 49% with minimal performance degradation.
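The interplay between the controller and the gating module can be illustrated with a small sketch. It is not the paper's implementation: the proportional split, the names `allocate_budget` and `gate_layers`, and the per-layer cost model are all assumptions made for illustration. The controller divides a maximum budget among modalities by quality score, and the gate then runs fewer layers for simpler samples without ever exceeding the assigned budget.

```python
from typing import Dict, List


def allocate_budget(quality: Dict[str, float], max_budget: float) -> Dict[str, float]:
    """Hypothetical controller: split max_budget in proportion to quality scores."""
    total = sum(quality.values())
    if total <= 0:
        # No modality looks usable; fall back to an even split (an assumption).
        share = max_budget / len(quality)
        return {name: share for name in quality}
    return {name: max_budget * q / total for name, q in quality.items()}


def gate_layers(layer_costs: List[float], budget: float, complexity: float) -> List[bool]:
    """Hypothetical gate: run a prefix of layers sized by sample complexity in [0, 1],
    and never spend more than the budget assigned by the controller."""
    target = max(1, round(len(layer_costs) * complexity))
    active, spent = [], 0.0
    for i, cost in enumerate(layer_costs):
        keep = i < target and spent + cost <= budget
        if keep:
            spent += cost
        active.append(keep)
    return active


# Degraded camera, clean lidar, and a budget of 100 (arbitrary FLOP units).
budgets = allocate_budget({"camera": 0.2, "lidar": 0.8}, max_budget=100.0)
# An easy sample on the lidar branch: four layers costing 30 units each.
lidar_gates = gate_layers([30.0] * 4, budgets["lidar"], complexity=0.5)
```

In this sketch the budget acts as a hard ceiling while complexity only lowers utilization below it, which mirrors the ordering described in the summary: the budget is fixed first, and gating optimizes within it.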
📝 Abstract
Multimodal deep neural networks deployed in realistic environments must contend with runtime variations: changes in modality quality, overall input complexity, and available platform resources. Current networks struggle with such fluctuations: adaptive networks cannot adhere to a strict compute budget, controller-based networks neglect to consider input complexity, and statically provisioned networks fail at all of the above. Consequently, they do not extract maximum utility from the expended computational resources. We present SWAN (Sample and World-Aware Multimodal Network), the first adaptive multimodal network that accomplishes all three goals. SWAN employs a quality-aware controller to assign resources among modalities according to a variable user-specified maximum budget. Within this budget, an adaptive gating module further optimizes efficiency by scaling layer utilization according to sample complexity. For further gains, SWAN also employs a token dropping module that masks semantically irrelevant multimodal features before performing detections. We evaluate SWAN in the domain of autonomous driving with complex multi-object 3D detection, reducing FLOPs by up to 49% with minimal degradation.
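The token dropping step in the abstract can be pictured as a top-k mask over feature tokens. The sketch below is illustrative only: the abstract does not say how relevance is scored, so the scores are passed in as an input, and `drop_tokens` and `keep_ratio` are hypothetical names. Masked tokens are zeroed rather than removed, so the sequence length is unchanged.

```python
import math
from typing import List


def drop_tokens(tokens: List[List[float]], scores: List[float],
                keep_ratio: float) -> List[List[float]]:
    """Zero out the lowest-scoring tokens, keeping ceil(keep_ratio * n) of them.

    `scores` stands in for a relevance estimate; how SWAN computes it is not
    stated in the abstract, so it is supplied by the caller here.
    """
    n = len(tokens)
    k = max(1, math.ceil(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [tok if i in keep else [0.0] * len(tok) for i, tok in enumerate(tokens)]


# Four fused feature tokens; keep the two judged most relevant.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
relevance = [0.9, 0.1, 0.5, 0.2]
masked = drop_tokens(features, relevance, keep_ratio=0.5)
```

Zeroing a token does not by itself save computation; the savings in a real system would come from a detection head that skips masked positions, which this sketch does not model.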
Problem

Research questions and friction points this paper is trying to address.

runtime variations
multimodal networks
compute budget
input complexity
adaptive inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive multimodal networks
runtime adaptation
quality-aware controller
adaptive gating
token dropping