🤖 AI Summary
To address the efficiency-accuracy trade-off caused by fluctuating compute resources and degraded multimodal input quality (e.g., noise corruption) in real-world deployments, this paper proposes a layer-granular, cross-modal depth redistribution mechanism that yields a multimodal network with per-layer adaptivity. The method jointly optimizes gated layer selection, modality-quality-aware scoring, and resource-aware gradient scheduling, augmented by a lightweight feature recalibration module, so that the active depth of each modality can be adjusted in a coordinated way at run time. Freed from the constraints of a static architecture, the framework matches state-of-the-art (SOTA) accuracy while reducing floating-point operations by up to 75%, which improves robustness and energy efficiency on heterogeneous edge devices under dynamic workloads and noisy inputs.
📝 Abstract
Multimodal deep learning systems are deployed in dynamic scenarios because of the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating input quality (from sensor feed corruption, environmental noise, etc.). Current multimodal systems employ static resource provisioning and cannot easily adapt when compute resources change over time. Additionally, because they process sensor data with fixed feature extractors, they are ill-equipped to handle variations in modality quality. Consequently, uninformative modalities, such as those with high noise, needlessly consume resources that would be better allocated to other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network that tackles both challenges: it adjusts the total number of active layers across all modalities to meet compute resource constraints, and it continually reallocates layers across input modalities according to their quality. Our evaluations show that ADMN can match the accuracy of state-of-the-art networks while reducing their floating-point operations by up to 75%.
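The two behaviors described above (a total layer budget set by available compute, and a split of that budget across modalities driven by input quality) can be illustrated with a small sketch. This is not ADMN's controller, which the abstract does not specify. It is a hand-written greedy heuristic under stated assumptions: the function name `allocate_layers`, the scalar quality scores, and the `min_depth`/`max_depth` bounds are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def allocate_layers(
    quality: Dict[str, float],
    budget: int,
    max_depth: int,
    min_depth: int = 1,
) -> Dict[str, int]:
    """Split a total layer budget across modalities, favoring higher quality.

    Illustrative heuristic only (assumption): quality scores are
    non-negative floats, and every modality has the same depth bounds.
    """
    names = list(quality)
    if not min_depth * len(names) <= budget <= max_depth * len(names):
        raise ValueError("budget is infeasible for the given depth bounds")

    # Every modality starts at the minimum depth.
    alloc = {name: min_depth for name in names}
    remaining = budget - min_depth * len(names)

    # Hand out the remaining layers one at a time. Each layer goes to the
    # modality with the highest quality per (already allocated + 1) layer,
    # so noisy modalities receive few layers and no modality exceeds max_depth.
    for _ in range(remaining):
        candidates = [n for n in names if alloc[n] < max_depth]
        best = max(candidates, key=lambda n: quality[n] / (alloc[n] + 1))
        alloc[best] += 1
    return alloc


# A clean RGB stream and a noisy depth stream sharing 8 active layers.
print(allocate_layers({"rgb": 0.9, "depth": 0.1}, budget=8, max_depth=12))
# -> {'rgb': 7, 'depth': 1}
```

Calling the function again with a smaller `budget` models a drop in available compute, and calling it with updated quality scores models a change in sensor conditions. In both cases the total number of active layers always equals the budget.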