🤖 AI Summary
This work addresses high energy consumption and limited battery life in edge-based multimodal healthcare monitoring, where continuous acquisition of ECG, PPG, EMG, and IMU signals incurs significant power overhead and existing approaches fail to exploit temporal redundancy. The authors propose an Adaptive Multimodal Intelligence (AMI) framework that, for the first time, enables end-to-end joint optimization of sensing decisions and inference. The framework combines a learnable Sigma-Delta sampling mechanism that dynamically skips redundant samples with a confidence-based modality controller that activates only the sensors it needs. It integrates differentiable Gumbel-Sigmoid gating, unimodal foundation model encoders, and a cross-modal Transformer to achieve hardware-aware sparse sensing and efficient inference. Evaluated on the MHEALTH, HMC Sleep, and WESAD datasets, the method reduces average sensor usage by 48.8% while improving classification accuracy by 1.9% over state-of-the-art approaches.
📝 Abstract
Edge-based multimodal medical monitoring requires models that balance diagnostic accuracy with severe energy constraints. Continuous acquisition of ECG, PPG, EMG, and IMU streams rapidly drains wearable batteries, often limiting operation to under 10 hours, while existing systems overlook the high temporal redundancy present in physiological signals. We introduce Adaptive Multimodal Intelligence (AMI), an end-to-end framework that jointly learns when to sense and how to infer. AMI integrates three components: (1) a lightweight Agentic Modality Controller that uses differentiable Gumbel-Sigmoid gating to dynamically select active sensors based on model confidence and task relevance; (2) a Learned Sigma-Delta Sensing module that applies patch-wise Sigma-Delta operations with learnable thresholds to skip temporally redundant samples; and (3) a Foundation-backed Multimodal Prediction Model built on unimodal foundation encoders and a cross-modal transformer with temporal context, enabling robust fusion even under gated or missing inputs. These components are trained jointly via a multi-objective loss combining classification accuracy, sparsity regularization, cross-modal alignment, and predictive coding. AMI is hardware-aware: it supports dynamic computation graphs and masked operations, which yields real energy and latency savings. Across the MHEALTH, HMC Sleep, and WESAD datasets, it reduces sensor usage by 48.8% while improving accuracy over the state of the art by 1.9% on average.
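To make the two gating ideas in the abstract concrete, here is a minimal forward-only sketch in plain Python. It is not the authors' implementation. The function names (`gumbel_sigmoid`, `select_modalities`, `delta_patch_mask`), the 0.5 hard-gate cutoff, the mean-absolute-change criterion, and the fixed threshold are all assumptions made for illustration. In AMI the thresholds are learned and the gates are trained through an autodiff framework, and the patch rule below is a simplified delta comparison against the last transmitted patch, not a full Sigma-Delta scheme.

```python
import math
import random


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_sigmoid(logit, tau=1.0, rng=random):
    """Relaxed Bernoulli sample in (0, 1) for one gate logit.

    Logistic noise (the difference of two Gumbel samples) is added to the
    logit, and the temperature tau controls how close the output is to 0/1.
    """
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)  # keep log() finite
    noise = math.log(u) - math.log(1.0 - u)
    return _sigmoid((logit + noise) / tau)


def select_modalities(logits, tau=1.0, rng=random):
    """Map hypothetical per-sensor logits to on/off decisions.

    Assumption: a sensor is active when its soft gate exceeds 0.5. During
    training, a straight-through estimator would typically pass gradients
    through the soft value; that part is omitted here.
    """
    return {name: gumbel_sigmoid(l, tau, rng) > 0.5 for name, l in logits.items()}


def delta_patch_mask(signal, patch_len, threshold):
    """Return one keep/skip flag per non-overlapping patch.

    A patch is kept when its mean absolute change relative to the last
    kept patch exceeds the threshold (fixed here, learnable in AMI).
    The first patch is always kept; a trailing partial patch is ignored.
    """
    keep = []
    reference = None
    for start in range(0, len(signal) - patch_len + 1, patch_len):
        patch = signal[start:start + patch_len]
        if reference is None:
            change = float("inf")
        else:
            change = sum(abs(a - b) for a, b in zip(patch, reference)) / patch_len
        if change > threshold:
            keep.append(True)
            reference = patch  # only kept patches update the reference
        else:
            keep.append(False)
    return keep
```

Because the reference patch is updated only when a patch is kept, slow drift accumulates until it crosses the threshold, so a redundant stretch is skipped without losing gradual changes. Counting the `True` entries in the mask, together with the active sensors from `select_modalities`, gives a rough proxy for the sensor-usage figure the abstract reports.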