🤖 AI Summary
This paper targets the poor robustness of real-time inference in distributed multimodal systems under uncertain communication latency, and proposes a neuro-inspired, non-blocking inference paradigm. Rather than relying on a reference modality, the approach uses a latency-aware framework that combines online latency estimation, adaptive temporal windowing for dynamic ensembling, and asynchronous multimodal fusion, which allows fine-grained control of the accuracy-latency trade-off across heterogeneous data streams. The core innovation is a learnable temporal integration window that adjusts fusion timing to each modality's real-time latency distribution, making the system more resilient to network fluctuations. Evaluated on audio-visual event localization, the method achieves a 5.2% improvement in mean average precision (mAP) while maintaining low end-to-end latency, outperforming state-of-the-art approaches in both inference stability and cross-scenario generalization.
📝 Abstract
Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams. Uncertain communication delays across data streams challenge the temporal flow of the inference process. State-of-the-art (SotA) non-blocking inference methods rely on a reference-modality paradigm, requiring one modality input to be fully received before processing, while depending on costly offline profiling. We propose a novel, neuro-inspired non-blocking inference paradigm that primarily employs adaptive temporal windows of integration (TWIs) to dynamically adjust to stochastic delay patterns across heterogeneous streams while relaxing the reference-modality requirement. Our communication-delay-aware framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff. Experiments on the audio-visual event localization (AVEL) task demonstrate superior adaptability to network dynamics compared to SotA approaches.
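To make the idea of an adaptive temporal window of integration more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each stream's delay is tracked with an exponentially weighted mean and variance, that the window is set to the largest per-stream budget (mean plus `k` standard deviations) capped by a hard deadline, and that fusion is a plain average of class scores from whichever modalities arrived in time. The names `DelayEstimator`, `integration_window`, `fuse`, `alpha`, `k`, and `cap` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class DelayEstimator:
    """Online delay statistics for one stream (exponentially weighted).

    Illustrative assumption: the abstract does not specify an estimator.
    """
    alpha: float = 0.2   # smoothing factor (hypothetical default)
    mean: float = 0.0
    var: float = 0.0
    seen: bool = False

    def update(self, delay: float) -> None:
        """Fold one observed delay into the running mean and variance."""
        if not self.seen:
            self.mean, self.var, self.seen = delay, 0.0, True
            return
        diff = delay - self.mean
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1.0 - self.alpha) * (self.var + diff * incr)

    def budget(self, k: float) -> float:
        """Wait budget for this stream: mean delay plus k standard deviations."""
        return self.mean + k * math.sqrt(self.var)


def integration_window(estimators, k, cap):
    """Window length: largest per-stream budget, never above the deadline cap.

    No stream is treated as a reference; all streams are handled alike.
    """
    return min(cap, max(e.budget(k) for e in estimators.values()))


def fuse(arrivals, window):
    """Average the class scores of modalities that arrived within the window.

    arrivals maps modality name -> (delay, list of class scores).
    Returns None when nothing arrived in time.
    """
    on_time = [scores for delay, scores in arrivals.values() if delay <= window]
    if not on_time:
        return None
    return [sum(col) / len(on_time) for col in zip(*on_time)]
```

In this sketch, `k` is the accuracy-latency knob: a larger `k` waits longer, so more modalities are likely to contribute to the fused prediction, while a smaller `k` returns a result sooner from fewer inputs. A real system would presumably replace the score average with a learned fusion model and would only observe a late packet's delay once it finally arrives.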