🤖 AI Summary
This work addresses the challenges of multimodal asynchrony, high latency, and poor generalization that arise when robots must respond to dynamic, heterogeneous multimedia commands. To overcome these limitations, the authors propose a modality-agnostic, lightweight streaming architecture that aligns asynchronous audio-visual instructions into a unified latent space and leverages meta-reinforcement learning to model diverse instructions as navigable goal distributions. The approach achieves robust real-time responses to noisy inputs with negligible inference overhead and substantially improves sample efficiency. Experimental results on multi-arm manipulation tasks demonstrate that the method maintains real-time control while significantly outperforming baseline approaches in noise robustness and generalization capability.
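The summary gives no implementation details, so the following is only a minimal PyTorch sketch of one way such modality-agnostic streaming alignment could look: each audio or visual feature chunk, whenever it arrives, is projected into a shared latent space and folded into a single recurrent instruction state. The class name, dimensions, and GRU-based fusion are all assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn as nn

class ModalityAgnosticAligner(nn.Module):
    """Hypothetical sketch: project asynchronous audio/visual features
    into one shared latent space and fold each arriving chunk into a
    single streaming instruction state."""

    def __init__(self, audio_dim=128, visual_dim=512, latent_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.visual_proj = nn.Linear(visual_dim, latent_dim)
        self.fuse = nn.GRUCell(latent_dim, latent_dim)  # streaming fusion state

    def forward(self, feat, modality, state):
        # Chunks arrive whenever a stream produces them; `modality` tags the source.
        proj = self.audio_proj if modality == "audio" else self.visual_proj
        z = torch.tanh(proj(feat))   # map the chunk into the unified latent space
        return self.fuse(z, state)   # update the shared instruction latent

# Usage: interleaved, out-of-order chunks all update the same state,
# so control never has to block on the slower stream.
aligner = ModalityAgnosticAligner()
state = torch.zeros(1, 256)
state = aligner(torch.randn(1, 512), "visual", state)
state = aligner(torch.randn(1, 128), "audio", state)
```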
📝 Abstract
Interpreting dynamic, heterogeneous multimedia commands with real-time responsiveness is critical for Human-Robot Interaction. We present VA-FastNavi-MARL, a framework that aligns asynchronous audio-visual inputs into a unified latent representation. By treating diverse instructions as a distribution of navigable goals via Meta-Reinforcement Learning, our method enables rapid adaptation to unseen directives with negligible inference overhead. Unlike approaches bottlenecked by heavy sensory processing, our modality-agnostic stream ensures seamless, low-latency control. Validation on a multi-arm workspace confirms that VA-FastNavi-MARL significantly outperforms baselines in sample efficiency and maintains robust, real-time execution even under noisy multimedia streams.
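As an illustration of "treating diverse instructions as a distribution of navigable goals," here is a hedged MAML-style sketch: a goal is sampled per instruction, and the policy takes a few inner-loop gradient steps to specialize to it. The `GoalPolicy` class, the stand-in regression loss, and all hyperparameters are hypothetical; the abstract does not specify the paper's actual meta-RL objective.

```python
import torch
import torch.nn as nn

class GoalPolicy(nn.Module):
    """Tiny goal-conditioned policy, for illustration only."""
    def __init__(self, obs_dim=8, goal_dim=3, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 64),
            nn.Tanh(),
            nn.Linear(64, act_dim),
        )

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

def inner_adapt(policy, goal, obs, target_act, inner_lr=0.1, steps=3):
    """Specialize the policy to one sampled goal with a few gradient
    steps. First-order only; a full MAML outer loop would retain the
    graph and aggregate post-adaptation losses for the meta-update."""
    params = list(policy.parameters())
    for _ in range(steps):
        loss = ((policy(obs, goal) - target_act) ** 2).mean()  # stand-in loss
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= inner_lr * g
    return policy

# Instructions as a goal distribution: sample a goal, adapt, then act.
policy = GoalPolicy()
goal = torch.randn(1, 3)                   # one sampled "navigable goal"
obs, target = torch.randn(1, 8), torch.randn(1, 4)
adapted = inner_adapt(policy, goal, obs, target)
action = adapted(obs, goal)
```

Under this reading, "rapid adaptation to unseen directives with negligible inference overhead" would correspond to the inner loop being cheap relative to the heavy sensory processing the abstract contrasts against, though that mapping is an inference, not a claim from the paper.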