🤖 AI Summary
This work addresses two obstacles in multimodal embodied navigation: well-aligned multimodal data is hard to obtain, and monolithic models are hard to train because rich multimodal inputs induce complex representations and enlarge the policy space. The authors propose CRONA, a scalable cross-modal navigation framework built on multi-agent reinforcement learning, in which lightweight, modality-specific agents collaborate. CRONA improves collaboration through control-relevant auxiliary beliefs and a centralized multimodal critic with access to global state. On visual-acoustic navigation tasks, multi-agent methods significantly outperform single-agent baselines in both performance and efficiency. The experiments further indicate that homogeneous collaboration suffices for short-range navigation under salient cues, that heterogeneous collaboration among complementary modalities is generally efficient and effective, and that large, complex environments call for richer multimodal perception and greater model capacity.
📝 Abstract
Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging, as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose CRONA, a Multi-Agent Reinforcement Learning (MARL) framework for Cross-Modal Navigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.