🤖 AI Summary
This work addresses the challenge of jointly learning spatial patterns and planning efficient trajectories in multi-agent collaborative exploration and service tasks within unknown environments, where existing approaches often suffer from limited sample efficiency or poor policy adaptability. The authors propose a hybrid belief-based reinforcement learning framework that first constructs a shared spatial belief using a Log-Gaussian Cox process and employs a Pathwise Mutual Information planner to generate information-driven exploration trajectories. Control is then transferred to Soft Actor-Critic agents, warm-started via a novel dual-channel knowledge transfer mechanism. By integrating Bayesian spatial modeling with deep reinforcement learning and incorporating a variance-normalized overlap penalty, the method achieves a 10.8% higher cumulative reward and 38% faster convergence than baseline approaches in multi-UAV wireless service tasks, with ablation studies confirming that dual-channel knowledge transfer outperforms either single channel alone.
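One of the two transfer channels mentioned above, replay buffer seeding, can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not the paper's implementation: `ReplayBuffer` and `seed_buffer_from_demonstrations` are hypothetical names, and transitions are assumed to be standard `(state, action, reward, next_state, done)` tuples collected during the LGCP/PathMI exploration phase.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer; illustrative stand-in for a SAC buffer."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def seed_buffer_from_demonstrations(buffer, demo_trajectories):
    """Push exploration-phase transitions into the SAC replay buffer so
    the critic sees informative spatial experience from the first
    gradient step, rather than starting from random rollouts."""
    for trajectory in demo_trajectories:
        for transition in trajectory:  # (s, a, r, s', done)
            buffer.add(transition)
    return buffer
```

The other channel, belief state initialization, would analogously copy the LGCP posterior (mean and variance maps) into the agent's observation at handover.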
📝 Abstract
Coordinating multiple autonomous agents to explore and serve spatially heterogeneous demand requires jointly learning unknown spatial patterns and planning trajectories that maximize task performance. Pure model-based approaches provide structured uncertainty estimates but lack adaptive policy learning, while deep reinforcement learning often suffers from poor sample efficiency when spatial priors are absent. This paper presents a hybrid belief-reinforcement learning (HBRL) framework to address this gap. In the first phase, agents construct spatial beliefs using a Log-Gaussian Cox Process (LGCP) and execute information-driven trajectories guided by a Pathwise Mutual Information (PathMI) planner with multi-step lookahead. In the second phase, trajectory control is transferred to a Soft Actor-Critic (SAC) agent, warm-started through dual-channel knowledge transfer: belief state initialization supplies spatial uncertainty, and replay buffer seeding provides demonstration trajectories generated during LGCP exploration. A variance-normalized overlap penalty enables coordinated coverage through a shared belief state, permitting cooperative sensing in high-uncertainty regions while discouraging redundant coverage in well-explored areas. The framework is evaluated on a multi-UAV wireless service provisioning task. Results show 10.8% higher cumulative reward and 38% faster convergence relative to baselines, with ablation studies confirming that dual-channel transfer outperforms either channel alone.
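The variance-normalized overlap penalty described in the abstract can be sketched as follows. This is a minimal illustration of the idea under stated assumptions, not the paper's actual formula: `coverage_counts` (agents currently sensing each cell) and `belief_var` (LGCP posterior variance per cell) are assumed array inputs, and the normalization scheme here is one plausible choice.

```python
import numpy as np

def overlap_penalty(coverage_counts, belief_var, eps=1e-6):
    """Penalize redundant coverage, scaled down where belief variance
    is high: cooperative sensing in uncertain cells is tolerated, while
    overlapping coverage of well-explored cells is discouraged."""
    # Number of redundant agents per cell (a single agent is no overlap).
    overlap = np.maximum(coverage_counts - 1, 0)
    # Normalize variance to [0, 1]; eps guards against division by zero.
    norm_var = belief_var / (belief_var.max() + eps)
    # Low variance (well-explored) -> full penalty weight; high variance -> near zero.
    weights = 1.0 - norm_var
    return float(np.sum(weights * overlap))
```

With two agents on the same cell, the penalty is close to 1 when that cell's variance is low and close to 0 when it dominates the variance map, which matches the cooperate-then-spread behavior the abstract describes.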