🤖 AI Summary
We address the bandwidth allocation problem in dynamic wireless networks characterized by a variable user population, non-stationary channels, and heterogeneous QoS and resource constraints. To this end, we propose a scalable and transferable deep reinforcement learning scheduling framework. Our core contribution is a Hybrid-task Meta-Learning (HML) mechanism integrated with Graph Neural Networks (GNNs) that model user-resource topological relationships, enabling policy generalization regardless of the number of users. The framework further supports rapid online fine-tuning using only a few samples from unseen scenarios. Compared with existing benchmarks, experiments demonstrate an 8.79% improvement in initial performance and a 73% increase in sampling efficiency. After fine-tuning, the policy approaches optimal performance while significantly reducing inference complexity. This work establishes a paradigm for intelligent resource scheduling that is efficient, robust, and deployable in dynamic wireless environments.
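The scalability claim rests on a standard GNN property: layer parameters are sized by feature dimension, not by node count, so one set of weights serves any number of users. A minimal numpy sketch of a single message-passing layer illustrates this (feature sizes, aggregation rule, and weights are all illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# One message-passing layer: parameters are per-feature, not per-user,
# so the same fixed weights serve any number of users.
F_IN, F_OUT = 8, 8                               # assumed feature sizes
W_self = rng.normal(size=(F_IN, F_OUT)) * 0.1    # transforms a user's own features
W_msg = rng.normal(size=(F_IN, F_OUT)) * 0.1     # transforms aggregated neighbor features

def gnn_layer(X):
    """X: (num_users, F_IN) node features; mean-aggregate over the other users."""
    n = X.shape[0]
    agg = (X.sum(axis=0, keepdims=True) - X) / max(n - 1, 1)  # neighbor mean
    return np.maximum(X @ W_self + agg @ W_msg, 0.0)          # ReLU

# The same weights handle 4 users or 40 users without retraining:
out_small = gnn_layer(rng.normal(size=(4, F_IN)))
out_large = gnn_layer(rng.normal(size=(40, F_IN)))
print(out_small.shape, out_large.shape)  # (4, 8) (40, 8)
```

Because the aggregation is a symmetric mean, the layer is also permutation-equivariant: relabeling users permutes the output rows identically, which is what makes the learned policy reusable as users join or leave.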
📝 Abstract
In this paper, we develop a deep learning-based bandwidth allocation policy that is: 1) scalable with the number of users and 2) transferable to different communication scenarios, such as non-stationary wireless channels, different quality-of-service (QoS) requirements, and dynamically available resources. To support scalability, the bandwidth allocation policy is represented by a graph neural network (GNN), with which the number of training parameters does not change with the number of users. To enable the generalization of the GNN, we develop a hybrid-task meta-learning (HML) algorithm that trains the initial parameters of the GNN with different communication scenarios during meta-training. Then, during meta-testing, a few samples are used to fine-tune the GNN on unseen communication scenarios. Simulation results demonstrate that our HML approach improves the initial performance by $8.79\%$ and the sampling efficiency by $73\%$, compared with existing benchmarks. After fine-tuning, our near-optimal GNN-based policy achieves nearly the same reward as the optimal policy obtained by iterative optimization, with much lower inference complexity.
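The meta-train/meta-test workflow described above follows the usual optimization-based meta-learning pattern: learn an initialization across many scenarios so that a few gradient steps adapt it to an unseen one. A toy first-order sketch on scalar quadratic "scenarios" conveys the mechanics (step sizes, task distribution, and the loss are illustrative stand-ins for the paper's HML procedure, not its actual algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "scenario": fit theta to a per-task target under squared loss.
def loss_grad(theta, target):
    return 2.0 * (theta - target)  # d/dtheta of (theta - target)^2

alpha, beta = 0.1, 0.05            # inner / outer step sizes (assumed)
theta0 = np.zeros(3)               # meta-learned initialization

# Meta-training: one inner fine-tune step per sampled task, then a
# first-order outer update of the initialization.
for _ in range(500):
    target = rng.normal(size=3)
    adapted = theta0 - alpha * loss_grad(theta0, target)  # inner adaptation
    theta0 -= beta * loss_grad(adapted, target)           # outer update

# Meta-testing: one gradient step on an unseen task already helps.
unseen = np.array([1.0, -2.0, 0.5])
adapted = theta0 - alpha * loss_grad(theta0, unseen)
print(np.sum((adapted - unseen) ** 2) < np.sum((theta0 - unseen) ** 2))  # True
```

In the paper's setting, `theta0` would be the GNN's initial parameters and each "task" a communication scenario (channel statistics, QoS mix, available resources); the point of the sketch is only that the meta-learned initialization makes few-sample fine-tuning effective.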