🤖 AI Summary
To address poor generalization and limited adaptability to multiple source-destination (S-D) pairs in vessel path planning within confined, dynamic waterways, this paper proposes a goal-oriented reinforcement learning framework. Methodologically, it integrates AIS traffic data with ERA5 wind-field information to construct a dynamic hexagonal grid-based state representation; designs a target-conditioned multi-discrete action space; incorporates invalid-action masking and positive reward shaping; and employs a recurrent PPO algorithm for policy optimization. Empirical evaluation in the Gulf of St. Lawrence demonstrates that action masking significantly accelerates convergence and improves policy stability, while reward shaping reduces fuel consumption by 12.3% and voyage time by 8.7% without sacrificing path diversity. The core contribution is the first realization of adaptive maritime decision-making for multiple S-D pairs, jointly driven by large-scale traffic data and physical environmental constraints.
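The invalid-action masking mentioned above can be illustrated with a minimal sketch: logits for actions that would leave the navigable graph are set to negative infinity before the softmax, so the policy assigns them zero probability. The 6-direction hexagonal layout and 3 speed bins here are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def masked_logits(logits, valid_mask):
    """Set logits of invalid actions to -inf so softmax gives them zero probability."""
    return np.where(valid_mask, logits, -np.inf)

def softmax(x):
    # Subtract the max for numerical stability; exp(-inf) becomes exactly 0.
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical multi-discrete sub-space: 6 hexagonal neighbor directions.
direction_logits = np.array([0.2, 1.5, -0.3, 0.8, 0.0, 0.4])
# Suppose directions 2 and 4 would exit the AIS-derived traffic graph
# (e.g., land or shallow water), so they are masked out:
direction_mask = np.array([True, True, False, True, False, True])

probs = softmax(masked_logits(direction_logits, direction_mask))
# Invalid actions receive zero probability; the rest renormalize to 1.
```

In a multi-discrete setting, the same masking is applied independently to each sub-distribution (direction and speed), which is why it composes cleanly with PPO's categorical policy heads.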
📝 Abstract
Routing vessels through narrow and dynamic waterways is challenging due to changing environmental conditions and operational constraints. Existing vessel-routing studies typically fail to generalize across multiple origin-destination pairs and do not exploit large-scale, data-driven traffic graphs. In this paper, we propose a reinforcement learning solution for big maritime data that learns routes across multiple origin-destination pairs while adapting to different hexagonal grid resolutions. Agents learn to select direction and speed under continuous observations in a multi-discrete action space. A reward function balances fuel efficiency, travel time, wind resistance, and route diversity, using an Automatic Identification System (AIS)-derived traffic graph with ERA5 wind fields. The approach is demonstrated in the Gulf of St. Lawrence, one of the largest estuaries in the world. We evaluate configurations that combine Proximal Policy Optimization with recurrent networks, invalid-action masking, and exploration strategies. Our experiments demonstrate that action masking yields a clear improvement in policy performance and that supplementing penalty-only feedback with positive shaping rewards produces additional gains.
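The abstract's contrast between penalty-only feedback and positive shaping can be sketched as a per-step reward that combines penalties for fuel, time, and headwind resistance with a positive potential-based progress term. All weights and the function signature below are illustrative assumptions, not the paper's published reward.

```python
def step_reward(dist_to_goal_prev, dist_to_goal, fuel_used, step_time,
                headwind_speed, w_fuel=1.0, w_time=0.5, w_wind=0.2,
                w_progress=2.0):
    """Hypothetical shaped per-step reward: penalize fuel burn, elapsed
    time, and headwind exposure; add a positive bonus proportional to
    the reduction in distance to the goal (potential-based shaping)."""
    penalty = -(w_fuel * fuel_used
                + w_time * step_time
                + w_wind * max(headwind_speed, 0.0))  # tailwind not penalized
    progress = w_progress * (dist_to_goal_prev - dist_to_goal)  # > 0 when closing in
    return penalty + progress

# A step that moves 1 unit closer to the goal against a 4 m/s headwind:
r = step_reward(dist_to_goal_prev=10.0, dist_to_goal=9.0,
                fuel_used=0.3, step_time=0.5, headwind_speed=4.0)
```

With penalty-only feedback the agent receives negative reward everywhere, which can stall exploration; the positive progress term gives a dense gradient toward the destination without changing the optimal policy when the shaping is potential-based.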