Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses city-scale electric ride-hailing fleet operations under uncertain, spatially correlated travel demand and trip durations. It proposes a unified framework that jointly coordinates dispatching, repositioning, and charging decisions while strictly respecting charger and power-feeder capacity constraints. The problem is formulated as a semi-Markov decision process over a hexagonal grid that combines discrete actions with continuous charging power. A masked, temperature-annealed policy generates high-level intentions, which are projected at every decision step through a time-limited rolling mixed-integer linear program so that the executed actions incur zero constraint violations. The approach also applies distributionally robust reinforcement learning over Wasserstein-1 ambiguity sets, using a graph-aligned Mahalanobis ground metric to capture spatial dependencies. Evaluated on a simulator built from New York City taxi data, the method achieves an annualized net profit of \$1.22 million, compared with \$0.58–\$0.70 million for the Greedy, SAC, MAPPO, and MADDPG baselines, with no feeder-limit violations throughout the simulation.
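
As a rough illustration of masking combined with temperature annealing, the sketch below zeroes out infeasible intentions and sharpens the distribution as training proceeds. It is a minimal NumPy example under stated assumptions, not the paper's actor: the function names, the linear annealing schedule, and all default values are hypothetical.

```python
import numpy as np


def annealed_temperature(step, t_start=1.0, t_end=0.1, decay_steps=10_000):
    """Linearly anneal the softmax temperature (assumed schedule, not from the paper)."""
    frac = min(step / decay_steps, 1.0)
    return t_start + frac * (t_end - t_start)


def masked_policy(logits, mask, temperature):
    """Return action probabilities with infeasible actions forced to zero.

    logits: raw scores, one per high-level intention.
    mask: boolean array, True where the intention is currently feasible.
    temperature: positive scalar; lower values give a sharper distribution.
    """
    logits = np.asarray(logits, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be feasible")
    # Infeasible entries get -inf so that exp() maps them to exactly 0.
    scaled = np.where(mask, logits / temperature, -np.inf)
    # Subtract the largest feasible score for numerical stability.
    scaled = scaled - scaled[mask].max()
    weights = np.exp(scaled)
    return weights / weights.sum()


# Example: the second intention (e.g. "charge" when no port is free) is masked.
probs = masked_policy([1.0, 2.0, 3.0], [True, False, True], annealed_temperature(0))
```

Masking only removes intentions that are known to be infeasible before sampling. In the described method, the sampled intention is still projected through the rolling MILP, which is what enforces the state-of-charge, port, and feeder limits.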

📝 Abstract

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor--Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich--Rubinstein dual, a projected subgradient inner loop, and a primal--dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD--RSAC achieves the highest net profit, reaching \$1.22M, compared with \$0.58M--\$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.
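
To make the robust semi-MDP backup more concrete, the following sketch combines two standard ingredients: discounting by the action duration, and the Kantorovich-Rubinstein bound, which says that the worst-case expectation of an L-Lipschitz function over a Wasserstein-1 ball of radius ε is at least the nominal expectation minus ε·L. This is an illustrative simplification, not the paper's update: the abstract describes a projected subgradient inner loop and a primal-dual risk-budget update, both omitted here, and every name, the matrix `M`, and all numeric values are assumptions.

```python
import numpy as np


def mahalanobis(x, y, M):
    """Mahalanobis distance sqrt((x - y)^T M (x - y)) for a positive semidefinite M.

    In the paper the ground metric is graph-aligned; how M is built from the
    hex-grid graph is not specified in the abstract, so M is a free input here.
    """
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ M @ d))


def robust_semi_mdp_target(reward, tau, next_value, lipschitz, epsilon, gamma=0.99):
    """Conservative TD target for an action that lasts tau time steps.

    reward: reward accumulated while the action executes.
    tau: action duration; the next value is discounted by gamma ** tau.
    next_value: nominal estimate of the value at the next decision point.
    lipschitz: assumed Lipschitz constant of the value w.r.t. the ground metric.
    epsilon: radius of the Wasserstein-1 ambiguity set.
    """
    # Kantorovich-Rubinstein lower bound on the worst-case next value.
    worst_case_next = next_value - epsilon * lipschitz
    return reward + gamma ** tau * worst_case_next


# Example: a 2-step repositioning move with a small ambiguity radius.
target = robust_semi_mdp_target(reward=1.0, tau=2, next_value=10.0,
                                lipschitz=0.5, epsilon=1.0, gamma=0.5)
```

Setting `epsilon=0` recovers the ordinary semi-MDP target, which makes the robustness penalty easy to isolate when comparing against a non-robust SAC baseline.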

Problem

Research questions and friction points this paper is trying to address.

electric-vehicle ride-hailing

semi-Markov decision process

physical feasibility

spatially correlated demand

charging constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-Markov Decision Process

Feasibility-Guaranteed RL

Distributionally Robust Optimization