AI Summary
This work addresses the poor interpretability of existing reinforcement learning-based traffic signal control methods and the instability of fine-tuning large language models under sparse, delayed rewards. To overcome these challenges, the authors propose a reinforcement fine-tuning approach that integrates a reward-thresholding mechanism with uncertainty-aware regularization. This method effectively filters out weak learning signals, enhances decision consistency, and generates natural language explanations to improve transparency. Evaluated on the LibSignal benchmark using the LLaMA3-8B model, the approach reduces travel time by 75% and queue length by 67% compared to the pre-trained baseline. Notably, it also demonstrates strong zero-shot transferability across intersections, achieving a 17% reduction in travel time and a 39% decrease in queue length without additional fine-tuning.
Abstract
Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning-based TSC methods function as black boxes with limited interpretability. Although large language models (LLMs) can provide natural language reasoning, reinforcement finetuning for TSC remains unstable because feedback is sparse and delayed, while most actions produce only marginal changes in congestion metrics. We introduce OracleTSC, which stabilizes LLM-based TSC through two mechanisms: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental rewards, and (2) uncertainty regularization that maximizes the probability of the selected response to encourage consistent decisions across sampled outputs. Experiments on the LibSignal benchmark show that OracleTSC enables a compact LLaMA3-8B model to substantially improve traffic efficiency, achieving a 75% reduction in travel time and a 67% decrease in queue length compared with the pretrained baseline while preserving interpretability through natural language explanations. OracleTSC also demonstrates strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally different intersection with 17% lower travel time and 39% lower queue length without additional finetuning. These results suggest that uncertainty-aware reward shaping can improve the stability and effectiveness of reinforcement fine-tuning for TSC.
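The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names, the REINFORCE-style surrogate loss, the additive way the two terms are combined, and the `hurdle` and `beta` values. It is not the authors' implementation.

```python
from typing import Sequence


def hurdle_reward(env_reward: float, hurdle: float) -> float:
    """Reward hurdle: subtract a calibrated threshold from the environment reward.

    Rewards below the hurdle become negative, so marginal improvements
    no longer reinforce the sampled response.
    """
    return env_reward - hurdle


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log-probability of the selected response as the sum of its token log-probs."""
    return sum(token_log_probs)


def shaped_loss(
    env_reward: float,
    hurdle: float,
    token_log_probs: Sequence[float],
    beta: float = 0.1,  # hypothetical regularization weight
) -> float:
    """Assumed surrogate loss for one sampled response (to be minimized).

    First term: policy-gradient term weighted by the hurdle-shifted reward.
    Second term: uncertainty regularizer that raises the probability of the
    selected response, encouraging consistent decisions across samples.
    """
    log_p = sequence_log_prob(token_log_probs)
    shifted = hurdle_reward(env_reward, hurdle)
    return -(shifted * log_p) - beta * log_p


# Hypothetical numbers: a clear improvement versus a marginal one.
strong = hurdle_reward(env_reward=1.0, hurdle=0.25)
weak = hurdle_reward(env_reward=0.1, hurdle=0.25)
loss = shaped_loss(env_reward=1.0, hurdle=0.25, token_log_probs=[-0.5, -0.5], beta=0.5)
```

In this sketch a response with a marginal reward ends up with a negative shifted reward, which is one plausible reading of "filters weak learning signals". An actual fine-tuning setup would compute these quantities on batched tensors inside the training loop rather than on Python floats.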