OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the poor interpretability of existing reinforcement learning–based traffic signal control methods and the instability of fine-tuning large language models under sparse, delayed rewards. To overcome these challenges, the authors propose a reinforcement fine-tuning approach that integrates a reward-thresholding mechanism with uncertainty-aware regularization. This method effectively filters out weak learning signals, enhances decision consistency, and generates natural language explanations to improve transparency. Evaluated on the LibSignal benchmark using the LLaMA3-8B model, the approach reduces travel time by 75% and queue length by 67% compared to the pre-trained baseline. Notably, it also demonstrates strong zero-shot transferability across intersections, achieving a 17% reduction in travel time and a 39% decrease in queue length without additional fine-tuning.
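The reward-thresholding idea can be illustrated with a minimal sketch. It assumes the threshold is a single fixed constant and that rewards arrive as a plain list of floats; the function name, the sample rewards, and the threshold value are hypothetical and are not taken from the paper, which calibrates its threshold in a way this page does not describe.

```python
from typing import List


def apply_reward_hurdle(rewards: List[float], hurdle: float) -> List[float]:
    """Subtract a fixed hurdle from each environmental reward.

    Rewards below the hurdle become negative, so marginal improvements
    no longer count as positive learning signals.
    """
    return [r - hurdle for r in rewards]


# Hypothetical per-step rewards, e.g. the drop in queue length after an action.
rewards = [0.5, 3.0, 1.0, 6.0]
hurdle = 2.0  # assumed value for illustration only

shaped = apply_reward_hurdle(rewards, hurdle)
print(shaped)  # [-1.5, 1.0, -1.0, 4.0]
```

In this toy example, only the two actions whose raw reward exceeds the hurdle keep a positive signal after shaping.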
📝 Abstract
Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning-based TSC methods function as black boxes with limited interpretability. Although large language models (LLMs) can provide natural language reasoning, reinforcement fine-tuning for TSC remains unstable because feedback is sparse and delayed, while most actions produce only marginal changes in congestion metrics. We introduce OracleTSC, which stabilizes LLM-based TSC through two mechanisms: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental rewards, and (2) uncertainty regularization that maximizes the probability of the selected response to encourage consistent decisions across sampled outputs. Experiments on the LibSignal benchmark show that OracleTSC enables a compact LLaMA3-8B model to substantially improve traffic efficiency, achieving a 75% reduction in travel time and a 67% decrease in queue length compared with the pretrained baseline while preserving interpretability through natural language explanations. OracleTSC also demonstrates strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally different intersection with 17% lower travel time and 39% lower queue length without additional fine-tuning. These results suggest that uncertainty-aware reward shaping can improve the stability and effectiveness of reinforcement fine-tuning for TSC.
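The second mechanism, maximizing the probability of the selected response, can be sketched as a negative log-likelihood penalty. The sketch below is a simplification under stated assumptions: it treats the candidate responses as a small categorical distribution defined by logits, whereas the paper works with sampled LLM outputs. The function names, logits, and weight are hypothetical and do not come from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selected_response_penalty(logits: List[float], selected: int, weight: float) -> float:
    """Weighted negative log-probability of the selected response.

    Minimizing this term raises the probability of the selected response,
    which concentrates the distribution over candidate outputs.
    """
    probs = softmax(logits)
    return -weight * math.log(probs[selected])


# Hypothetical logits over four candidate responses (e.g. four signal phases).
flat = selected_response_penalty([0.0, 0.0, 0.0, 0.0], selected=0, weight=1.0)
peaked = selected_response_penalty([4.0, 0.0, 0.0, 0.0], selected=0, weight=1.0)

# The penalty is larger when the model is undecided than when it is confident.
print(flat > peaked)  # True
```

With uniform logits the penalty equals log(4), and it shrinks toward zero as probability mass concentrates on the selected response. How such a term is weighted against the reward signal in OracleTSC is not specified on this page.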
Problem

Research questions and friction points this paper is trying to address.

traffic signal control
reinforcement learning
interpretability
reward sparsity
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward hurdle
uncertainty regularization
traffic signal control
large language models
reinforcement fine-tuning