🤖 AI Summary
Reinforcement learning (RL) struggles to model semantic and social interactions in complex highway scenarios because expressive reward functions are difficult to design. Method: we propose an LLM-augmented reward-shaping framework that employs a compact, locally deployed large language model (<14B parameters) to semantically score state-action transitions during RL training, thereby refining the reward function, while retaining a lightweight RL policy for real-time, safety-critical decision-making at deployment. Contribution/Results: experiments show pure RL achieves 73-89% success rates; LLM-only attains up to 94% but suffers severe latency. Our hybrid approach balances performance (85-91%) against inference speed and, critically, uncovers a systematic conservative bias and model-dependent performance fluctuations in small-scale LLMs on navigation tasks. This work establishes a deployable paradigm for LLM-RL co-design under resource constraints.
📝 Abstract
Autonomous vehicle navigation in complex environments such as dense, fast-moving highways and merging scenarios remains an active area of research. Reinforcement learning (RL) is a standard approach, but a key limitation is its reliance on well-specified reward functions, which often fail to capture the full semantic and social complexity of diverse, out-of-distribution situations. As a result, a rapidly growing line of research explores using Large Language Models (LLMs) to replace or supplement RL for direct planning and control, given their ability to reason about rich semantic context. However, LLMs present significant drawbacks: they can be unstable in zero-shot safety-critical settings, produce inconsistent outputs, and often depend on expensive API calls with network latency. This motivates our investigation into whether small, locally deployed LLMs (<14B parameters) can meaningfully support autonomous highway driving through reward shaping rather than direct control. We present a case study comparing RL-only, LLM-only, and hybrid approaches, in which LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time. Our findings reveal that RL-only agents achieve moderate success rates (73-89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. Critically, despite explicit efficiency instructions, LLM-influenced approaches exhibit a systematic conservative bias with substantial model-dependent variability, highlighting important limitations of current small LLMs for safety-critical control tasks.
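The training-time shaping scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the transition fields (`gap_ahead_m`), the heuristic standing in for the LLM scorer, and the weight `beta` are all assumptions made for the example. The key structural point it shows is that the LLM score enters only the training reward, so no LLM call sits in the deployment control loop.

```python
# Hedged sketch (assumed names, not the paper's code): combining an
# environment reward with an LLM-derived semantic score during training.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Transition:
    state: dict
    action: str
    env_reward: float

def llm_semantic_score(t: Transition) -> float:
    """Stand-in for a small local LLM rating a transition in [-1, 1].
    A real system would prompt the model with a textual scene description;
    this hand-written heuristic is purely illustrative."""
    if t.action == "brake" and t.state.get("gap_ahead_m", 100.0) > 50.0:
        return -0.5  # unnecessary braking: penalize over-conservatism
    if t.action == "merge" and t.state.get("gap_ahead_m", 0.0) < 10.0:
        return -1.0  # merging into a tight gap: strongly penalize
    return 0.3       # otherwise mildly reward progress-compatible actions

def shaped_reward(t: Transition,
                  score_fn: Callable[[Transition], float],
                  beta: float = 0.2) -> float:
    """Training-time reward: environment reward plus weighted LLM score.
    At deployment the learned policy acts alone and score_fn is never
    called, so LLM latency does not affect real-time control."""
    return t.env_reward + beta * score_fn(t)

t = Transition(state={"gap_ahead_m": 80.0}, action="brake", env_reward=1.0)
print(shaped_reward(t, llm_semantic_score))  # → 0.9 (1.0 + 0.2 * -0.5)
```

In practice the weight on the semantic score trades off how strongly the LLM's preferences bend the policy; the conservative bias reported in the results suggests this weight, and the scoring prompt itself, need careful tuning.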