An Automated Reinforcement Learning Reward Design Framework with Large Language Model for Cooperative Platoon Coordination

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manually designing reward functions for cooperative platoon coordination is time-consuming, poorly generalizable, and struggles to balance multiple objectives. Method: This paper formally introduces the Platoon Coordination Reward Design Problem (PCRDP) and proposes an automated reward design framework powered by large language models (LLMs). The framework comprises: (1) an Analysis and Initial Reward (AIR) module that mitigates LLM code hallucination via environment code parsing and constraint-guided prompting; (2) an iterative optimization mechanism integrating chain-of-thought reasoning with evolutionary algorithms to balance exploration diversity and convergence stability; and (3) end-to-end symbolic reward function generation from task specifications and simulation environments. Contribution/Results: Evaluated across six complex traffic scenarios in the Yangtze River Delta, LLM-generated reward functions improve reinforcement learning agent performance by 10% on average, outperforming all manually designed baselines.
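The summary's generate-train-feedback cycle can be sketched as a small outer loop. This is a hypothetical illustration, not the paper's implementation: the callables `llm_propose_reward` and `train_agent`, and the feedback format, are all assumed names standing in for the AIR-initialized prompting and RL training stages.

```python
# Hypothetical sketch of the PCRD-style outer loop: an LLM proposes candidate
# reward functions, an RL agent is trained with each, and training feedback
# drives the next round of proposals. All names are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    source: str                      # reward-function code proposed by the LLM
    fitness: float = float("-inf")   # task metric after RL training

def pcrd_loop(llm_propose_reward: Callable[[str], str],
              train_agent: Callable[[str], float],
              env_description: str,
              rounds: int = 5,
              pop_size: int = 4) -> Candidate:
    """Return the best reward-function candidate found across all rounds."""
    feedback = env_description       # AIR-style prompt: env code + task spec
    best = Candidate(source="", fitness=float("-inf"))
    for _ in range(rounds):
        # 1) The LLM generates a small population of candidate reward functions.
        population = [Candidate(llm_propose_reward(feedback))
                      for _ in range(pop_size)]
        # 2) Each candidate is scored by training an RL agent with it.
        for cand in population:
            cand.fitness = train_agent(cand.source)
        # 3) Keep the best candidate and fold its result into the next prompt.
        round_best = max(population, key=lambda c: c.fitness)
        if round_best.fitness > best.fitness:
            best = round_best
        feedback = (f"{env_description}\n"
                    f"Best so far ({best.fitness:.2f}):\n{best.source}")
    return best
```

In the paper, step 2 is the expensive part (a full RL training run per candidate), which is why the iterative mechanism tries to keep the population small while preserving exploration diversity.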

📝 Abstract
Reinforcement Learning (RL) has demonstrated excellent decision-making potential in platoon coordination problems. However, due to the variability of coordination goals, the complexity of the decision problem, and the time cost of trial-and-error in manual design, finding a well-performing reward function to guide RL training on complex platoon coordination problems remains challenging. In this paper, we formally define the Platoon Coordination Reward Design Problem (PCRDP), extending the RL-based cooperative platoon coordination problem to incorporate automated reward function generation. To address PCRDP, we propose a Large Language Model (LLM)-based Platoon coordination Reward Design (PCRD) framework, which systematically automates reward function discovery through LLM-driven initialization and iterative optimization. In this method, the LLM first initializes reward functions based on environment code and task requirements with an Analysis and Initial Reward (AIR) module, and then iteratively optimizes them based on training feedback with an evolutionary module. The AIR module guides the LLM to deepen its understanding of the code and task through a chain of thought, effectively mitigating hallucination risks in code generation. The evolutionary module fine-tunes and reconstructs the reward function, achieving a balance between exploration diversity and convergence stability for training. To validate our approach, we establish six challenging coordination scenarios with varying complexity levels within the Yangtze River Delta transportation network simulation. Comparative experimental results demonstrate that RL agents utilizing PCRD-generated reward functions consistently outperform those with human-engineered reward functions, achieving on average 10% higher performance metrics across all scenarios.
Problem

Research questions and friction points this paper is trying to address.

Automating reward function design for RL in platoon coordination
Addressing variability and complexity in cooperative platoon goals
Reducing manual trial-and-error in RL training optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven automated reward function initialization
Iterative evolutionary optimization for reward tuning
Chain-of-thought mitigation of LLM hallucination risks
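The "fine-tune and reconstruct" behavior of the evolutionary module can be illustrated with a toy weight-level evolution. This is a minimal sketch under assumed simplifications (reward functions reduced to a dictionary of term weights; a generic (1+λ) selection scheme), not the paper's actual operator set.

```python
# Toy illustration of the fine-tune vs. reconstruct trade-off: small Gaussian
# mutations locally adjust existing reward-term weights (convergence), while
# occasional random resets explore new values (diversity). Illustrative only.
import random

def mutate_weights(weights, scale=0.1, reset_prob=0.2, rng=random):
    """Perturb reward-term weights; occasionally reset one for exploration."""
    new = dict(weights)
    for key in new:
        if rng.random() < reset_prob:
            new[key] = rng.uniform(-1.0, 1.0)   # reconstruct: fresh value
        else:
            new[key] += rng.gauss(0.0, scale)   # fine-tune: local tweak
    return new

def evolve(init_weights, fitness, generations=20, pop_size=8, rng=random):
    """Simple (1+lambda) evolution of reward weights under a fitness function."""
    best, best_fit = init_weights, fitness(init_weights)
    for _ in range(generations):
        children = [mutate_weights(best, rng=rng) for _ in range(pop_size)]
        for child in children:
            f = fitness(child)
            if f > best_fit:
                best, best_fit = child, f
    return best, best_fit
```

In PCRD the "fitness" of a candidate is the RL agent's task performance after training with that reward, so each evaluation is far more expensive than in this toy, which is precisely why balancing exploration against convergence matters.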
Dixiao Wei
Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, 201210, China
Peng Yi
Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, 201210, China; Department of Control Science and Engineering, Tongji University, Shanghai, 201804, China
Jinlong Lei
Department of Control Science and Engineering, Tongji University
game theory, stochastic optimization, distributed optimization, stochastic approximation, multi-agent systems
Yiguang Hong
Institute of Systems Science, Chinese Academy of Sciences
Multi-agent systems, distributed optimization/game, nonlinear dynamics and control, machine learning, automata
Yuchuan Du
Professor of Transportation Engineering, Tongji University
Connected and Automated Vehicles, Smart Infrastructure