Think Twice, Act Once: A Co-Evolution Framework of LLM and RL for Large-Scale Decision Making

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale industrial decision-making—such as power grid dispatch with action spaces exceeding 60,000 dimensions—faces dual bottlenecks: limited long-horizon real-time reasoning capability in large language models (LLMs) and low sample efficiency of reinforcement learning (RL) in high-dimensional action spaces. Method: We propose a bidirectional co-evolutionary framework integrating LLMs and RL: the LLM serves both as a policy executor and a trajectory-level value evaluator, while RL generates high-quality fine-tuning data to iteratively improve the LLM. Our approach incorporates multi-step reasoning, environment-verified execution, trajectory-level reward shaping, prioritized experience replay, and joint fine-tuning. Results: Experiments on multiple real-world power grid dispatch tasks demonstrate that our framework significantly outperforms standalone LLM and RL baselines, achieving superior decision accuracy, inference efficiency, and cross-scenario generalization—establishing a scalable paradigm for high-dimensional sequential decision-making.

📝 Abstract
Recent advancements in Large Language Models (LLMs) and Reinforcement Learning (RL) have shown significant promise in decision-making tasks. Nevertheless, for large-scale industrial decision problems, both approaches face distinct challenges: LLMs lack real-time long-sequence decision-making capabilities, while RL struggles with sample efficiency in vast action spaces. To bridge this gap, we propose Agents Co-Evolution (ACE), a synergistic framework between LLMs and RL agents for large-scale decision-making scenarios. ACE introduces a dual-role trajectory refinement mechanism where LLMs act as both Policy Actor and Value Critic during RL's training: the Actor refines suboptimal actions via multi-step reasoning and environment validation, while the Critic performs temporal credit assignment through trajectory-level reward shaping. Concurrently, the RL agent enhances LLMs' task-specific decision-making with high-quality fine-tuning datasets generated via prioritized experience replay. Through extensive experiments across multiple power grid operation challenges with action spaces exceeding 60K discrete actions, ACE demonstrates superior performance over existing RL methods and LLM-based methods.
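The paper does not publish an implementation; the dual-role mechanism described above can be sketched as follows, with all function names (`llm_refine_action`, `llm_score_trajectory`, the environment API) being hypothetical stand-ins:

```python
def refine_trajectory(env, rl_policy, llm_refine_action, llm_score_trajectory,
                      value_threshold=0.5):
    """Roll out the RL policy while the LLM plays both roles: the Actor
    proposes refined actions that must pass environment validation, and
    the Critic reshapes rewards with a single trajectory-level score."""
    state = env.reset()
    trajectory = []
    done = False
    while not done:
        action = rl_policy(state)
        # Actor role: multi-step reasoning proposes a refinement of the
        # RL action; keep it only if the environment confirms validity.
        refined = llm_refine_action(state, action)
        if refined is not None and env.is_valid(state, refined):
            action = refined
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    # Critic role: trajectory-level credit assignment — one score for the
    # whole rollout is folded back into every step's reward.
    score = llm_score_trajectory(trajectory)  # assumed to lie in [0, 1]
    return [(s, a, r + (score - value_threshold)) for s, a, r in trajectory]
```

This is a sketch under stated assumptions, not the authors' code; in particular, how the trajectory-level score is distributed across steps is a design choice the paper's abstract leaves open.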
Problem

Research questions and friction points this paper is trying to address.

LLMs lack real-time long-sequence decision-making capabilities
RL struggles with sample efficiency in vast action spaces
Need for synergistic framework for large-scale industrial decision problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agents Co-Evolution (ACE), a bidirectional co-evolution framework between LLM and RL
Dual-role trajectory refinement mechanism
Prioritized experience replay for fine-tuning
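On the last point, the abstract says the RL agent builds fine-tuning datasets via prioritized experience replay. A minimal sketch of priority-proportional sampling, assuming priorities come from the Critic's trajectory-level scores (the function name and data layout are illustrative, not from the paper):

```python
import random

def sample_finetune_batch(replay, batch_size, alpha=0.6, seed=None):
    """Sample trajectories for LLM fine-tuning with probability proportional
    to priority**alpha, as in standard prioritized experience replay.
    `replay` is a list of (priority, trajectory) pairs."""
    rng = random.Random(seed)
    weights = [p ** alpha for p, _ in replay]
    total = sum(weights)
    probs = [w / total for w in weights]
    chosen = rng.choices(range(len(replay)), weights=probs, k=batch_size)
    return [replay[i][1] for i in chosen]
```

With `alpha = 0` this degenerates to uniform sampling; larger `alpha` concentrates the fine-tuning data on high-priority (high-quality) trajectories, which is presumably what lets RL experience improve the LLM.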
Authors
Xu Wan (Zhejiang University)
Wenyue Xu (Tongji University)
Chao Yang (Alibaba DAMO Academy)
Mingyang Sun (Peking University)

Topics: Reinforcement Learning, Large Language Model, Large-scale Application