Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address slow convergence and training inefficiency in reinforcement learning (RL) under sparse rewards, this paper proposes a teacher-student framework that leverages large language models (LLMs) as general-purpose strategy tutors. Open-source LLMs (Llama, Vicuna, and DeepSeek) autonomously generate reusable, high-level strategic advice to guide conventional RL agents (DQN, PPO, A2C), replacing hand-crafted acceleration techniques that demand dedicated domain expertise. Experiments across benchmark environments (Blackjack, Snake, Connect Four) demonstrate that LLM guidance accelerates convergence by an average factor of 2.1×; a novel advice reuse mechanism further shortens training while preserving final policy performance comparable to baselines, albeit with less stable convergence dynamics. The study also characterizes how effectiveness varies with the specific combination of task, RL algorithm, and LLM. To our knowledge, this is the first systematic investigation into the mechanistic role and practical boundaries of LLMs as reusable strategic tutors for sparse-reward RL.
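A minimal sketch of how such a student-teacher loop could look, assuming a tabular Q-learner on Gymnasium's Blackjack-v1 in place of the paper's DQN/PPO/A2C agents. `query_llm_tutor` is a hypothetical stand-in for the actual Llama/Vicuna/DeepSeek query, and the advice-following probability and its decay schedule are illustrative assumptions, not the paper's protocol.

```python
import random
from collections import defaultdict

import gymnasium as gym


def query_llm_tutor(state):
    """Hypothetical stand-in for the LLM tutor (Llama/Vicuna/DeepSeek).

    Returns high-level advice as an action; a simple blackjack heuristic
    substitutes here so the sketch runs without model access.
    """
    player_sum, dealer_card, usable_ace = state
    return 1 if player_sum < 17 else 0  # 1 = hit, 0 = stick


def train(episodes=10_000, alpha=0.1, gamma=1.0, eps=0.1,
          advice_prob=0.5, advice_decay=0.999):
    env = gym.make("Blackjack-v1")
    q = defaultdict(lambda: [0.0, 0.0])  # state -> Q-value per action

    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            if random.random() < advice_prob:
                action = query_llm_tutor(state)       # follow the tutor
            elif random.random() < eps:
                action = env.action_space.sample()    # explore
            else:
                action = max((0, 1), key=lambda a: q[state][a])
            next_state, reward, term, trunc, _ = env.step(action)
            done = term or trunc
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
        advice_prob *= advice_decay  # rely on the tutor less over time
    return q
```

Decaying `advice_prob` mirrors the intuition that the tutor is most valuable early in training, before the student has collected informative rewards of its own.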

📝 Abstract
Reinforcement Learning (RL) algorithms often require long training before they become useful, especially in complex environments with sparse rewards. While techniques such as reward shaping and curriculum learning can accelerate training, they are often highly task-specific and demand dedicated expertise in the problem's domain from the developer. To tackle this challenge, we explore the effectiveness of pre-trained Large Language Models (LLMs) as tutors for RL algorithms in a student-teacher architecture, hypothesizing that LLM-generated guidance allows for faster convergence. In particular, we examine the effect of reusing the LLM's advice on the RL agent's convergence dynamics. Through an extensive empirical examination spanning 54 configurations, varying the RL algorithm (DQN, PPO, A2C), LLM tutor (Llama, Vicuna, DeepSeek), and environment (Blackjack, Snake, Connect Four), our results demonstrate that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. Furthermore, the advice reuse mechanism further shortens training duration but also results in less stable convergence dynamics. Our findings suggest that LLM tutoring generally improves convergence, with effectiveness that is sensitive to the specific combination of task, RL algorithm, and LLM.
Problem

Research questions and friction points this paper is trying to address.

Accelerating Reinforcement Learning convergence using LLM tutors
Reusing LLM advice to improve training duration efficiency
Evaluating LLM tutoring effectiveness across various algorithms and environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained LLMs tutor reinforcement learning algorithms
LLM-generated guidance enables faster convergence rates
Advice reuse mechanism further reduces training duration
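One plausible reading of the advice reuse mechanism is a per-state cache: advice already produced for a state is served again instead of issuing a fresh (and slow) LLM query. The `ReusableTutor` class and its counters below are hypothetical illustrations, not the paper's implementation.

```python
class ReusableTutor:
    """Caches tutor advice per state so repeated states skip the LLM call.

    `llm_advise` is any callable mapping a hashable state to advice; in the
    paper's setting it would wrap the actual LLM query (hypothetical here).
    """

    def __init__(self, llm_advise):
        self.llm_advise = llm_advise
        self.cache = {}
        self.queries = 0  # fresh LLM calls issued
        self.reuses = 0   # answers served from the cache

    def advise(self, state):
        key = tuple(state) if isinstance(state, list) else state
        if key in self.cache:
            self.reuses += 1
        else:
            self.queries += 1
            self.cache[key] = self.llm_advise(key)
        return self.cache[key]
```

Plugged into the training loop above as `tutor = ReusableTutor(query_llm_tutor)` with calls to `tutor.advise(state)`, the ratio of `queries` to `reuses` quantifies how much LLM query time the cache saves; reusing possibly stale advice is also one intuitive source of the less stable convergence the abstract reports.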
Lukas Toral
Department of Computing, Jönköping University, Jönköping, Sweden
Teddy Lazebnik
Assistant Professor
Computational Mathematics · Scientometrics · Biomathematics · Socio-economic simulations