🤖 AI Summary
To address the challenges of ambiguous action spaces and inefficient exploration in reinforcement learning (RL) fine-tuning of large language models (LLMs), this paper proposes the first structured latent action space framework tailored for LLMs. The method, CoLA, decouples semantic decision-making from token generation by embedding compact latent actions into a pretrained LLM via a learnable latent action encoder, preserving the original model's capabilities while improving controllability and semantic diversity. Applied to Llama-3.1-8B, CoLA with RL scores 42.4 on Math500 (versus a 38.2 baseline) and reaches 68.2 when combined with a Monte Carlo Tree Search (MCTS) variant; it also consistently improves performance across agent-based tasks and halves computation time on tasks where RL is used to enhance thinking prompts. The core contribution is the first learnable, interpretable, and RL-friendly latent action representation for LLMs, effectively bridging symbolic decision-making and neural text generation.
📝 Abstract
Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. We propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs. We apply CoLA to the Llama-3.1-8B model. Our experiments demonstrate that, compared to RL with token-level actions, CoLA's latent actions enable greater semantic diversity in text generation. On downstream tasks, CoLA with RL achieves a score of 42.4 on the Math500 benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM's capabilities, unlike the baseline. Finally, CoLA halves computation time in tasks where RL is used to enhance thinking prompts for LLMs. These results highlight CoLA's potential to advance RL-based adaptation of LLMs for downstream applications.
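To make the core idea concrete, here is a minimal, hypothetical PyTorch sketch of a latent-action layer in the spirit of CoLA: a small learnable encoder maps the LLM's hidden state to a compact discrete latent action, whose embedding is injected back into the hidden state before token generation, so an RL policy can act over latent actions instead of individual tokens. All module names, the discrete codebook design, and the dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LatentActionPolicy(nn.Module):
    """Hypothetical sketch of a CoLA-style latent action space.

    An encoder scores a small set of discrete latent actions from the
    LLM's hidden state; the chosen action's embedding conditions the
    hidden state that the LLM then decodes into tokens. This decouples
    the RL action space (compact, semantic) from token-level generation.
    """

    def __init__(self, hidden_dim=4096, latent_dim=32, num_actions=16):
        super().__init__()
        # Encoder: hidden state -> logits over discrete latent actions
        self.encoder = nn.Linear(hidden_dim, num_actions)
        # Codebook: one learnable embedding per latent action
        self.codebook = nn.Embedding(num_actions, latent_dim)
        # Project the latent action back into the LLM's hidden space
        self.proj = nn.Linear(latent_dim, hidden_dim)

    def forward(self, hidden_state):
        logits = self.encoder(hidden_state)       # (batch, num_actions)
        action = torch.argmax(logits, dim=-1)     # RL policy's latent action
        z = self.codebook(action)                 # (batch, latent_dim)
        # Conditioned hidden state feeds the frozen LLM's decoding head
        return hidden_state + self.proj(z), action
```

In this sketch the base LLM's weights would stay frozen, so pre-trained capabilities are untouched while RL (or an MCTS-style search) explores only the small `num_actions`-way latent space rather than the full vocabulary at every step.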