🤖 AI Summary
This work addresses the challenge that large language models (LLMs) struggle to flexibly switch behavioral modes at inference time, such as transitioning from step-by-step reasoning to direct answering, and lack learnable mechanisms for behavior control. To this end, the paper proposes Token-Conditioned Reinforcement Learning (ToCoRL), a framework that uncovers and leverages the "behavioral plasticity" inherent in LLMs. By prefixing generation with specific tokens to guide dynamic behavioral adaptation and using reinforcement learning to convert these transient adjustments into stable, learnable policies, ToCoRL enables precise control over model behavior. The approach substantially improves efficiency and performance on tasks such as factual question answering while preserving the model's original capabilities in complex mathematical reasoning.
📝 Abstract
In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity, akin to chameleons adapting their coloration to environmental cues, that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Building on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable, learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation while continually strengthening exploitation, enabling the emergence of appropriate behaviors. Extensive experiments show that ToCoRL achieves precise behavioral control without capability degradation. Notably, large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, a capability previously hindered by their step-by-step reasoning patterns.
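The mechanism described above can be illustrated with a toy sketch (a hypothetical tabular policy, not the paper's implementation): a conditioning token prefixed at inference time biases which behavioral mode is sampled, and a simple reinforcement update then converts that transient bias into a stable preference that persists even without the token.

```python
import random

# Toy illustration of token-conditioned exploration + reinforcement.
# All names (ToyPolicy, MODES, the weight values) are illustrative
# assumptions, not the paper's actual algorithm or hyperparameters.

MODES = ["reason", "direct"]  # step-by-step reasoning vs direct answering

class ToyPolicy:
    def __init__(self):
        # Unconditioned preference: heavily favors step-by-step reasoning,
        # mimicking a large reasoning model's default behavior.
        self.weights = {"reason": 5.0, "direct": 1.0}

    def probs(self, prefix_token=None):
        w = dict(self.weights)
        if prefix_token in MODES:
            w[prefix_token] += 4.0  # token conditioning biases exploration
        total = sum(w.values())
        return {m: w[m] / total for m in MODES}

    def sample(self, prefix_token=None, rng=random):
        p = self.probs(prefix_token)
        return rng.choices(MODES, weights=[p[m] for m in MODES])[0]

    def reinforce(self, mode, reward, lr=0.5):
        # Crude reinforcement step: raise the weight of rewarded behavior,
        # internalizing the transient token-induced bias.
        self.weights[mode] += lr * reward

def train(policy, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # Explore under the conditioning token; reward the desired mode.
        mode = policy.sample(prefix_token="direct", rng=rng)
        policy.reinforce(mode, reward=1.0 if mode == "direct" else 0.0)
    return policy
```

Before training, the unconditioned policy rarely answers directly; after the reinforcement loop, the "direct" mode dominates even with no prefix token, which is the sense in which the transient adaptation becomes a stable behavioral pattern.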