🤖 AI Summary
This study investigates whether large language models implicitly plan for future linguistic goals, such as rhyming words or specific answers, while generating the current token. To this end, the authors propose a lightweight, scalable evaluation framework that uses vector-intervention techniques to inject guiding signals at the end of the preceding context and quantifies how strongly intermediate tokens are influenced by anticipated future targets. Applying this method, they provide systematic evidence of implicit planning across many models, including models as small as 1 billion parameters, demonstrating that such models can be reliably steered toward desired outcomes (e.g., words ending in "-ight" or the word "whale"). The approach offers a novel tool for probing internal model mechanisms and improving the controllability of AI systems.
📝 Abstract
Prior work suggests that language models, though trained on next-token prediction, show implicit planning behavior: they may select the next token in preparation for a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyming poetry generation and question answering, we demonstrate that our methodology scales easily to many models. Across models, we find that the generated rhyme (e.g., "-ight") or answer to a question ("whale") can be manipulated by steering with a vector at the end of the preceding line, affecting the generation of the intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable, direct way to study the implicit planning abilities of LLMs. More broadly, understanding the planning abilities of language models can inform decisions in AI safety and control.
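The core intervention, steering a hidden state at the end of the preceding line with a vector, can be sketched as follows. This is a minimal illustration using a toy hidden state and a difference-of-means steering vector; all names, dimensions, and data here are hypothetical assumptions for illustration, not the authors' implementation:

```python
import numpy as np

# Minimal sketch of a steering-vector intervention. A steering vector for a
# target concept (e.g. the "-ight" rhyme) is taken as the difference between
# mean hidden states of contexts that do vs. do not lead to that target.

rng = np.random.default_rng(0)
d = 16  # hidden size of a toy model (hypothetical)

# Hypothetical cached hidden states at the end of the preceding line
h_target = rng.normal(size=(8, d)) + 1.0  # contexts leading to the target rhyme
h_other = rng.normal(size=(8, d))         # contexts leading elsewhere

steering_vec = h_target.mean(axis=0) - h_other.mean(axis=0)

def steer(hidden, vec, alpha=4.0):
    """Add the scaled steering vector to a hidden state at the intervention point."""
    return hidden + alpha * vec

def cos(a, b):
    """Cosine similarity, to check that steering moves the state toward the target."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

h = rng.normal(size=d)               # hidden state at the end of the preceding line
h_steered = steer(h, steering_vec)   # intervened state fed to subsequent layers
```

In a real model, `h` would be the residual-stream activation at the chosen layer and position, and generation would continue from the modified state, so that the intermediate tokens and the final rhyme or answer word reflect the injected target.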