🤖 AI Summary
Explicit chain-of-thought (CoT) reasoning in large language models (LLMs) incurs high computational overhead and slow inference.
Method: We propose an implicit reasoning framework that models LLM inference as temporally extended abstract actions (“options”) in a hierarchical reinforcement learning setting, enabling extended decision-making in latent space. Our core algorithm, the Variational Markovian Option Critic (VMOC), provides a theoretical optimality guarantee for the abstraction via continuous MDP homomorphisms; initializes the latent option space through a cold-start procedure that distills human reasoning demonstrations; and combines variational inference, HiT-MDP modeling, off-policy learning, and supervised fine-tuning (SFT) for efficient acquisition and cross-task transfer of abstract skills.
Results: Experiments demonstrate substantial improvements over explicit CoT baselines on complex logical reasoning and embodied control tasks, validating the framework’s effectiveness and generalizability in learning domain-agnostic abstract skills across both language and control modalities.
📝 Abstract
Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model "thinks" in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms, proving that learning a policy in the simplified, abstract latent space, the setting for which VMOC is designed, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model's reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.
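To make the variational option mechanism concrete, the following is a minimal toy sketch of the core idea: a state encoder parameterizes a Gaussian posterior over a latent option, an option is sampled via the reparameterization trick, and a policy acts conditioned on that latent option while a KL term regularizes the option space. All dimensions, weight matrices, and function names here are hypothetical illustrations, not the paper's actual architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (the paper's actual architecture is not specified here)
STATE_DIM, OPTION_DIM, ACTION_DIM = 4, 3, 2

# Hypothetical linear "encoder": state -> mean and log-std of the option posterior q(z|s)
W_mu = rng.normal(size=(OPTION_DIM, STATE_DIM)) * 0.1
W_logstd = rng.normal(size=(OPTION_DIM, STATE_DIM)) * 0.1
# Hypothetical policy weights: (state, latent option) -> action logits
W_pi = rng.normal(size=(ACTION_DIM, STATE_DIM + OPTION_DIM)) * 0.1

def sample_option(state):
    """Sample a latent option z ~ q(z|s) via the reparameterization trick."""
    mu, logstd = W_mu @ state, W_logstd @ state
    eps = rng.normal(size=OPTION_DIM)
    return mu + np.exp(logstd) * eps, mu, logstd

def policy(state, z):
    """Softmax policy pi(a | s, z) conditioned on the sampled latent option."""
    logits = W_pi @ np.concatenate([state, z])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def kl_to_standard_normal(mu, logstd):
    """KL( q(z|s) || N(0, I) ): a standard variational regularizer on the option space."""
    return 0.5 * np.sum(np.exp(2 * logstd) + mu**2 - 1 - 2 * logstd)

state = rng.normal(size=STATE_DIM)
z, mu, logstd = sample_option(state)
probs = policy(state, z)  # action distribution under the current latent "thought"
kl = kl_to_standard_normal(mu, logstd)  # non-negative by construction
```

In an actual actor-critic training loop, the KL term and the critic's value estimates would jointly shape which latent options are worth keeping; this sketch only shows the forward pass.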