🤖 AI Summary
Explicit chain-of-thought (CoT) reasoning in large language models (LLMs) incurs high computational overhead and slow inference.
Method: We propose an implicit reasoning framework that models LLM inference as temporally extended abstract actions (“options”) in a hierarchical reinforcement learning setting, enabling extended decision-making in latent space. Our core algorithm, the Variational Markovian Option Critic (VMOC), provides a theoretical optimality guarantee for the abstraction via continuous MDP homomorphisms; initializes the latent option space through a cold-start procedure that distills human reasoning demonstrations; and combines variational inference, HiT-MDP modeling, off-policy learning, and supervised fine-tuning (SFT) for efficient acquisition and cross-task transfer of abstract skills.
Results: Experiments demonstrate substantial improvements over explicit CoT baselines on complex logical reasoning and embodied control tasks, validating the framework’s effectiveness and generalizability in learning domain-agnostic abstract skills across both language and control modalities.
📝 Abstract
Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model "thinks" in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms, proving that learning a policy in the simplified, abstract latent space, the setting for which VMOC is designed, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model's reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.
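To make the variational option mechanism concrete, the following is a minimal toy sketch of the core idea: a state encoder parameterizes a Gaussian posterior over a latent option, an option is sampled via the reparameterization trick, and a policy acts conditioned on that latent option while a KL term regularizes the option space. All dimensions, weight matrices, and function names here are hypothetical illustrations, not the paper's actual architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (the paper's actual architecture is not specified here)
STATE_DIM, OPTION_DIM, ACTION_DIM = 4, 3, 2

# Hypothetical linear "encoder": state -> mean and log-std of the option posterior q(z|s)
W_mu = rng.normal(size=(OPTION_DIM, STATE_DIM)) * 0.1
W_logstd = rng.normal(size=(OPTION_DIM, STATE_DIM)) * 0.1
# Hypothetical policy weights: (state, latent option) -> action logits
W_pi = rng.normal(size=(ACTION_DIM, STATE_DIM + OPTION_DIM)) * 0.1

def sample_option(state):
    """Sample a latent option z ~ q(z|s) via the reparameterization trick."""
    mu, logstd = W_mu @ state, W_logstd @ state
    eps = rng.normal(size=OPTION_DIM)
    return mu + np.exp(logstd) * eps, mu, logstd

def policy(state, z):
    """Softmax policy pi(a | s, z) conditioned on the sampled latent option."""
    logits = W_pi @ np.concatenate([state, z])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def kl_to_standard_normal(mu, logstd):
    """KL( q(z|s) || N(0, I) ): a standard variational regularizer on the option space."""
    return 0.5 * np.sum(np.exp(2 * logstd) + mu**2 - 1 - 2 * logstd)

state = rng.normal(size=STATE_DIM)
z, mu, logstd = sample_option(state)
probs = policy(state, z)  # action distribution under the current latent "thought"
kl = kl_to_standard_normal(mu, logstd)  # non-negative by construction
```

In an actual actor-critic training loop, the KL term and the critic's value estimates would jointly shape which latent options are worth keeping; this sketch only shows the forward pass.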