Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs

📅 2025-07-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Explicit chain-of-thought (CoT) reasoning in large language models (LLMs) incurs high computational overhead and slow inference. Method: We propose an implicit reasoning framework that models LLM inference as temporally extended abstract actions (“options”) in hierarchical reinforcement learning, enabling decision-making in latent space. Our core algorithm, the Variational Markovian Option Critic (VMOC), theoretically guarantees optimality-preserving abstraction via continuous MDP homomorphisms; initializes the latent option space through a cold-start procedure that distills human reasoning demonstrations; and integrates variational inference, HiT-MDP modeling, off-policy learning, and supervised fine-tuning (SFT) for efficient acquisition and cross-task transfer of abstract skills. Results: Experiments demonstrate substantial improvements over explicit CoT baselines on complex logical reasoning and embodied control tasks, validating the framework’s effectiveness and generalizability in learning domain-agnostic abstract skills across both language and control modalities.

📝 Abstract
Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model "thinks" in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally-extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms. This proves that learning a policy in the simplified, abstract latent space, for which VMOC is suited, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model's reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.
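The abstract's central theoretical claim is that solving the simplified abstract MDP preserves optimality in the original problem. The paper extends this to continuous MDP homomorphisms; the sketch below illustrates only the classical finite, deterministic case, under assumed conditions: if a state abstraction map preserves rewards and maps transitions consistently, optimal values computed in the abstract MDP lift back exactly to the ground MDP. All names here (`phi`, `reps`, `value_iter`) are illustrative, not from the paper.

```python
# Ground MDP: 4 states, 2 actions, deterministic transitions.
# States 1 and 2 behave identically, so an abstraction may merge them.
S, A = [0, 1, 2, 3], [0, 1]
T = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 0,
     (2, 0): 3, (2, 1): 0, (3, 0): 3, (3, 1): 3}
REWARD = {(1, 0): 1.0, (2, 0): 1.0}  # all other rewards are 0

def r(s, a):
    return REWARD.get((s, a), 0.0)

# State abstraction: merge the behaviourally identical states 1 and 2.
phi = {0: 0, 1: 1, 2: 1, 3: 2}

def is_homomorphism():
    """Deterministic homomorphism conditions: states mapped together must
    share rewards and phi-consistent successors for every action."""
    return all(r(s1, a) == r(s2, a) and phi[T[s1, a]] == phi[T[s2, a]]
               for s1 in S for s2 in S if phi[s1] == phi[s2] for a in A)

def value_iter(states, step, iters=300):
    """Plain value iteration; `step` computes the one-step backup."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(step(s, a, V) for a in A) for s in states}
    return V

# Optimal values in the ground MDP (discount 0.9).
V = value_iter(S, lambda s, a, V: r(s, a) + 0.9 * V[T[s, a]])

# Abstract MDP built from one representative ground state per abstract state.
reps = {0: 0, 1: 1, 2: 3}
V_abs = value_iter(list(reps),
                   lambda z, a, V: r(reps[z], a) + 0.9 * V[phi[T[reps[z], a]]])

# Optimality lifts: V*(s) == V*_abs(phi(s)) for every ground state.
assert is_homomorphism()
assert all(abs(V[s] - V_abs[phi[s]]) < 1e-9 for s in S)
```

The paper's contribution is proving an analogous guarantee when states, options, and the abstraction map are continuous, which is what licenses planning in the latent option space.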
Problem

Research questions and friction points this paper is trying to address.

Develop efficient implicit reasoning in latent space
Learn diverse options as latent embeddings via VMOC
Preserve optimality in abstract reasoning via MDP homomorphisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Markovian Option Critic algorithm
Continuous MDP homomorphisms theory extension
Supervised fine-tuning cold-start procedure
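The variational inference and cold-start ideas above can be illustrated with a toy evidence lower bound (ELBO): distilling a demonstration into a discrete latent option trades off reconstructing the demonstrated answer against a KL penalty toward the option prior. The numbers and names below are illustrative assumptions, not the paper's actual objective.

```python
import math

def elbo(q, lik, prior):
    """Toy ELBO: E_q[log p(answer | z)] - KL(q(z | demo) || p(z)),
    with q, lik, prior given as lists over discrete latent options z."""
    recon = sum(qz * math.log(lz) for qz, lz in zip(q, lik))
    kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior) if qz > 0)
    return recon - kl

prior = [0.5, 0.5]   # uninformative prior over two latent options
lik = [0.9, 0.1]     # option 0 explains the demonstration well

# A posterior that commits to the explanatory option scores a higher ELBO
# than an uninformative one, despite paying a KL cost against the prior.
committed = elbo([1.0, 0.0], lik, prior)
uniform = elbo([0.5, 0.5], lik, prior)
```

In the paper's setting the posterior and decoder are neural networks over continuous option embeddings, but the same trade-off drives which latent options the SFT demonstrations get distilled into.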
Chang Li
JD Joy Future Academy, China
Yaren Zhang
Carleton University, Canada
Haoran Lv
Amazon Web Services, China
Qiong Cao
JD Exploration Academy, JD.com
Computer Vision · 3D Human-centric Vision · Machine Learning
Chao Xue
Beihang University
Natural Language Processing · Large Language Model
Xiaodong He
JD Joy Future Academy, China