Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM reasoning approaches such as Chain-of-Thought conflate high-level strategic planning with low-level step-by-step execution, resulting in computational inefficiency, limited exploration of reasoning paths, and poor interpretability. To address this, we propose the Explore-Execute Chain (E²C), a structured framework that decouples reasoning into two distinct phases: *exploration*, which stochastically generates succinct high-level plans, and *execution*, which deterministically carries out the chosen plan. This design enables efficient test-time scaling and cross-domain transfer while reducing computational overhead and improving controllability over reasoning paths. Methodologically, E²C combines supervised fine-tuning (SFT), supported by a novel data synthesis strategy that enforces strict plan adherence, with a subsequent reinforcement learning stage that exploits the informativeness of exploration and reinforces the determinism of execution. On AIME'2024, E²C achieves 58.1% accuracy using fewer than 10% of the decoding tokens required by comparable methods. Furthermore, Exploration-Focused SFT (EF-SFT) fine-tunes with only 3.5% of the tokens used by standard SFT yet yields up to 14.5% higher accuracy on medical benchmarks, establishing a new state of the art.

📝 Abstract
Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs), yet their monolithic and auto-regressive architecture inherently conflates high-level strategic planning with low-level step-by-step execution, leading to computational inefficiency, limited exploration of reasoning paths, and reduced interpretability. To overcome these issues, we propose the Explore-Execute Chain ($E^2C$), a structured reasoning framework that decouples reasoning into two distinct phases: an exploratory phase that stochastically generates succinct high-level plans, followed by an execution phase that deterministically carries out the chosen plan. Our approach incorporates a two-stage training methodology, which combines Supervised Fine-Tuning (SFT), augmented by a novel data generation algorithm enforcing strict plan adherence, with a subsequent Reinforcement Learning (RL) stage that capitalizes on the informativeness of exploration and reinforces the determinism of execution. This decomposition enables an efficient test-time scaling strategy: on AIME'2024, $E^2C$ Test Time Scaling reaches 58.1% accuracy using <10% of the decoding tokens required by comparable methods (e.g., Forest-of-Thought), sharply cutting self-consistency overhead. For cross-domain adaptation, our Exploration-Focused SFT (EF-SFT) fine-tunes with only 3.5% of the tokens used by standard SFT yet yields up to 14.5% higher accuracy than standard SFT on medical benchmarks, delivering state-of-the-art performance, strong generalization, and greater interpretability by separating planning from execution. The code and pre-trained models for the project are available at: https://github.com/yks23/Explore-Execute-Chain.git
Problem

Research questions and friction points this paper is trying to address.

Monolithic, auto-regressive CoT conflates high-level planning with step-by-step execution
Computational inefficiency and limited exploration of alternative reasoning paths
Reduced interpretability and controllability of the reasoning process
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples reasoning into a stochastic exploration phase (plan generation) and a deterministic execution phase
Two-stage training: SFT with a plan-adherence data generation algorithm, followed by reinforcement learning
Reaches 58.1% accuracy on AIME'2024 with fewer than 10% of baseline decoding tokens; EF-SFT adapts cross-domain with 3.5% of standard SFT tokens