$\phi$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the exploration-exploitation imbalance in large language model (LLM) inference, this paper proposes $\phi$-Decoding, a decoding strategy built on foresight sampling: candidate reasoning paths are simulated forward, and the next inference step is selected by sampling from a joint distribution that combines a foresight distribution with a clustering-based latent-space distribution, yielding a globally informed estimate of step value. The method further incorporates lightweight in-width and in-depth pruning with Monte Carlo step-value estimation to allocate computation adaptively. To the authors' knowledge, this is the first inference-time decoding strategy that explicitly models and dynamically balances exploration and exploitation. Evaluated across seven benchmarks, $\phi$-Decoding consistently outperforms search-based baselines, including Tree of Thought (ToT) and depth-first search (DFS), achieving up to a 3.2× speedup in inference latency. The method generalizes across diverse LLM backbones (e.g., Llama, Qwen, Phi-3) and scales across a wide range of compute budgets.
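The step-selection idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rollout_score`, `cluster_id`, and the interpolation weight `alpha` are hypothetical names, and the exact rule for combining the foresight and clustering distributions may differ from the paper's.

```python
import math
import random
from collections import Counter

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def phi_decoding_step(candidates, rollout_score, cluster_id, alpha=0.5):
    """Select the next reasoning step from `candidates` (sketch).

    rollout_score(step) -> float: foresight value of a step, e.g. the mean
        log-probability of a simulated future continuation (assumption).
    cluster_id(step) -> hashable: latent-space cluster label for the step.
    alpha: interpolation weight between the two distributions (hypothetical).
    """
    # Foresight distribution: softmax over simulated-future scores.
    p_foresight = softmax([rollout_score(c) for c in candidates])

    # Clustering distribution: mass proportional to the size of each
    # candidate's cluster, treating consensus among candidates as a prior.
    labels = [cluster_id(c) for c in candidates]
    counts = Counter(labels)
    sizes = [counts[lbl] for lbl in labels]
    total = sum(sizes)
    p_cluster = [s / total for s in sizes]

    # Joint distribution: a simple interpolation of the two components.
    p_joint = [alpha * f + (1 - alpha) * c
               for f, c in zip(p_foresight, p_cluster)]

    # Sample the step to exploit, rather than greedily taking the argmax.
    return random.choices(candidates, weights=p_joint, k=1)[0]
```

Sampling from the joint distribution (rather than taking the argmax of either component) is what lets the strategy keep exploring steps that score moderately under foresight but sit in a large, self-consistent cluster.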

📝 Abstract
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance when deriving the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain a globally optimal step estimation. Building on this, we propose a novel decoding strategy named $\phi$-Decoding. To provide a precise and expressive estimation of step value, $\phi$-Decoding approximates two distributions via foresight and clustering. By sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, offering a lightweight solution for inference efficiency. Extensive experiments across seven benchmarks show that $\phi$-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at https://github.com/xufangzhi/phi-Decoding, and the open-source PyPI package is coming soon.
Problem

Research questions and friction points this paper is trying to address.

Search-based inference-time strategies over-explore the vast step space while under-exploiting promising reasoning paths
Auto-regressive generation is short-sighted, making globally optimal step estimation difficult
Deliberate inference-time reasoning is computationally expensive without adaptive allocation of the compute budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foresight sampling for optimal step estimation
In-width and in-depth pruning for efficiency
Joint distribution sampling for step selection
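The two pruning strategies listed above can be illustrated with a short sketch. All names and thresholds here (`keep_k`, `min_depth`, `eps`) are hypothetical; the paper's actual pruning criteria may differ.

```python
import statistics

def prune_in_width(candidates, scores, keep_k=3):
    """In-width pruning (sketch): keep only the top-`keep_k` candidate
    steps by foresight score, so the more expensive joint-distribution
    sampling runs over a narrower beam."""
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:keep_k]]

def should_stop_foresight(step_values, min_depth=2, eps=0.05):
    """In-depth pruning (sketch): stop extending a simulated rollout once
    the Monte Carlo step-value estimates collected so far have stabilised
    (low spread over the last `min_depth` steps), saving compute."""
    if len(step_values) < min_depth:
        return False
    return statistics.pstdev(step_values[-min_depth:]) < eps
```

Together these mirror the adaptive-allocation idea: width pruning bounds how many branches get foresight rollouts, and depth pruning bounds how long each rollout runs.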