$\phi$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the exploration-exploitation imbalance in large language model (LLM) inference, this paper proposes $\phi$-Decoding, a decoding strategy built on foresight sampling: candidate reasoning paths are simulated forward, and the next inference step is selected by sampling from a joint distribution that combines a foresight distribution with a clustering-based latent-space distribution, yielding a globally informed estimate of step value. The method further incorporates lightweight in-width and in-depth pruning with Monte Carlo step-value estimation to allocate computation adaptively. To the authors' knowledge, this is the first inference-time decoding strategy that explicitly models and dynamically balances exploration and exploitation. Evaluated across seven benchmarks, $\phi$-Decoding consistently outperforms search-based baselines, including Tree of Thought (ToT) and depth-first search (DFS), achieving up to a 3.2× speedup in inference latency. The method generalizes across diverse LLM backbones (e.g., Llama, Qwen, Phi-3) and scales across a wide range of compute budgets.
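The step-selection idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rollout_score`, `cluster_id`, and the interpolation weight `alpha` are hypothetical names, and the exact rule for combining the foresight and clustering distributions may differ from the paper's.

```python
import math
import random
from collections import Counter

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def phi_decoding_step(candidates, rollout_score, cluster_id, alpha=0.5):
    """Select the next reasoning step from `candidates` (sketch).

    rollout_score(step) -> float: foresight value of a step, e.g. the mean
        log-probability of a simulated future continuation (assumption).
    cluster_id(step) -> hashable: latent-space cluster label for the step.
    alpha: interpolation weight between the two distributions (hypothetical).
    """
    # Foresight distribution: softmax over simulated-future scores.
    p_foresight = softmax([rollout_score(c) for c in candidates])

    # Clustering distribution: mass proportional to the size of each
    # candidate's cluster, treating consensus among candidates as a prior.
    labels = [cluster_id(c) for c in candidates]
    counts = Counter(labels)
    sizes = [counts[lbl] for lbl in labels]
    total = sum(sizes)
    p_cluster = [s / total for s in sizes]

    # Joint distribution: a simple interpolation of the two components.
    p_joint = [alpha * f + (1 - alpha) * c
               for f, c in zip(p_foresight, p_cluster)]

    # Sample the step to exploit, rather than greedily taking the argmax.
    return random.choices(candidates, weights=p_joint, k=1)[0]
```

Sampling from the joint distribution (rather than taking the argmax of either component) is what lets the strategy keep exploring steps that score moderately under foresight but sit in a large, self-consistent cluster.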

📝 Abstract
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance when deriving the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain a globally optimal step estimation. Building on this, we propose a novel decoding strategy named $\phi$-Decoding. To provide a precise and expressive estimation of step value, $\phi$-Decoding approximates two distributions via foresight and clustering. By sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, offering a lightweight solution for inference efficiency. Extensive experiments across seven benchmarks show that $\phi$-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at https://github.com/xufangzhi/phi-Decoding, and the open-source PyPI package is coming soon.
Problem

Research questions and friction points this paper is trying to address.

Search-based inference-time strategies over-explore the vast step space while under-exploiting promising reasoning paths
Auto-regressive generation is short-sighted, making globally optimal step estimation difficult
Deliberate inference-time reasoning is computationally expensive without adaptive allocation of the compute budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foresight sampling for optimal step estimation
In-width and in-depth pruning for efficiency
Joint distribution sampling for step selection
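The two pruning strategies listed above can be illustrated with a short sketch. All names and thresholds here (`keep_k`, `min_depth`, `eps`) are hypothetical; the paper's actual pruning criteria may differ.

```python
import statistics

def prune_in_width(candidates, scores, keep_k=3):
    """In-width pruning (sketch): keep only the top-`keep_k` candidate
    steps by foresight score, so the more expensive joint-distribution
    sampling runs over a narrower beam."""
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:keep_k]]

def should_stop_foresight(step_values, min_depth=2, eps=0.05):
    """In-depth pruning (sketch): stop extending a simulated rollout once
    the Monte Carlo step-value estimates collected so far have stabilised
    (low spread over the last `min_depth` steps), saving compute."""
    if len(step_values) < min_depth:
        return False
    return statistics.pstdev(step_values[-min_depth:]) < eps
```

Together these mirror the adaptive-allocation idea: width pruning bounds how many branches get foresight rollouts, and depth pruning bounds how long each rollout runs.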