HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the dynamic trade-off between exploration and exploitation in multi-path reasoning with large language models by framing test-time compute scaling as dynamic expansion and contraction of a hypothesis path pool. The authors propose a lightweight, training-free online strategy that enables phase-aware exploration-exploitation transitions, fine-grained path optimization during generation, and answer aggregation informed by both path length and confidence. Evaluated within a Mixture-of-Experts (MoE) multi-path decoding framework across four MoE models and multiple reasoning benchmarks, the approach achieves an 8–10% accuracy improvement while reducing token consumption by 25–40%.

📝 Abstract
Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.
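The abstract's "dynamic expand-reduce control problem over a pool of hypotheses" can be illustrated with a minimal sketch. Everything below is an assumption for illustration: `Hypothesis`, `control_step`, the mean-log-probability confidence statistic, and the phase threshold are hypothetical names and heuristics, not the paper's actual algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """One reasoning path in the pool (illustrative)."""
    tokens: list = field(default_factory=list)
    logprobs: list = field(default_factory=list)
    finished: bool = False

    def confidence(self) -> float:
        # Mean token log-probability as a cheap per-path statistic.
        return sum(self.logprobs) / max(len(self.logprobs), 1)

def control_step(pool, budget_used, budget_total,
                 expand_phase=0.5, keep_top=2):
    """One expand/reduce decision over the hypothesis pool.

    Early phase (little budget spent): explore by branching the most
    promising unfinished path. Late phase: exploit by pruning all but
    the top-k active paths. Thresholds are illustrative, not tuned.
    """
    phase = budget_used / budget_total  # 0.0 = pure exploration
    active = [h for h in pool if not h.finished]
    if not active:
        return pool
    active.sort(key=lambda h: h.confidence(), reverse=True)
    if phase < expand_phase:
        # Exploration: duplicate the best path as a new hypothesis.
        best = active[0]
        pool.append(Hypothesis(tokens=list(best.tokens),
                               logprobs=list(best.logprobs)))
    else:
        # Exploitation: keep only the top-k active paths.
        survivors = {id(h) for h in active[:keep_top]}
        pool = [h for h in pool if h.finished or id(h) in survivors]
    return pool
```

Under a fixed budget, calling `control_step` after each decoding chunk grows the pool while exploration is cheap and shrinks it as the budget runs out, mirroring the phase-aware transition the abstract describes.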
Problem

Research questions and friction points this paper addresses.

exploration-exploitation trade-off
LLM reasoning
test-time scaling
multi-path decoding
hypothesis paths
Innovation

Methods, ideas, or system contributions that make the work stand out.

hypothesis path expansion
dynamic exploration-exploitation
training-free control
multi-path reasoning
mixture-of-experts
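The summary mentions "answer aggregation informed by both path length and confidence." A minimal sketch of that idea is weighted voting, where each path's vote is scaled by its confidence and discounted by its length. The weighting scheme here (`exp(mean_logprob)` times an exponential length penalty) is an assumption for illustration, not the paper's exact rule.

```python
import math
from collections import defaultdict

def aggregate_answers(paths, length_penalty=0.01):
    """Length- and confidence-aware answer selection (illustrative).

    `paths` is a list of (answer, num_tokens, mean_logprob) tuples.
    Each path votes for its answer with weight
    exp(mean_logprob) * exp(-length_penalty * num_tokens),
    so confident, shorter paths count for more; the answer with the
    highest total weight wins.
    """
    scores = defaultdict(float)
    for answer, num_tokens, mean_logprob in paths:
        weight = math.exp(mean_logprob) * math.exp(-length_penalty * num_tokens)
        scores[answer] += weight
    return max(scores, key=scores.get)
```

Compared with plain majority voting, this lets two moderately confident short paths outvote one verbose low-confidence path, which matches the intuition that correct and incorrect paths often diverge only late.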
🔎 Similar Papers
2024-02-26 · Annual Meeting of the Association for Computational Linguistics · Citations: 97