A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Scaling LLMs by increasing model size or training data shows diminishing returns, and existing inference-time search methods guided by reward models are vulnerable to reward hacking caused by approximation errors in those models. This paper instead casts inference-time scaling as a probabilistic inference task over a state-space model with an approximate likelihood, adapting particle-based Monte Carlo methods to sample from the typical set of the state distribution rather than optimize directly for its mode, which mitigates mode collapse and reward bias. Empirically, the approach achieves a 4-16x better scaling rate than deterministic search baselines on challenging mathematical reasoning tasks: Qwen2.5-Math-1.5B-Instruct surpasses GPT-4o accuracy with only 4 rollouts, and Qwen2.5-Math-7B-Instruct reaches o1-level accuracy within 32 rollouts.
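The propagate-weight-resample loop that the summary describes can be sketched in a few lines. This is a minimal illustration of generic particle filtering applied to step-by-step generation, not the paper's implementation: `extend_step` stands in for an LLM producing one more reasoning step, and `reward` stands in for a process reward model supplying the approximate likelihood; both names are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def particle_filter(extend_step, reward, n_particles=4, n_steps=3, seed=0):
    """Particle filtering over partial generations (illustrative sketch).

    extend_step(particle) -> particle extended by one reasoning step
    reward(particle)      -> score from an approximate likelihood (e.g. a PRM)
    """
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]
    for _ in range(n_steps):
        # Propagate: extend every partial trajectory by one step.
        particles = [extend_step(p) for p in particles]
        # Weight: score each trajectory with the approximate likelihood.
        weights = softmax([reward(p) for p in particles])
        # Resample: draw a new population proportional to the weights.
        # Sampling (rather than keeping only the argmax) preserves diversity
        # and avoids committing to a single reward-hacked mode.
        particles = [rng.choices(particles, weights=weights)[0][:]
                     for _ in range(n_particles)]
    return max(particles, key=reward)

# Toy stand-ins for the LLM step generator and the reward model.
def toy_extend(p):
    return p + [random.choice([0, 1])]

def toy_reward(p):
    return sum(p)  # pretend trajectories containing more 1s score higher
```

With real components, each particle would hold a partial chain-of-thought and `extend_step` would sample the next step from the model; the resampling step is what distinguishes this from deterministic best-first search over reward scores.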

📝 Abstract
Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code and further information are available at https://probabilistic-inference-scaling.github.io.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Computational Efficiency
Reward Hacking in Search-Based Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic Inference Formulation
Particle-Based Monte Carlo Methods
Improved Inference-Time Scaling Efficiency