🤖 AI Summary
This study investigates why parallel sampling outperforms sequential sampling in large reasoning models, despite the latter's theoretically superior representational capacity. Through systematic controlled experiments and attribution analyses on diverse models—including Qwen3, distilled DeepSeek-R1, and Gemini 2.5—across mathematical and programming tasks, the work proposes and tests three hypotheses. The findings reveal that the underperformance of sequential sampling stems primarily from insufficient exploration, rather than from limitations in aggregation mechanisms or context length constraints. This insight advances the understanding of how sampling strategies influence reasoning performance in large language models and underscores the critical role of exploration breadth in effective model-based reasoning.
📝 Abstract
Large Reasoning Models (LRMs) have shown remarkable performance on challenging tasks such as math and coding. However, obtaining a high-quality solution may require sampling more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first rigorously compare these two approaches and observe, in line with previous work, that parallel sampling seems to outperform sequential sampling even though the latter should have more representational power. To understand the underlying reasons, we form three hypotheses about this behavior: (i) parallel sampling outperforms because of its aggregation operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence across model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that aggregation and context length are not the main culprits behind the performance gap. In contrast, the lack of exploration appears to play a considerably larger role, and we argue that it is a main cause of the performance gap.
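The contrast between the two strategies can be sketched in a few lines. This is a minimal illustration, not the paper's actual setup: `generate` is a hypothetical placeholder for a single LRM decoding call, and the aggregator is assumed to be a simple majority vote over final answers.

```python
import random
from collections import Counter

def generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one stochastic LRM decoding call.
    Deterministic given (prompt, seed) so the sketch is reproducible."""
    rng = random.Random(f"{prompt}|{seed}")
    return rng.choice(["A", "B", "C"])

def parallel_sampling(question: str, n: int) -> str:
    """Draw n independent samples from the same prompt, then aggregate
    by majority vote. Each sample explores the answer space independently."""
    answers = [generate(question, seed=i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]

def sequential_sampling(question: str, n: int) -> str:
    """Draw n samples where each conditions on the previous answer
    (self-refinement style). The context grows with every round, and
    conditioning on earlier answers can narrow exploration."""
    prompt, answer = question, None
    for i in range(n):
        answer = generate(prompt, seed=i)
        prompt = f"{prompt}\nPrevious answer: {answer}. Revise if needed."
    return answer  # the final, most-revised answer

print(parallel_sampling("What is 2+2?", n=5))
print(sequential_sampling("What is 2+2?", n=5))
```

Hypothesis (iii) corresponds to the difference in what each call conditions on: parallel sampling keeps every draw independent, while sequential sampling feeds prior answers back into the prompt.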