Mode-Conditioning Unlocks Superior Test-Time Scaling

📅 2025-11-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In parallel sampling, diversity collapse causes models to converge prematurely to a limited set of erroneous reasoning patterns, severely hindering test-time scaling. To address this, we propose ModC (Mode-Conditioned sampling), a framework that introduces an unsupervised gradient-clustering mechanism to automatically discover implicit reasoning modes, without requiring human annotations. ModC then leverages mode-conditioned prefixes and expert models to allocate computational resources across distinct reasoning paths. The method is model-agnostic, integrates into both training and inference pipelines, and enhances policy diversity in reinforcement learning. Empirical evaluation shows a 4× improvement in sampling efficiency on OpenThoughts, substantial gains in Pass@k, and up to 10% absolute accuracy improvement on benchmarks including NuminaMath.
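The core allocation idea can be sketched in a few lines: instead of drawing all parallel samples from one unconditioned prompt (which a collapsed model answers the same way every time), the sampling budget is split evenly across mode-specific prefixes. This is a minimal illustration, not the paper's implementation; `generate`, `toy_generate`, and the bracketed prefixes are hypothetical stand-ins.

```python
import random
from collections import Counter

def mode_conditioned_sample(generate, problem, mode_prefixes, budget, seed=0):
    """Split a parallel-sampling budget evenly across reasoning-mode prefixes.

    `generate(prompt)` stands in for one stochastic decoder call;
    `mode_prefixes` are short strings that steer the model toward a mode.
    """
    random.seed(seed)
    per_mode = budget // len(mode_prefixes)
    samples = []
    for prefix in mode_prefixes:
        for _ in range(per_mode):
            samples.append(generate(prefix + problem))
    return samples

# Toy decoder: an unconditioned, collapsed sampler would return the same
# (possibly wrong) answer; prefix-conditioning forces coverage of both modes.
def toy_generate(prompt):
    if prompt.startswith("[algebraic]"):
        return "algebraic:" + str(random.randint(0, 2))
    if prompt.startswith("[geometric]"):
        return "geometric:" + str(random.randint(0, 2))
    return "default:0"

out = mode_conditioned_sample(
    toy_generate, "solve x", ["[algebraic] ", "[geometric] "], budget=8
)
print(Counter(s.split(":")[0] for s in out))
```

With a budget of 8 and two prefixes, exactly four samples land in each mode, so repeated draws cannot all repeat the mistakes of a single dominant reasoning path.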

๐Ÿ“ Abstract
Parallel sampling promises substantial gains in test-time scaling, but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across reasoning modes using either specialist models or mode-specific prefixes. ModC consistently improves scaling across controlled graph-search tasks and large-scale reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4x efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without explicit mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves reinforcement learning (RL) and can further boost diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in test-time scaling.
Problem

Research questions and friction points this paper is trying to address.

Addresses diversity collapse in parallel sampling during test-time scaling
Proposes mode-conditioning to allocate compute across reasoning modes
Improves efficiency and performance across models and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mode-conditioning framework allocates compute across reasoning modes
Uses specialist models or mode-specific prefixes to prevent diversity collapse
Improves scaling across tasks and model sizes without explicit labels
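The label-free variant hinges on clustering per-example gradient features so that each cluster acts as an implicit reasoning mode. Below is a minimal sketch of that idea using plain k-means over synthetic gradient vectors; the paper's actual feature choice and clustering procedure may differ, and `cluster_modes` is an assumed helper name.

```python
import numpy as np

def cluster_modes(grad_feats, k, iters=20):
    """Assign training examples to k implicit 'reasoning modes' by running
    k-means over per-example gradient features (e.g. last-layer gradients)."""
    # Farthest-point initialization: deterministic and well spread out.
    centers = [grad_feats[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(grad_feats - c, axis=1) for c in centers], axis=0)
        centers.append(grad_feats[d.argmax()])
    centers = np.stack(centers)
    # Standard Lloyd iterations: assign to nearest center, then recompute means.
    for _ in range(iters):
        d = np.linalg.norm(grad_feats[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = grad_feats[labels == j].mean(axis=0)
    return labels

# Synthetic per-example gradients drawn from two well-separated modes.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0, 0.1, (16, 8)), rng.normal(3, 0.1, (16, 8))])
labels = cluster_modes(feats, k=2)
print(labels)
```

Once examples carry cluster labels, each cluster can be mapped to a mode prefix (or a specialist model) so that training and test-time sampling cover the discovered modes explicitly.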