Learning to Reason Across Parallel Samples for LLM Reasoning

๐Ÿ“… 2025-06-10
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This paper addresses the instability and poor generalization of large language models (LLMs) in reasoning tasks under single-sample inference. To mitigate this, we propose a test-time multi-path reasoning optimization framework. Our core method generates multiple reasoning samples in parallel, concatenates them into a unified sequence, and feeds it to a lightweight, trainable Sample Set Aggregator (SSA) model that performs end-to-end regression to the optimal answer. Crucially, we formulate sample aggregation as a sequence-to-sequence task for the first timeโ€”decoupling sample generation from aggregation and enabling plug-and-play integration with black-box LLMs. Training employs reinforcement learning combined with self-supervision, eliminating the need for human annotations. On mathematical and logical reasoning benchmarks, our approach achieves average accuracy gains of 3.2โ€“7.8 percentage points over reward-model-based re-ranking baselines. Moreover, it demonstrates strong cross-sample-size, cross-LLM-family, and cross-task generalization.

Technology Category

Application Category

๐Ÿ“ Abstract
Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregate their answers (e.g., either through majority voting or using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such multiple sample set. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and output the final answer, optimizing it for the answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. Our approach also shows a promising generalization ability, across sample set sizes, base model families and scales, and tasks. By separating LLMs to generate answers and LLMs to analyze and aggregate sampled answers, our approach can work with the outputs from premier black box models easily and efficiently.
Problem

Research questions and friction points this paper is trying to address.

Improving LLM reasoning via multi-sample aggregation
Training compact LLM to optimize answer accuracy
Enhancing generalization across models and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Train compact LLM for sample set aggregation
Use reinforcement learning for answer accuracy
Separate answer generation and aggregation processes
๐Ÿ”Ž Similar Papers
No similar papers found.