Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Majority-vote aggregation fails in open-ended reasoning tasks (e.g., code generation, web-based deep research) due to the unbounded solution space. To address this, we propose ThinkMerge: a training-free, parallel inference fusion method. Instead of voting over complete outputs, ThinkMerge synchronously averages the next-token logits across K parallel generation paths at designated alignment points, yielding a single coherent response. This is the first work to apply test-time parallel scaling to open-domain reasoning, and it natively supports mainstream inference engines (e.g., vLLM, SGLang) and standard decoding strategies (e.g., top-p, top-k). Experiments show that ThinkMerge matches traditional majority voting on AIME and GPQA, improves pass@1 by +8.28% on LiveCodeBench (hard) for DeepCoder-14B-Preview and by +7.58% for Qwen3-8B, and significantly enhances the performance of autonomous agents (e.g., WebSailor) on multiple deep-research benchmarks.

📝 Abstract
Majority voting has proven effective for closed-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
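The core mechanism described in the abstract, averaging next-token logits across K parallel traces at synchronization points, can be illustrated with a minimal sketch. This is not the paper's implementation: `toy_logits` is a hypothetical stand-in for a real LM forward pass (each trace gets a different seed to mimic diverse reasoning paths), and greedy selection stands in for whatever sampling strategy (Top-p/Top-k) is actually used.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def toy_logits(context, vocab_size, seed):
    # Hypothetical stand-in for an LM forward pass: returns
    # next-token logits that depend on the context and the trace seed.
    rng = np.random.default_rng(seed * 1000 + len(context) + sum(context))
    return rng.normal(size=vocab_size)

def thinkmerge_decode(K, steps, vocab_size=16):
    # Run K parallel traces; at each synchronization point, average
    # their next-token logits and commit a single shared token.
    contexts = [[0] for _ in range(K)]
    output = []
    for _ in range(steps):
        per_trace = np.stack(
            [toy_logits(contexts[k], vocab_size, seed=k) for k in range(K)]
        )
        merged = per_trace.mean(axis=0)          # logit averaging across traces
        token = int(np.argmax(softmax(merged)))  # greedy here; Top-p/Top-k also apply
        for c in contexts:
            c.append(token)                      # all traces continue from the shared token
        output.append(token)
    return output
```

In a real system the K traces would be batched requests in an engine like vLLM or SGLang, and the averaging step would hook into the engine's decoding loop rather than a Python list of contexts.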
Problem

Research questions and friction points this paper is trying to address.

Aggregating parallel reasoning traces for open-ended tasks
Applying logit averaging instead of majority voting
Improving performance in code generation and web research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Averages parallel reasoning traces' next-token logits
Seamlessly integrates with vLLM/SGLang and standard decoding
Improves open-ended coding and web-based research tasks
Haonan Wang
National University of Singapore
Chao Du
Sea AI Lab, Singapore
Kenji Kawaguchi
Presidential Young Professor, National University of Singapore
LLMs, Large language model, Deep learning, AI
Tianyu Pang
Sea AI Lab, Singapore