🤖 AI Summary
Majority-vote aggregation fails on open-ended reasoning tasks (e.g., code generation, web-based deep research) because the solution space is unbounded. To address this, we propose ThinkMerge, a training-free, parallel inference-fusion method. Instead of voting over complete outputs, ThinkMerge synchronously averages next-token logits across K parallel generation paths at designated alignment points, yielding a single coherent response. This is the first work to apply test-time parallel scaling to open-domain reasoning, and it natively supports mainstream inference engines (e.g., vLLM, SGLang) and standard decoding strategies (e.g., top-p, top-k). Experiments show that ThinkMerge matches traditional majority voting on AIME and GPQA, improves pass@1 on LiveCodeBench (hard) by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B, and significantly improves the performance of autonomous agents (e.g., WebSailor) on multiple deep-research benchmarks.
📝 Abstract
Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
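To make the core mechanism concrete, here is a minimal, hedged sketch of logit-averaged parallel decoding. It is not the authors' implementation: it synchronizes at every step (the paper describes designated alignment points), uses a toy stand-in for the language model's forward pass (`toy_logits` is hypothetical), and applies standard top-p filtering to the merged distribution before sampling a single shared token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_logits(seq, trace_id, vocab=8):
    # Hypothetical stand-in for a real LM forward pass: deterministic
    # pseudo-random logits over a small vocabulary, varying per trace.
    rng = np.random.default_rng(hash((tuple(seq), trace_id)) % (2**32))
    return rng.normal(size=vocab)

def thinkmerge_decode(prompt, k=4, steps=5, top_p=0.9, seed=0):
    # K parallel traces start from the same prompt.
    seqs = [list(prompt) for _ in range(k)]
    out, rng = [], np.random.default_rng(seed)
    for _ in range(steps):
        # At each sync point, average next-token logits across the K traces.
        avg = np.mean([toy_logits(s, i) for i, s in enumerate(seqs)], axis=0)
        probs = softmax(avg)
        # Standard top-p (nucleus) filtering on the merged distribution.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cum, top_p)) + 1]
        p = probs[keep] / probs[keep].sum()
        # Sample one token and append it to every trace, so all traces
        # continue from the same merged output.
        tok = int(rng.choice(keep, p=p))
        out.append(tok)
        for s in seqs:
            s.append(tok)
    return out

tokens = thinkmerge_decode([1, 2, 3])
print(len(tokens))
```

In a real engine this averaging would be implemented as a logits-level hook across K concurrent sequences rather than a Python loop, which is why the method composes naturally with top-p/top-k and existing serving stacks.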