AI Summary
To address the low inference efficiency and high computational overhead of large language models on complex tasks such as mathematical reasoning, this paper proposes a multi-branch parallel decoding method that operates within a single output sequence. The core innovation is a custom-designed sparse attention mask that enables simultaneous token generation across multiple independent reasoning paths within one sequence, explicitly modeling parallel reasoning trajectories. Crucially, the method requires no architectural modifications or retraining, only decoding-time changes. Experiments on mainstream mathematical reasoning benchmarks demonstrate over 100% decoding speedup with negligible accuracy degradation. To our knowledge, this is the first work to achieve multi-path parallel reasoning under a single-sequence decoding framework, establishing a new paradigm for long-chain symbolic reasoning that jointly optimizes efficiency and accuracy.
Abstract
Recent advances in reasoning models have demonstrated significant accuracy improvements, particularly on complex tasks such as mathematical reasoning, by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask, processing them within a single sequence. Experimental results show that our method achieves over 100% speedup in decoding time while largely preserving accuracy.
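To make the masking idea concrete, here is a minimal NumPy sketch of what such a branch-aware sparse attention mask could look like. This is an illustration under assumed conventions, not the paper's implementation: it supposes the sequence is laid out as a shared prefix followed by the branches back to back, and the names `prefix_len` and `branch_lens` are hypothetical. Each branch token attends causally to the shared prefix and to earlier tokens of its own branch, but never to sibling branches, which is what allows the branches to be decoded as if they were independent sequences.

```python
import numpy as np

def parallel_branch_mask(prefix_len, branch_lens):
    """Boolean attention mask (True = may attend) for one sequence laid
    out as [shared prefix | branch 0 | branch 1 | ...].

    Prefix tokens use ordinary causal attention; each branch token sees
    the full prefix plus earlier tokens of its own branch only.
    """
    total = prefix_len + sum(branch_lens)
    mask = np.zeros((total, total), dtype=bool)

    # Ordinary causal attention over the shared prefix.
    for i in range(prefix_len):
        mask[i, : i + 1] = True

    # Each branch: attend to the prefix and causally within itself.
    start = prefix_len
    for blen in branch_lens:
        for i in range(start, start + blen):
            mask[i, :prefix_len] = True    # shared prefix is visible
            mask[i, start : i + 1] = True  # causal within own branch
        start += blen
    return mask

# Prefix of 2 tokens, two branches of 2 tokens each (positions 2-3 and 4-5).
m = parallel_branch_mask(2, [2, 2])
print(m.astype(int))
```

With a mask like this, one forward pass can score the next token of every branch at once, since no branch's logits depend on the other branches' tokens.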