🤖 AI Summary
This work addresses real-time query correction, which demands both high accuracy and low latency, requirements that existing Chain-of-Thought (CoT) methods struggle to meet because of their substantial reasoning overhead. The authors propose Sandwich Reasoning (SandwichR), a three-stage Answer-Reasoning-Answer paradigm: the model first generates an initial correction, then produces explicit reasoning, and finally emits a refined correction. A consistency-aware reinforcement learning objective rewards agreement between the initial and final corrections, and margin-based rejection sampling prioritizes borderline samples where reasoning yields the largest corrective gains. The authors also construct a query correction dataset for complex cases. Experiments show that the method reaches state-of-the-art accuracy comparable to standard CoT while reducing inference latency by 40%–70%.
📝 Abstract
Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.
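The abstract names a consistency reward and margin-based rejection sampling but does not give their formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes exact-match scoring after whitespace and case normalization, an additive reward with a hypothetical `consistency_weight`, and a "margin" defined as the gap between final-answer and initial-answer accuracy over a group of sampled rollouts. The names `Rollout`, `rollout_reward`, `reasoning_margin`, and `keep_for_training` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled Answer-Reasoning-Answer output (field names are hypothetical)."""
    initial: str    # fast correction emitted before the reasoning
    reasoning: str  # explicit reasoning text
    final: str      # refined correction emitted after the reasoning


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so that trivial differences do not count."""
    return " ".join(query.lower().split())


def rollout_reward(r: Rollout, gold: str, consistency_weight: float = 0.5) -> float:
    """Assumed reward: final-answer accuracy plus a bonus when initial == final."""
    accuracy = 1.0 if normalize(r.final) == normalize(gold) else 0.0
    consistency = 1.0 if normalize(r.initial) == normalize(r.final) else 0.0
    return accuracy + consistency_weight * consistency


def reasoning_margin(rollouts: List[Rollout], gold: str) -> float:
    """Assumed margin: final-answer accuracy minus initial-answer accuracy."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    target = normalize(gold)
    final_acc = sum(normalize(r.final) == target for r in rollouts) / len(rollouts)
    initial_acc = sum(normalize(r.initial) == target for r in rollouts) / len(rollouts)
    return final_acc - initial_acc


def keep_for_training(rollouts: List[Rollout], gold: str, min_margin: float = 0.25) -> bool:
    """Keep a query when reasoning improves on the initial answer by at least min_margin."""
    return reasoning_margin(rollouts, gold) >= min_margin


if __name__ == "__main__":
    gold = "iphone 15 case"
    group = [
        Rollout("iphone 15 case", "already correct", "iPhone 15  Case"),
        Rollout("ifone 15 case", "'ifone' is a misspelling of 'iphone'", "iphone 15 case"),
    ]
    for r in group:
        print(rollout_reward(r, gold))
    print(reasoning_margin(group, gold), keep_for_training(group, gold))
```

In this sketch the second rollout earns only the accuracy term, because its initial correction disagrees with its final one; that gap is the signal a consistency reward would push the model to close. The filter keeps queries where the final answers beat the initial answers across the group, which is one possible reading of "borderline samples where reasoning drives the most impactful corrective gains".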