🤖 AI Summary
In single-stream autoregressive generation, the same tokens both update the model state and are emitted as output, which incurs a "silence tax": delaying output harms responsiveness, while emitting too early risks premature commitments that bias later generation. This work proposes Side-by-Side (SxS) Interleaved Reasoning, which treats disclosure timing as a controllable decision: the model alternates between private reasoning and partial public output within the same context and releases content only when the reasoning so far supports it. Training uses entailment-aligned interleaved trajectories, built by matching answer prefixes to supporting reasoning prefixes; supervised fine-tuning (SFT) teaches the dual-action semantics, and reinforcement learning (RL) recovers reasoning performance under the new format. On Qwen3-30B-A3B (MoE) and Qwen3-4B (dense), evaluated on AIME25 (in-domain) and GPQA-Diamond (out-of-domain), SxS improves the Pareto trade-off between accuracy and content latency, as measured by token-level proxies such as inter-update waiting.
📝 Abstract
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a \emph{silence tax}: additional deliberation postpones the first \emph{task-relevant} content, while naive early streaming risks premature commitments that bias subsequent generations. We introduce \textbf{\emph{Side-by-Side (SxS)}} Interleaved Reasoning, which makes \emph{disclosure timing} a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is \emph{supported} by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE \textbf{Qwen3-30B-A3B}, dense \textbf{Qwen3-4B}) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves accuracy--\emph{content-latency} Pareto trade-offs under token-level proxies (e.g., inter-update waiting).
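To make the token-level latency proxies concrete, the following is a minimal Python sketch, not the authors' evaluation code. It assumes a trajectory is a list of `(kind, token_count)` segments, where `"think"` marks private reasoning and `"say"` marks a public disclosure. The segment labels, the function names, and the exact metric definitions (private tokens before the first disclosure, and the longest private run before any disclosure) are illustrative assumptions; the abstract names inter-update waiting only as an example of such a proxy.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (kind, token_count), kind is "think" or "say".
Segment = Tuple[str, int]


def waiting_gaps(trajectory: List[Segment]) -> List[int]:
    """Return the number of private tokens preceding each public disclosure."""
    gaps: List[int] = []
    pending = 0  # private tokens accumulated since the last disclosure
    for kind, n_tokens in trajectory:
        if kind == "think":
            pending += n_tokens
        elif kind == "say":
            if n_tokens > 0:
                gaps.append(pending)
                pending = 0
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    # Private tokens after the final disclosure are not counted as waiting.
    return gaps


def latency_proxies(trajectory: List[Segment]) -> Dict[str, int]:
    """Compute two assumed token-level proxies for content latency."""
    gaps = waiting_gaps(trajectory)
    if not gaps:
        # Nothing was ever disclosed: the whole private run counts as waiting.
        total = sum(n for kind, n in trajectory if kind == "think")
        return {"first_content_wait": total, "max_inter_update_wait": total}
    return {
        "first_content_wait": gaps[0],
        "max_inter_update_wait": max(gaps),
    }


# Same total reasoning budget (900 private tokens), two disclosure schedules.
think_then_answer = [("think", 900), ("say", 100)]
interleaved = [
    ("think", 200), ("say", 30),
    ("think", 400), ("say", 40),
    ("think", 300), ("say", 30),
]

print(latency_proxies(think_then_answer))
print(latency_proxies(interleaved))
```

Under these assumed definitions, the think-then-answer schedule waits 900 tokens before any content appears, while the interleaved schedule waits 200 tokens for the first disclosure and at most 400 between updates. The sketch measures pacing only; it does not check whether a disclosure is actually supported by the preceding reasoning, which is the part SxS learns.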