When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

📅 2026-05-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
In single-stream autoregressive generation, the same tokens both update the model state and become public output. This coupling incurs a "silence tax": deliberating longer delays the first task-relevant content, while streaming too early risks premature commitments that bias later generation. This work proposes Side-by-Side (SxS) Interleaved Reasoning, which treats disclosure timing as a controllable decision: the model alternates between private reasoning and partial public output within the same context and releases content only when the reasoning so far supports it. Supervised fine-tuning teaches the dual-action semantics, and reinforcement learning then recovers reasoning performance under the new format. On Qwen3-30B-A3B (MoE) and Qwen3-4B (dense), evaluated on AIME25 (in-domain) and GPQA-Diamond (out-of-domain), SxS improves the Pareto trade-off between accuracy and content latency, as measured by token-level proxies such as inter-update waiting.
📝 Abstract
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a "silence tax": additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generations. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves accuracy vs. content-latency Pareto trade-offs under token-level proxies (e.g., inter-update waiting).
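To make the token-level latency proxies concrete, here is a minimal sketch of how they might be computed from an interleaved trace. It is an illustration only, not the paper's evaluation code: the `Segment` type, the "think"/"speak" labels, and the exact definitions (private tokens before the first public token, and the longest run of private tokens preceding any public update) are assumptions chosen for clarity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """One contiguous span of a generation trace (hypothetical representation)."""
    kind: str      # "think" = private reasoning, "speak" = public disclosure
    n_tokens: int  # number of tokens in the span


def latency_proxies(segments: List[Segment]) -> Tuple[Optional[int], int, int]:
    """Return (first_content_wait, max_inter_update_wait, public_tokens).

    first_content_wait: private tokens generated before the first public token
        (None if nothing is ever disclosed).
    max_inter_update_wait: longest run of private tokens that precedes a
        public update. Trailing private reasoning with no later update is
        ignored under this assumed definition.
    public_tokens: total number of disclosed tokens.
    """
    first_wait: Optional[int] = None
    max_wait = 0
    waiting = 0   # private tokens accumulated since the last public update
    public = 0

    for seg in segments:
        if seg.kind == "think":
            waiting += seg.n_tokens
        elif seg.kind == "speak":
            if seg.n_tokens == 0:
                continue  # an empty disclosure does not end the wait
            if first_wait is None:
                first_wait = waiting
            max_wait = max(max_wait, waiting)
            waiting = 0
            public += seg.n_tokens
        else:
            raise ValueError(f"unknown segment kind: {seg.kind!r}")

    return first_wait, max_wait, public


# Think-then-answer: all reasoning comes before any public content.
baseline = [Segment("think", 900), Segment("speak", 100)]

# Interleaved: the same token budget, disclosed in three supported steps.
interleaved = [
    Segment("think", 300), Segment("speak", 30),
    Segment("think", 300), Segment("speak", 30),
    Segment("think", 300), Segment("speak", 40),
]

print(latency_proxies(baseline))
print(latency_proxies(interleaved))
```

With the same total budget, the baseline trace makes the reader wait 900 private tokens before any content, while the interleaved trace waits 300 before the first update and 300 between updates. These numbers are made up for illustration and say nothing about accuracy, which is the other axis of the trade-off.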
Problem

Research questions and friction points this paper is trying to address.

disclosure timing
silence tax
autoregressive generation
reasoning latency
premature commitment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Side-by-Side Interleaved Reasoning
disclosure timing
silence tax
autoregressive generation
reasoning-performance trade-off