🤖 AI Summary
In conventional autoregressive generation, large language models (LLMs) discard the full next-token distribution after sampling a discrete token at each step, an irreversible information loss that degrades generation quality and reasoning capability. To address this, we propose Mixture of Inputs (MoI), a training-free method that constructs a continuous input representation by weighting and fusing the sampled token's embedding with the posterior expectation vector of its output distribution, thereby preserving and leveraging distributional information throughout generation. MoI introduces zero trainable parameters and incurs negligible computational overhead, enabling, for the first time, end-to-end retention and utilization of distributional information during inference. On challenging tasks spanning mathematical reasoning, code generation, and doctoral-level question answering, MoI consistently improves performance across diverse models such as QwQ-32B and Nemotron-Super-49B, with no fine-tuning, retraining, or architectural modification required.
📝 Abstract
In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
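The Bayesian construction described above can be sketched concretely. A natural reading is a Dirichlet-multinomial update: the next-token distribution scaled by a concentration parameter serves as the prior, the sampled token counts as one observation, and the posterior mean replaces the one-hot vector before the embedding lookup. The sketch below follows that reading under stated assumptions; the helper name `moi_input` and the concentration hyperparameter `h` are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                       # toy vocabulary size and embedding dim
E = rng.normal(size=(V, d))       # token embedding table

def moi_input(p, sampled, h=4.0):
    """Sketch of one Mixture-of-Inputs step.

    Treats the next-token distribution `p` as a Dirichlet prior with
    concentration h * p, the sampled token as a single multinomial
    observation, and returns the embedding of the posterior mean
    instead of the one-hot embedding. `h` is a hypothetical
    hyperparameter controlling how much weight the distribution
    retains relative to the sampled token.
    """
    one_hot = np.zeros_like(p)
    one_hot[sampled] = 1.0
    w = (h * p + one_hot) / (h + 1.0)  # posterior expectation; sums to 1
    return w @ E                       # continuous input embedding

p = np.array([0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03])
sampled = int(rng.choice(V, p=p))
x = moi_input(p, sampled)
print(x.shape)  # (4,)
```

As `h` goes to 0 the mixture collapses to standard one-hot decoding (`moi_input(p, t, h=0)` equals `E[t]`), so the sketch interpolates between conventional autoregressive input and the full expected embedding under the output distribution.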