🤖 AI Summary
This work bridges the gap between the practical inference behavior of large language models (LLMs) and their theoretical analysis, focusing on the intrinsic mechanisms by which test-time computation—such as chain-of-thought reasoning and multi-candidate sampling—improves performance. We study in-context linear regression as a canonical task and introduce a novel theoretical framework that explicitly models decoding stochasticity and uncertainty via noise injection and sampling over binary or continuous coefficients. Crucially, this is the first framework to incorporate realistic LLM inference dynamics—including sampling-based generation and inherent randomness—into a rigorous, analytically tractable paradigm that remains empirically verifiable. Our theoretical analysis demonstrates how test-time computation mitigates overfitting and enhances generalization. Extensive experiments on synthetic and semi-realistic datasets consistently validate the framework's predictions. The result is an interpretable, scalable theoretical foundation for understanding LLM inference beyond static, deterministic assumptions.
📝 Abstract
Using more test-time computation during language model inference, such as generating more intermediate thoughts or sampling multiple candidate answers, has proven effective at significantly improving model performance. This paper takes an initial step toward bridging the gap between practical language model inference and theoretical transformer analysis by incorporating randomness and sampling. We focus on in-context linear regression with continuous/binary coefficients, where our framework simulates language model decoding through noise injection and binary coefficient sampling. Through this framework, we provide detailed analyses of widely adopted inference techniques. Supported by empirical results, our theoretical framework and analysis demonstrate the potential to offer new insights into inference behaviors in real-world language models.
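To make the setup concrete, the following is a minimal sketch (not the paper's actual construction; all names, noise scales, and the least-squares stand-in for the transformer are assumptions) of how decoding stochasticity can be modeled on in-context linear regression: the in-context "solution" is perturbed by injected noise at each stochastic decode, and spending more test-time computation by averaging many sampled candidates reduces the error of the final prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context linear regression: context pairs (x_i, y_i) with y ≈ w^T x,
# plus a query x_q. As a stand-in for the trained transformer, we fit w by
# least squares; decoding stochasticity is modeled as Gaussian noise
# injected into the estimated coefficients at each sampled "decode".
d, n = 5, 20
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)
x_q = rng.standard_normal(d)
y_q = float(w_true @ x_q)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def sample_prediction(sigma=0.5):
    """One stochastic decode: coefficients perturbed by injected noise."""
    return float((w_hat + sigma * rng.standard_normal(d)) @ x_q)

# One stochastic sample vs. K sampled candidates (more test-time compute).
single = sample_prediction()
K = 64
averaged = float(np.mean([sample_prediction() for _ in range(K)]))

err_single = abs(single - y_q)
err_avg = abs(averaged - y_q)
```

Averaging K independent noisy decodes shrinks the variance contributed by the injected noise by a factor of K, so `err_avg` concentrates near the noiseless least-squares error, illustrating (under these toy assumptions) why extra test-time sampling helps.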