🤖 AI Summary
This study addresses a critical limitation in large language models (LLMs) when deployed as autonomous agents: their inability to reliably perform endogenous stochastic sampling from prescribed probability distributions, resulting in outputs that significantly deviate from the target distributions. The work presents the first systematic characterization and quantification of this “pseudo-randomness” problem. Through comprehensive experiments spanning diverse model architectures, scales, prompting strategies, and target distributions, together with controlled random-seed generation, the authors rigorously evaluate LLMs’ intrinsic sampling capabilities. The findings reveal pervasive sampling biases across mainstream models; even state-of-the-art systems only partially mitigate the issue when supplied with external random seeds, falling short of achieving genuinely reliable endogenous randomness.
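Quantifying how far sampled outputs deviate from a target distribution, as the evaluation above does, can be sketched with a standard discrepancy measure. The snippet below (an illustrative sketch, not the paper's actual protocol; the toy "biased LLM" samples are invented for demonstration) computes the total variation distance between an empirical sample and a target probability mass function:

```python
from collections import Counter

def total_variation_distance(samples, target):
    """Empirical total variation distance between observed samples
    and a target pmf (dict mapping outcome -> probability)."""
    counts = Counter(samples)
    n = len(samples)
    support = set(target) | set(counts)
    return 0.5 * sum(
        abs(counts.get(x, 0) / n - target.get(x, 0.0)) for x in support
    )

# Toy example: a biased sampler over-produces "H" relative to a fair coin.
target = {"H": 0.5, "T": 0.5}
biased_samples = ["H"] * 70 + ["T"] * 30
print(total_variation_distance(biased_samples, target))  # → 0.2
```

A distance of 0 would mean the empirical frequencies exactly match the target; larger values indicate the kind of sampling bias the study reports.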
📝 Abstract
In this work, we demonstrate that reliable stochastic sampling is a fundamental yet unfulfilled requirement for Large Language Models (LLMs) operating as agents. Agentic systems frequently need to sample from distributions, often inferred from observed data, and the LLM must emulate this sampling process itself. This creates a distinct failure point: whereas standard RL agents rely on external sampling mechanisms, LLMs fail to map their internal probability estimates to their stochastic outputs. Through rigorous empirical analysis across multiple model families, model sizes, prompting styles, and distributions, we demonstrate the extent of this failure. Crucially, we show that while powerful frontier models can convert provided random seeds into target distributions, their ability to sample directly from specific distributions is fundamentally flawed.
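The seed-to-distribution conversion the abstract refers to can be illustrated with inverse-CDF sampling: an externally supplied seed is turned into a uniform value, which is then mapped through the target distribution's cumulative probabilities. This is a generic sketch of the technique, not the paper's experimental setup; the pmf and seed values are hypothetical.

```python
import random

def sample_from_pmf(seed, pmf):
    """Deterministically map an external seed to a draw from a discrete
    pmf (dict mapping outcome -> probability) via inverse-CDF sampling."""
    u = random.Random(seed).random()  # seeded uniform value in [0, 1)
    cumulative = 0.0
    for outcome, p in pmf.items():
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome  # guard against floating-point rounding at the tail

# Hypothetical target distribution; the same seed always yields the same draw.
pmf = {"A": 0.2, "B": 0.5, "C": 0.3}
print(sample_from_pmf(42, pmf))
```

Given an external source of uniform randomness, this mapping is purely mechanical, which is why frontier models can perform it; the failure the paper identifies lies in generating the randomness endogenously, without such a seed.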