🤖 AI Summary
This work addresses the inherent incompatibility of local conditional distributions in masked language models (MLMs), which in general do not correspond to any consistent global joint distribution. The authors model iterative masked-token resampling as a Glauber dynamics Markov chain over the discrete space of token sequences and introduce a rectangle test that certifies this incompatibility. Combining Markov chain theory, a high-temperature contraction analysis, and empirical studies of semantic trajectories, they identify a phase transition in mixing behavior governed by temperature and sequence length. Theoretically, they show that under bounded cross-token influence the mixing time scales as $O(n\log n)$ in the sequence length $n$ at high temperatures, whereas under a uniform local margin condition the chain escapes semantic basins exponentially slowly at low temperatures. Experiments confirm the predicted phase transition and reveal long-lived semantic traps consistent with the theory.
📝 Abstract
Masked language models (MLMs) define local conditional distributions over tokens but do not, in general, correspond to any consistent joint distribution over sequences. This raises a fundamental question: what global distributional behavior is induced when such conditionals are used iteratively for generation? We address this question by modeling iterative masked-token resampling as a Glauber dynamics Markov chain on the discrete space of token sequences. We first show that MLM conditionals are intrinsically incompatible: we introduce a rectangle test that certifies this incompatibility and empirically verify its prevalence across modern MLMs. We then provide a theoretical analysis of the induced Markov chain. Under bounded cross-token influence, we establish a high-temperature contraction result implying $O(n\log n)$ mixing time, where $n$ is the sequence length. In contrast, we prove that under a uniform local margin condition, the chain exhibits metastability, with exponentially slow escape from semantic basins at low temperatures. Empirically, we demonstrate a phase transition in mixing behavior as a function of temperature and sequence length, consistent with the theoretical predictions. We further characterize the induced stationary behavior through semantic trajectories, identifying persistent structures such as long-lived traps and recurrent semantic basins, with political content serving as a measurable case study.
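
The two ingredients named in the abstract, Glauber-style resampling and a rectangle-type compatibility check, can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: `toy_logits` is a hypothetical stand-in for an MLM's masked-token logits, and `rectangle_defect` uses the standard four-cycle ratio identity for compatible two-variable conditionals, since the abstract does not give the exact form of the paper's rectangle test.

```python
import math
import random

def glauber_step(seq, cond_logits, temperature, rng):
    """One Glauber update: resample a uniformly chosen position
    from its temperature-scaled conditional given all other tokens."""
    i = rng.randrange(len(seq))
    logits = [l / temperature for l in cond_logits(seq, i)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    new_seq = list(seq)
    new_seq[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return new_seq

def toy_logits(seq, i, vocab_size=5, coupling=1.0):
    """Hypothetical stand-in for MLM logits at position i:
    each neighbor adds `coupling` to the logit of its own token."""
    logits = [0.0] * vocab_size
    for j in (i - 1, i + 1):
        if 0 <= j < len(seq):
            logits[seq[j]] += coupling
    return logits

def conditionals_from_joint(joint):
    """Exact conditionals of a positive joint table over (x1, x2).
    p1[x][y] = P(x1 = x | x2 = y), p2[x][y] = P(x2 = y | x1 = x)."""
    rows, cols = len(joint), len(joint[0])
    col_sums = [sum(joint[x][y] for x in range(rows)) for y in range(cols)]
    p1 = [[joint[x][y] / col_sums[y] for y in range(cols)] for x in range(rows)]
    p2 = [[joint[x][y] / sum(joint[x]) for y in range(cols)] for x in range(rows)]
    return p1, p2

def rectangle_defect(p1, p2, a, a2, b, b2):
    """Log of the product of conditional ratios around the rectangle
    {a, a2} x {b, b2}. It is zero whenever p1 and p2 come from a single
    positive joint; a nonzero value certifies incompatibility."""
    return (math.log(p1[a][b]) - math.log(p1[a2][b])
            + math.log(p2[a2][b]) - math.log(p2[a2][b2])
            + math.log(p1[a2][b2]) - math.log(p1[a][b2])
            + math.log(p2[a][b2]) - math.log(p2[a][b]))

# Usage: run the chain at a low temperature on a toy sequence.
rng = random.Random(0)
seq = [rng.randrange(5) for _ in range(8)]
for _ in range(200):
    seq = glauber_step(seq, toy_logits, temperature=0.5, rng=rng)
```

Each step changes at most one token, which is what makes the process a single-site Glauber chain. Lowering `temperature` sharpens every conditional and makes the chain less likely to leave a locally favored configuration, the regime in which the abstract reports slow escape.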