🤖 AI Summary
This work investigates how minimal attention-only transformers can perform Bayesian inference and denoising when every token is corrupted. It interprets a single attention step as a kernel-weighted posterior mean estimator with respect to the empirical distribution of the context tokens, and describes how depth progressively refines this empirical distribution through particle dynamics (Stage 1). A long-range skip connection then carries the noisy input forward as a query for posterior inference (Stage 2), giving a two-stage inference process. The analysis shows that a fixed kernel bandwidth and a finite integration horizon suffice for effective denoising without an explicit noise schedule, and that, for a class of well-behaved priors, the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. The study clarifies the distinct statistical roles of network depth and attention residuals and provides a guarantee for posterior-mean recovery.
📝 Abstract
We study minimal attention-only transformers under all-token corruption and show they admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The framework isolates a minimal setting in which the context itself induces a depth-dependent energy landscape governing in-context inference. We show that effective denoising can emerge without an explicit noise schedule: a fixed kernel bandwidth and finite integration horizon suffice, yielding a principled depth-noise relationship. We further establish a posterior-mean recovery guarantee for a class of well-behaved priors, where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Connecting these dynamics to reverse-diffusion limits, our results provide a statistical interpretation of attention as in-context inference via sample-based posterior estimation, without explicit density modeling.
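To make the two-stage reading concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a Gaussian kernel, a simple explicit (Euler-style) particle update resembling mean shift, and that each particle attends to the whole context including itself; the abstract does not specify any of these. The function names and the `step_size` and `depth` parameters are hypothetical and chosen only for illustration.

```python
import math


def kernel_posterior_mean(query, particles, bandwidth):
    """Softmax-weighted average of the particles.

    Under the assumption of a Gaussian kernel, this equals the posterior
    mean E[x | query] when the prior is the empirical distribution of
    `particles` and the noise is isotropic with scale `bandwidth`.
    """
    # Attention logits: negative squared distance over 2 * bandwidth^2.
    logits = [
        -sum((q - p) ** 2 for q, p in zip(query, x)) / (2.0 * bandwidth ** 2)
        for x in particles
    ]
    # Subtract the maximum logit for numerical stability before exponentiating.
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [
        sum(w * x[d] for w, x in zip(weights, particles)) / total
        for d in range(len(query))
    ]


def two_stage_denoise(noisy_tokens, query_index, bandwidth, depth, step_size):
    """Hypothetical two-stage sketch with a fixed bandwidth at every layer."""
    # Stage 1 (assumed form): each layer moves every particle a fraction
    # `step_size` of the way toward its kernel-weighted mean.
    particles = [list(t) for t in noisy_tokens]
    for _ in range(depth):
        targets = [
            kernel_posterior_mean(p, particles, bandwidth) for p in particles
        ]
        particles = [
            [pi + step_size * (ti - pi) for pi, ti in zip(p, t)]
            for p, t in zip(particles, targets)
        ]
    # Stage 2: the skip connection supplies the original noisy token as the
    # query, which is scored against the refined particles.
    return kernel_posterior_mean(
        noisy_tokens[query_index], particles, bandwidth
    )
```

In this sketch the bandwidth stays fixed across layers and `depth * step_size` plays the role of a finite integration horizon, which loosely mirrors the claim that no explicit noise schedule is needed. With `depth=0` the function reduces to a single attention step over the raw context.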