From Kernels to Attention: A Transformer Framework for Density and Score Estimation

📅 2025-11-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work introduces the first unified attention-based framework for jointly estimating probability density functions and their scores (gradients) directly from i.i.d. samples—addressing the limitations of conventional methods that require distribution-specific training and exhibit poor generalization. Methodologically, density and score estimation are formulated as a permutation- and affine-equivariant sequence-to-sequence task, implemented via a Transformer architecture incorporating cross-attention and built-in symmetry constraints. Crucially, we propose the first distribution-agnostic operator that establishes a theoretical bridge between kernel density estimation (KDE) and Transformers, proving that the model’s attention weights reduce to classical kernel estimators under appropriate conditions. Experiments demonstrate that our approach significantly outperforms both KDE and debiased score-based kernel methods in estimation accuracy, cross-distribution and cross-sample-size generalization, and time complexity—validating Transformers as effective universal nonparametric estimators.

📝 Abstract
We introduce a unified attention-based framework for joint score and density estimation. Framing the problem as a sequence-to-sequence task, we develop a permutation- and affine-equivariant transformer that estimates both the probability density $f(x)$ and its score $\nabla_x \log f(x)$ directly from i.i.d. samples. Unlike traditional score-matching methods that require training a separate model for each distribution, our approach learns a single distribution-agnostic operator that generalizes across densities and sample sizes. The architecture employs cross-attention to connect observed samples with arbitrary query points, enabling generalization beyond the training data, while built-in symmetry constraints ensure equivariance to permutations and affine transformations. Analytically, we show that the attention weights can recover classical kernel density estimation (KDE), and we verify this empirically, establishing a principled link between classical KDE and the transformer architecture. Empirically, the model achieves substantially lower error than KDE and score-debiased KDE (SD-KDE), while exhibiting better runtime scaling. Together, these results establish transformers as general-purpose, data-adaptive operators for nonparametric density and score estimation.
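The KDE–attention link claimed in the abstract can be illustrated concretely. The following is a minimal NumPy sketch, not the paper's architecture; the Gaussian kernel and the fixed bandwidth `h` are assumptions made for illustration. With attention scores $-\|q - x_i\|^2 / (2h^2)$, the per-query softmax produces exactly the normalized Gaussian-kernel weights of classical KDE, because the softmax normalization cancels the kernel's constant factor.

```python
import numpy as np

def gaussian_kde(queries, samples, h):
    """Classical Gaussian KDE: f_hat(q) = (1/n) * sum_i K_h(q - x_i)."""
    # Pairwise squared distances, shape (n_queries, n_samples).
    d2 = ((queries[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    d = samples.shape[1]
    K = np.exp(-d2 / (2 * h**2)) / ((2 * np.pi * h**2) ** (d / 2))
    return K.mean(axis=1)

def attention_weights(queries, samples, h):
    """Softmax attention with scores -||q - x_i||^2 / (2 h^2).

    These weights coincide with the normalized Gaussian-kernel
    (Nadaraya-Watson) weights: the kernel's constant factor cancels
    in the softmax normalization.
    """
    d2 = ((queries[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    scores = -d2 / (2 * h**2)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    return w / w.sum(axis=1, keepdims=True)
```

Each row of `attention_weights` sums to one and is proportional to the corresponding row of kernel evaluations, which is the sense in which attention "recovers" KDE mixture weights; the density value itself additionally needs the kernel's normalizing constant, as in `gaussian_kde`.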
Problem

Research questions and friction points this paper is trying to address.

Conventional density and score estimators require distribution-specific training and generalize poorly across densities and sample sizes
No single operator jointly estimates densities and their scores directly from i.i.d. samples
The relationship between classical kernel density estimation and transformer attention lacked a principled account
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer framework for joint density and score estimation
Cross-attention connects samples with arbitrary query points
Single distribution-agnostic operator generalizes across densities
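The score side of the joint estimation problem also admits a compact kernel-based sketch. This is again an illustration under an assumed Gaussian kernel and bandwidth `h`, not the paper's learned operator: the gradient of the log of a Gaussian KDE is an attention-weighted average of displacement vectors $(x_i - x)/h^2$, with softmax weights over scores $-\|x - x_i\|^2/(2h^2)$.

```python
import numpy as np

def kde_score(queries, samples, h):
    """Score of a Gaussian KDE: grad_x log f_hat(x).

    Since grad_x f_hat(x) = (1/n) * sum_i K_h(x - x_i) * (x_i - x) / h^2,
    dividing by f_hat(x) gives
        grad_x log f_hat(x) = sum_i w_i(x) * (x_i - x) / h^2,
    where w_i(x) are softmax weights over -||x - x_i||^2 / (2 h^2).
    """
    diff = samples[None, :, :] - queries[:, None, :]  # (n_q, n, d)
    d2 = (diff ** 2).sum(-1)
    scores = -d2 / (2 * h**2)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * diff).sum(axis=1) / h**2
```

The same attention-like weights thus serve both tasks: normalized, they give KDE mixture weights; contracted against the displacements, they give the KDE score, which is the classical baseline the paper's transformer is compared against.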