The Universal Normal Embedding

📅 2026-03-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the long-standing disconnect between generative models and visual encoders, which have evolved largely in isolation without a unified understanding of their shared latent structural properties. The paper proposes the Universal Normal Embedding (UNE) hypothesis, positing that the noise obtained via DDIM inversion and embeddings from vision encoders such as CLIP or DINO are linear projections within the same approximately Gaussian latent space. Leveraging the NoiseZoo dataset, the study employs linear probing, orthogonal disentanglement, and Gaussian modeling to reveal, for the first time, the shared latent structure underlying both representations. On CelebA, this framework enables high-accuracy attribute prediction and high-quality controllable editing (e.g., expression, gender, age) without any architectural modifications, thereby validating the semantic consistency and editing efficacy implied by the UNE hypothesis.
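The linear probing idea in the summary can be sketched in a few lines. The snippet below uses the simplest possible linear probe direction, the difference of class means, to illustrate how an attribute could be read off embeddings along a single linear direction; the toy vectors and function names are illustrative, not from the paper or the NoiseZoo dataset.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def attribute_direction(pos, neg):
    """Difference-of-means direction separating embeddings that have an
    attribute (pos) from those that lack it (neg) -- the simplest form
    of a linear probe direction in a latent space."""
    mp, mn = mean(pos), mean(neg)
    return [a - b for a, b in zip(mp, mn)]

def predict(x, direction, threshold=0.0):
    """Predict the attribute by projecting x onto the probe direction."""
    return sum(a * b for a, b in zip(x, direction)) > threshold

# Toy 2-D "embeddings": attribute present vs. absent.
pos = [[2.0, 1.0], [3.0, 1.0]]
neg = [[-2.0, 1.0], [-3.0, 1.0]]
d = attribute_direction(pos, neg)
print(predict([2.5, 1.0], d))   # True  -- lies on the positive side
print(predict([-2.5, 1.0], d))  # False -- lies on the negative side
```

Under the UNE hypothesis, the same kind of direction can be fit in either the encoder embedding space or the DDIM-inverted noise space and yields aligned predictions.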

📝 Abstract
Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
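The "simple orthogonalization" step the abstract credits with mitigating spurious entanglements can be sketched as a Gram-Schmidt-style projection: remove from an edit direction its component along each confound direction. This is a minimal sketch under that assumption; the toy "smile"/"gender" vectors are hypothetical, not the paper's actual probe directions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(direction, confounds):
    """Project out each confound direction from an edit direction so
    that moving along the result does not move along the confounds
    (Gram-Schmidt-style deflation)."""
    d = list(direction)
    for c in confounds:
        norm_sq = dot(c, c)
        if norm_sq == 0:
            continue  # skip degenerate (zero) confound directions
        coef = dot(d, c) / norm_sq
        d = [di - coef * ci for di, ci in zip(d, c)]
    return d

# Toy example: a "smile" edit direction entangled with a "gender" axis.
smile = [1.0, 0.5, 0.0]
gender = [0.0, 1.0, 0.0]
clean = orthogonalize(smile, [gender])
print(clean)  # [1.0, 0.0, 0.0] -- the gender component is removed
```

Editing then amounts to adding a scaled copy of the cleaned direction to a latent before decoding, so the confounded attributes stay fixed.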
Problem

Research questions and friction points this paper is trying to address.

Universal Normal Embedding
latent space
generative models
vision encoders
Gaussianity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Normal Embedding
Gaussian latent space
DDIM inversion
controllable generation
linear semantic directions