HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Text-to-image diffusion models are hard to control precisely through the text condition alone: edits often introduce unintended artifacts because existing methods ignore the geometric structure of the embedding space. This work proposes HEART, a framework built on the finding that text embeddings lie on a hypersphere, where concepts form anisotropic distributions that are better modeled by Kent distributions than by linear directions. Building on this insight, the authors introduce a geometry-aware editing method that requires no fine-tuning, inversion, or optimization. By applying Kent-aware geodesic transformations on the hypersphere, the approach supports consistent subject replacement and fine-grained attribute manipulation while preserving the original scene. It avoids the linear assumptions of conventional methods and generalizes across diffusion model architectures.
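For reference, the classical Kent (five-parameter Fisher-Bingham) density on the unit sphere $S^2$ is given below. This is the textbook three-dimensional form; how HEART parameterizes or generalizes it to the high-dimensional text-embedding space is not stated in this summary, so the formula should be read only as background for the term.

```latex
f(\mathbf{x}) = \frac{1}{c(\kappa, \beta)}
  \exp\!\left\{ \kappa\, \boldsymbol{\gamma}_1^{\top}\mathbf{x}
  + \beta \left[ (\boldsymbol{\gamma}_2^{\top}\mathbf{x})^2
  - (\boldsymbol{\gamma}_3^{\top}\mathbf{x})^2 \right] \right\},
  \qquad \lVert \mathbf{x} \rVert = 1
```

Here $\boldsymbol{\gamma}_1$ is the mean direction, $\boldsymbol{\gamma}_2$ and $\boldsymbol{\gamma}_3$ are the major and minor axes (the three vectors are orthonormal), $\kappa > 0$ controls concentration, $0 \le \beta < \kappa/2$ controls ovalness, and $c(\kappa, \beta)$ is the normalizing constant. With $\beta = 0$ the density reduces to the isotropic von Mises-Fisher distribution, so the $\beta$ term is what captures anisotropy.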

📝 Abstract

Text-to-image diffusion models can generate visually stunning images, yet controlling what appears and how it appears remains surprisingly difficult, especially when operating solely within the text-conditioning space. For example, changing a subject or adjusting an attribute often leads to unintended side effects, such as altered backgrounds or distorted details. This is because most existing text-based control methods treat the embedding space as Euclidean and apply simple linear transformations, which do not reflect how semantic concepts are actually organized. In this work, we take a step back and ask: what is the true geometry of these embeddings? We find that text encoder representations lie on a hypersphere, where concepts are not linear directions but structured, anisotropic distributions better captured by Kent distributions. Building on this insight, we propose HEART, a training-free framework that performs Kent-aware geodesic transformations directly on the hypersphere. By respecting the underlying geometry, HEART enables intuitive and precise edits, such as consistent subject replacement and fine-grained attribute control, while preserving the original scene. Importantly, HEART requires no fine-tuning, inversion, or optimization, and generalizes across diffusion model architectures. Our results show that a simple shift in perspective, from linear to spherical, can unlock fast and controllable image generation.
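To make the linear-versus-spherical contrast concrete, the following minimal sketch compares straight-line interpolation with geodesic (great-circle) interpolation between two unit vectors. It is plain spherical linear interpolation, not HEART's Kent-aware transformation, whose details are not given in this abstract. The function names, the random stand-in embeddings, and the dimension 768 are all assumptions chosen for illustration.

```python
import numpy as np

def normalize(v, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    return v / max(np.linalg.norm(v), eps)

def lerp(a, b, t):
    """Euclidean (linear) interpolation; the result generally leaves the sphere."""
    return (1.0 - t) * a + t * b

def slerp(a, b, t):
    """Geodesic interpolation along the great circle from a to b.

    Assumes a and b are not antipodal (the geodesic is not unique there).
    """
    a, b = normalize(a), normalize(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between a and b
    if omega < 1e-6:
        # Nearly identical directions: fall back to a renormalized linear blend.
        return normalize(lerp(a, b, t))
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Hypothetical stand-ins for a source and a target concept embedding.
rng = np.random.default_rng(0)
src = normalize(rng.standard_normal(768))
tgt = normalize(rng.standard_normal(768))

mid_lin = lerp(src, tgt, 0.5)
mid_geo = slerp(src, tgt, 0.5)

print("norm of linear midpoint:  ", np.linalg.norm(mid_lin))
print("norm of geodesic midpoint:", np.linalg.norm(mid_geo))
```

The linear midpoint falls inside the sphere, so its norm drops below 1, while the geodesic midpoint keeps unit norm. That is one simple way to see why an edit that assumes Euclidean structure can move an embedding away from the region where the encoder's outputs actually live.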

Problem

Research questions and friction points this paper is trying to address.

text-to-image generation

embedding geometry

semantic control

diffusion models

hyperspherical representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperspherical embedding

Kent distribution

geodesic transformation