🤖 AI Summary
Textual Inversion (TI) suffers from semantic distortion and subject inconsistency under complex prompts due to embedding norm inflation. We observe that CLIP token embeddings encode semantics primarily in their direction, while their magnitude is largely redundant and, when inflated, interferes with the conditional control of pre-norm Transformers. To address this, we propose **Directional TI**, which constrains learned token embeddings to the unit hypersphere—optimizing only direction while fixing magnitude. We employ Riemannian stochastic gradient descent with a von Mises–Fisher prior to ensure stable training on the sphere. Notably, our method enables, for the first time, semantics-preserving spherical linear interpolation (slerp) between TI embeddings. Experiments demonstrate significant improvements in text fidelity, subject similarity, and interpolation coherence under complex prompts, alongside enhanced training stability and reduced hyperparameter sensitivity.
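The summary describes direction-only optimization on the unit hypersphere with a von Mises–Fisher MAP prior. A minimal numpy sketch of one such update (not the authors' code; `mu`, `kappa`, and `lr` are illustrative names) projects the Euclidean gradient onto the tangent space at the current direction, takes a step, and retracts back to the sphere by renormalizing:

```python
import numpy as np

def riemannian_sgd_step(v, euclid_grad, mu, kappa=1.0, lr=0.01):
    """One direction-only update on the unit sphere (sketch, not the paper's code).

    v           : current unit-norm token embedding direction
    euclid_grad : Euclidean gradient of the task loss w.r.t. v
    mu, kappa   : vMF prior mean direction and concentration; the vMF
                  log-prior is proportional to kappa * mu @ v, so its
                  contribution to the MAP gradient is the constant -kappa * mu
    """
    # Total Euclidean gradient: task loss plus negative vMF log-prior
    g = euclid_grad - kappa * mu
    # Project onto the tangent space at v (remove the radial component)
    g_tan = g - np.dot(g, v) * v
    # Gradient step, then retraction (renormalization) back onto the sphere
    v_new = v - lr * g_tan
    return v_new / np.linalg.norm(v_new)
```

Because the prior gradient is a constant direction, it adds negligible cost per step, which matches the abstract's claim that the prior is "simple and efficient to incorporate."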
📝 Abstract
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP estimation with a von Mises–Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and its variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
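The abstract highlights slerp between learned concepts as a capability unlocked by the hyperspherical parameterization. For reference, standard spherical linear interpolation between two unit-norm embedding directions (generic formula, not DTI-specific code) traverses the great-circle arc, so every interpolate remains a valid unit direction:

```python
import numpy as np

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v, t in [0, 1].

    Unlike linear interpolation, every interpolate stays on the unit
    hypersphere, which is why it suits direction-only embeddings.
    """
    cos_omega = np.clip(np.dot(u, v), -1.0, 1.0)
    omega = np.arccos(cos_omega)          # angle between the two directions
    if omega < 1e-8:                      # nearly identical: return u as-is
        return u.copy()
    s = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / s) * u + (np.sin(t * omega) / s) * v
```

Linear interpolation of two TI embeddings with inflated, mismatched norms would pass through out-of-distribution magnitudes; slerp over unit directions avoids that failure mode, which is the interpolation-coherence claim made above.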