🤖 AI Summary
Textual Inversion (TI) suffers from semantic distortion and subject inconsistency under complex prompts due to embedding norm inflation. We observe that CLIP token embeddings encode semantics primarily in their direction, while their magnitude is largely redundant and, when inflated, interferes with the conditional control of pre-norm Transformers. To address this, we propose **Directional TI**, which constrains learned token embeddings to the unit hypersphere—optimizing only direction while fixing magnitude. We employ Riemannian stochastic gradient descent with a von Mises–Fisher prior to ensure stable training on the sphere. Notably, our method enables, for the first time, semantics-preserving spherical linear interpolation (slerp) between TI embeddings. Experiments demonstrate significant improvements in text fidelity, subject similarity, and interpolation coherence under complex prompts, alongside enhanced training stability and reduced hyperparameter sensitivity.
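The summary describes direction-only optimization on the unit hypersphere with a von Mises–Fisher MAP prior. A minimal numpy sketch of one such update (not the authors' code; `mu`, `kappa`, and `lr` are illustrative names) projects the Euclidean gradient onto the tangent space at the current direction, takes a step, and retracts back to the sphere by renormalizing:

```python
import numpy as np

def riemannian_sgd_step(v, euclid_grad, mu, kappa=1.0, lr=0.01):
    """One direction-only update on the unit sphere (sketch, not the paper's code).

    v           : current unit-norm token embedding direction
    euclid_grad : Euclidean gradient of the task loss w.r.t. v
    mu, kappa   : vMF prior mean direction and concentration; the vMF
                  log-prior is proportional to kappa * mu @ v, so its
                  contribution to the MAP gradient is the constant -kappa * mu
    """
    # Total Euclidean gradient: task loss plus negative vMF log-prior
    g = euclid_grad - kappa * mu
    # Project onto the tangent space at v (remove the radial component)
    g_tan = g - np.dot(g, v) * v
    # Gradient step, then retraction (renormalization) back onto the sphere
    v_new = v - lr * g_tan
    return v_new / np.linalg.norm(v_new)
```

Because the prior gradient is a constant direction, it adds negligible cost per step, which matches the abstract's claim that the prior is "simple and efficient to incorporate."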
📝 Abstract
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP estimation with a von Mises–Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and its variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
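The abstract highlights slerp between learned concepts as a capability unlocked by the hyperspherical parameterization. For reference, standard spherical linear interpolation between two unit-norm embedding directions (generic formula, not DTI-specific code) traverses the great-circle arc, so every interpolate remains a valid unit direction:

```python
import numpy as np

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v, t in [0, 1].

    Unlike linear interpolation, every interpolate stays on the unit
    hypersphere, which is why it suits direction-only embeddings.
    """
    cos_omega = np.clip(np.dot(u, v), -1.0, 1.0)
    omega = np.arccos(cos_omega)          # angle between the two directions
    if omega < 1e-8:                      # nearly identical: return u as-is
        return u.copy()
    s = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / s) * u + (np.sin(t * omega) / s) * v
```

Linear interpolation of two TI embeddings with inflated, mismatched norms would pass through out-of-distribution magnitudes; slerp over unit directions avoids that failure mode, which is the interpolation-coherence claim made above.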