The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a training-free method for continuous text-conditioned image editing. To address the limitations of existing training-free approaches in semantic consistency and editing continuity, the method uses a large language model to automatically generate debiased contrastive prompt pairs, constructs a semantics-guided direction vector in the text embedding space, and employs an elastic-range search strategy to enable smooth, controllable editing across modalities (image and video). Relying solely on text embedding interpolation, with no additional training or manual intervention, the approach significantly outperforms current training-free methods in editing quality, continuity, and semantic fidelity, matching the performance of training-based techniques. It also introduces a novel metric to quantitatively assess editing continuity.

📝 Abstract
We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no visible edit) and over-steering (unintended changes to other attributes). Adding scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. Comparing continuous editing behavior across methods, we find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives and outperforms other training-free methods.
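The core mechanism described in the abstract can be sketched in a few lines: embed contrastive prompt pairs, average their embedding differences into a unit steering direction, and add scaled versions of that direction to the prompt embedding. This is a minimal illustration, not the paper's implementation; the `toy_text_encoder` below is a deterministic stand-in for the generator's real text encoder (e.g., CLIP or T5), and the prompt pairs and dimensionality are hypothetical.

```python
import numpy as np

def toy_text_encoder(prompt: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a real text encoder: a deterministic pseudo-embedding
    # seeded by the prompt's characters, for illustration only.
    rng = np.random.default_rng(sum(ord(c) for c in prompt))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def steering_vector(pairs) -> np.ndarray:
    # Mean difference of (positive, negative) prompt embeddings,
    # normalized to unit length: the semantic direction in text space.
    diffs = [toy_text_encoder(p) - toy_text_encoder(n) for p, n in pairs]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

def steer(prompt: str, direction: np.ndarray, alpha: float) -> np.ndarray:
    # Add the scaled direction to the prompt embedding; sweeping alpha
    # inside an effective interval yields a continuous edit trajectory.
    return toy_text_encoder(prompt) + alpha * direction

# Hypothetical debiased contrastive pairs for a "photorealism" axis.
pairs = [("a photorealistic portrait", "a cartoon portrait"),
         ("a photorealistic landscape", "a cartoon landscape")]
d = steering_vector(pairs)
edits = [steer("a portrait of a woman", d, a) for a in np.linspace(0.0, 1.0, 5)]
```

Because the direction has unit norm, the projection of each steered embedding onto `d` increases linearly with the steering magnitude, which is the property the elastic range search exploits when picking an interval that avoids both under- and over-steering.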
Problem

Research questions and friction points this paper is trying to address.

continuous image editing
text-conditioned generation
steering vector
semantic control
training-free editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

text embedding interpolation
training-free editing
continuous image steering
semantic direction vector
elastic range search