🤖 AI Summary
Diffusion models excel at multimodal generation, yet fine-grained, cross-modal, and training-free controllable editing remains challenging. This paper introduces FreeSliders, a tuning-free "concept slider" framework: it extracts semantic directions via text-based contrasts and enables continuous, inference-time editing of images, audio, and video through partial denoising, preserving unrelated content. The authors further propose automatic saturation-point detection and nonlinear reparameterization to make edits perceptually uniform and semantically consistent, and they construct the first fine-grained, tri-modal control benchmark, defining and quantifying three evaluation metrics: editing fidelity, directional alignment, and perceptual smoothness. Experiments demonstrate that the method is plug-and-play, computationally efficient, and robust across modalities, outperforming existing baselines in both qualitative and quantitative assessments.
📝 Abstract
Diffusion models have become state-of-the-art generative models for images, audio, and video, yet enabling fine-grained controllable generation, i.e., continuously steering specific concepts without disturbing unrelated content, remains challenging. Concept Sliders (CS) offer a promising direction by discovering semantic directions through textual contrasts, but they require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. In this work we introduce FreeSliders, a simple yet effective approach that is fully training-free and modality-agnostic, achieved by partially estimating the CS formula during inference. To support modality-agnostic evaluation, we extend the CS benchmark to include both video and audio, establishing the first benchmark suite for fine-grained concept control in generation across multiple modalities. We further propose three evaluation properties, along with new metrics, to improve evaluation quality. Finally, we identify the open problem of scale selection and non-linear traversal, and introduce a two-stage procedure that automatically detects saturation points and reparameterizes the traversal for perceptually uniform, semantically meaningful edits. Extensive experiments demonstrate that our method enables plug-and-play, training-free concept control across modalities, improves over existing baselines, and establishes new tools for principled controllable generation. An interactive presentation of our benchmark and method is available at: https://azencot-group.github.io/FreeSliders/
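The two mechanisms the abstract describes, a semantic edit direction obtained from a textual contrast that is applied at inference time, and a scale-selection step that detects saturation and reparameterizes the slider for perceptually uniform traversal, can be illustrated schematically. The sketch below is not the authors' implementation; the function names (`edit_direction`, `steer`, `find_saturation`, `reparameterize`) and the plain NumPy setting are illustrative assumptions standing in for a real diffusion pipeline.

```python
import numpy as np

# Hypothetical sketch, not the paper's code: all names and shapes are assumptions.

def edit_direction(pos_emb: np.ndarray, neg_emb: np.ndarray) -> np.ndarray:
    """Semantic direction from a textual contrast (e.g. "smiling" vs "frowning"
    prompt embeddings), normalized to unit length."""
    d = pos_emb - neg_emb
    return d / (np.linalg.norm(d) + 1e-8)

def steer(noise_pred: np.ndarray, direction: np.ndarray, scale: float) -> np.ndarray:
    """Shift the model's noise estimate along the concept direction at one
    denoising step; `scale` is the slider value."""
    return noise_pred + scale * direction

def find_saturation(scales: np.ndarray, effect: np.ndarray, tol: float = 1e-3) -> float:
    """Return the first scale at which the marginal edit effect (slope of the
    effect curve) drops below `tol`, i.e. where the edit visibly plateaus."""
    gains = np.diff(effect) / np.diff(scales)
    below = np.where(gains < tol)[0]
    return float(scales[below[0]]) if below.size else float(scales[-1])

def reparameterize(scales: np.ndarray, effect: np.ndarray, n: int = 5) -> np.ndarray:
    """Resample the slider so equal steps produce equal increments of measured
    effect: invert the (monotone) effect curve at evenly spaced targets."""
    targets = np.linspace(effect[0], effect[-1], n)
    return np.interp(targets, effect, scales)
```

On a synthetic saturating effect curve such as `1 - exp(-scale)`, `find_saturation` locates the plateau onset and `reparameterize` returns slider stops that are dense where the effect changes quickly and sparse near saturation, which is the qualitative behavior the two-stage procedure aims for.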