FreeSliders: Training-Free, Modality-Agnostic Concept Sliders for Fine-Grained Diffusion Control in Images, Audio, and Video

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion models excel at multimodal generation, yet fine-grained, cross-modal, and training-free controllable editing remains challenging. This paper introduces a universal, tuning-free "concept slider" framework: it extracts semantic directions from contrasting text prompts and enables continuous, inference-time editing of images, audio, and video through partial denoising, preserving unrelated content. We further propose automatic saturation-point detection and nonlinear reparameterization to make edits perceptually uniform and semantically consistent, and we construct the first fine-grained, tri-modal control benchmark, defining and quantifying three evaluation metrics: editing fidelity, directional alignment, and perceptual smoothness. Experiments demonstrate that the method is plug-and-play, computationally efficient, and robust across modalities, outperforming existing baselines in both qualitative and quantitative assessments.

📝 Abstract
Diffusion models have become state-of-the-art generative models for images, audio, and video, yet enabling fine-grained controllable generation, i.e., continuously steering specific concepts without disturbing unrelated content, remains challenging. Concept Sliders (CS) offer a promising direction by discovering semantic directions through textual contrasts, but they require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. In this work we introduce FreeSliders, a simple yet effective approach that is fully training-free and modality-agnostic, achieved by partially estimating the CS formula during inference. To support modality-agnostic evaluation, we extend the CS benchmark to include both video and audio, establishing the first suite for fine-grained concept generation control with multiple modalities. We further propose three evaluation properties along with new metrics to improve evaluation quality. Finally, we identify an open problem of scale selection and non-linear traversals and introduce a two-stage procedure that automatically detects saturation points and reparameterizes traversal for perceptually uniform, semantically meaningful edits. Extensive experiments demonstrate that our method enables plug-and-play, training-free concept control across modalities, improves over existing baselines, and establishes new tools for principled controllable generation. An interactive presentation of our benchmark and method is available at: https://azencot-group.github.io/FreeSliders/
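The abstract describes steering a concept by "partially estimating the CS formula during inference," i.e. deriving the semantic direction from contrasting text prompts at denoising time rather than training a per-concept LoRA. The paper does not spell out the exact formula here, so the following is a minimal, hypothetical sketch in the style of classifier-free guidance: the slider direction is taken as half the difference between the noise predictions for the positive and negative prompts, and a continuous `scale` moves the sample along it. All names (`eps_*`, `slider_guided_eps`) are illustrative, not the authors' API, and plain Python lists stand in for tensors.

```python
def slider_guided_eps(eps_uncond, eps_pos, eps_neg, scale):
    """Hypothetical per-step slider guidance (assumption, not the
    paper's exact formula).

    eps_uncond : noise prediction for the base/unconditional prompt
    eps_pos    : noise prediction for the positive attribute prompt
    eps_neg    : noise prediction for the negative attribute prompt
    scale      : continuous slider strength (sign flips the direction)
    """
    # Semantic direction estimated from the textual contrast.
    direction = [(p - n) / 2.0 for p, n in zip(eps_pos, eps_neg)]
    # Shift the base prediction along the direction by `scale`.
    return [e + scale * d for e, d in zip(eps_uncond, direction)]


# Toy usage with 3-dimensional "noise" vectors:
out = slider_guided_eps(
    eps_uncond=[0.1, 0.0, -0.2],
    eps_pos=[1.0, 0.5, 0.0],
    eps_neg=[-1.0, -0.5, 0.0],
    scale=0.5,
)
```

At `scale = 0` the edit vanishes and the base prediction is returned unchanged, which is what makes the control continuous and content-preserving.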
Problem

Research questions and friction points this paper is trying to address.

Achieving fine-grained control in diffusion models without training
Enabling modality-agnostic concept editing across images, audio, video
Automating scale selection for perceptually uniform semantic edits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free concept control across modalities
Partial estimation of concept sliders during inference
Automatic saturation detection for uniform semantic edits
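The abstract's two-stage procedure (detect the saturation point of a slider, then reparameterize the traversal so equal steps produce perceptually equal edits) can be sketched as follows. This is an assumed implementation, not the paper's: it supposes the edit response is measured at a grid of scales by some monotone score (e.g. an embedding-similarity change), detects saturation as the point where the marginal gain drops below a tolerance, and inverts the response curve by linear interpolation. Function names and the tolerance heuristic are hypothetical.

```python
import bisect

def detect_saturation(scales, responses, tol=1e-3):
    """Return the scale beyond which the measured edit response stops
    increasing meaningfully (marginal gain below `tol`).
    Assumes `responses[i]` is the edit strength measured at `scales[i]`."""
    for i in range(1, len(responses)):
        if responses[i] - responses[i - 1] < tol:
            return scales[i - 1]
    return scales[-1]

def uniform_traversal(scales, responses, n_steps):
    """Pick `n_steps` slider scales whose responses are equally spaced,
    by inverting the (assumed monotone) response curve with linear
    interpolation -- a nonlinear reparameterization of the raw scale."""
    lo, hi = responses[0], responses[-1]
    targets = [lo + (hi - lo) * k / (n_steps - 1) for k in range(n_steps)]
    out = []
    for t in targets:
        j = bisect.bisect_left(responses, t)
        j = min(max(j, 1), len(responses) - 1)      # clamp to a valid segment
        r0, r1 = responses[j - 1], responses[j]
        s0, s1 = scales[j - 1], scales[j]
        w = 0.0 if r1 == r0 else (t - r0) / (r1 - r0)
        out.append(s0 + w * (s1 - s0))              # interpolated scale
    return out
```

For a response curve that rises quickly and then flattens, `uniform_traversal` spends most of its steps in the steep early region, so successive edits look evenly spaced to a viewer even though the raw scales are not.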