SpotDiff: Spotting and Disentangling Interference in Feature Space for Subject-Preserving Image Generation

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Personalized image generation faces a fundamental trade-off between identity fidelity and controllable editing: optimization-based methods achieve high fidelity but are computationally inefficient, whereas learning-based approaches are efficient yet vulnerable to entanglement with pose, background, and other confounding factors. This work proposes SpotDiff, which explicitly disentangles subject identity from pose and background interference in the CLIP feature space via orthogonality constraints, using specialized expert networks to model interference-driven variation separately from identity. Trained on the newly constructed SpotDiff10k dataset of only 10,000 samples, SpotDiff achieves more robust subject preservation and more accurate text-guided editing than prior methods. Its core contribution is a lightweight, interpretable interference-disentanglement mechanism that combines high fidelity, high efficiency, and strong edit controllability.

📝 Abstract
Personalized image generation aims to faithfully preserve a reference subject's identity while adapting to diverse text prompts. Existing optimization-based methods ensure high fidelity but are computationally expensive, while learning-based approaches offer efficiency at the cost of entangled representations influenced by nuisance factors. We introduce SpotDiff, a novel learning-based method that extracts subject-specific features by spotting and disentangling interference. Leveraging a pre-trained CLIP image encoder and specialized expert networks for pose and background, SpotDiff isolates subject identity through orthogonality constraints in the feature space. To enable principled training, we introduce SpotDiff10k, a curated dataset with consistent pose and background variations. Experiments demonstrate that SpotDiff achieves more robust subject preservation and controllable editing than prior methods, while attaining competitive performance with only 10k training samples.
Problem

Research questions and friction points this paper is trying to address.

Disentangling interference in feature space for image generation
Preserving subject identity across diverse text prompts
Achieving robust subject preservation with limited training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spots and disentangles interference in feature space
Uses orthogonality constraints to isolate subject identity
Leverages CLIP encoder with specialized expert networks
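The paper itself does not spell out the loss in this summary, but the orthogonality constraint described above is commonly imposed by penalizing the squared cosine similarity between the subject-identity feature and each nuisance (pose/background) expert feature. A minimal sketch under that assumption, with `orthogonality_loss` as a hypothetical name and plain NumPy vectors standing in for CLIP/expert embeddings:

```python
import numpy as np

def orthogonality_loss(identity_feat, nuisance_feats):
    """Sum of squared cosine similarities between an identity feature and
    each nuisance feature (e.g. pose, background). The loss is zero exactly
    when the identity direction is orthogonal to every nuisance direction,
    which is one way to realize the feature-space disentanglement above.
    NOTE: an illustrative sketch, not the paper's actual objective.
    """
    f = identity_feat / np.linalg.norm(identity_feat)
    loss = 0.0
    for n in nuisance_feats:
        n_hat = n / np.linalg.norm(n)
        loss += float(np.dot(f, n_hat)) ** 2  # cos^2 between the two directions
    return loss

# Orthogonal identity and pose directions incur no penalty;
# a fully entangled (parallel) pair is maximally penalized.
identity = np.array([1.0, 0.0])
pose = np.array([0.0, 2.0])        # orthogonal -> loss 0
background = np.array([3.0, 0.0])  # parallel   -> loss 1
```

In practice such a term would be added to the diffusion training objective with a weighting coefficient, so that the encoder is pushed to route pose and background information into the expert branches rather than the identity feature.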
Authors: Yongzhi Li, Saining Zhang, Yibing Chen, Boying Li, Yanxin Zhang, Xiaoyu Du
Affiliation: College of Computing and Data Science, Nanyang Technological University
Field: Computer Vision