🤖 AI Summary
To address the insufficient subject fidelity and text-image alignment in zero-shot subject-driven image generation, this paper proposes a negative-sample-guided contrastive learning framework. The method operates entirely on pre-trained diffusion models without fine-tuning or additional supervision. Its core contributions are: (1) a Conditionally Degraded Negative Sampling (CDNS) strategy that automatically generates semantically relevant yet identity-mismatched negative samples—without manual annotation; and (2) a dynamic timestep reweighting mechanism that strengthens the diffusion model’s capacity to model subject-specific features during critical detail-generation timesteps. Evaluated across multiple subject-driven benchmarks, the approach achieves significant improvements in identity fidelity (+12.7%) and text-image alignment (+9.3%), marking the first zero-shot method to jointly optimize both metrics.
📝 Abstract
We present Subject Fidelity Optimization (SFO), a novel comparative learning framework for zero-shot subject-driven generation that enhances subject fidelity. Beyond supervised fine-tuning methods that rely only on positive targets and use the diffusion loss as in the pre-training stage, SFO introduces synthetic negative targets and explicitly guides the model to favor positives over negatives through pairwise comparison. For negative targets, we propose Condition-Degradation Negative Sampling (CDNS), which automatically generates distinctive and informative negatives by intentionally degrading visual and textual cues without expensive human annotations. Moreover, we reweight the diffusion timesteps to focus finetuning on intermediate steps where subject details emerge. Extensive experiments demonstrate that SFO with CDNS significantly outperforms baselines in terms of both subject fidelity and text alignment on a subject-driven generation benchmark. Project page: https://subjectfidelityoptimization.github.io/