🤖 AI Summary
Text-to-image (T2I) models frequently suffer from attribute leakage, identity entanglement, and subject omission on multi-subject prompts, which degrades generation fidelity. To address this, we view flow matching through stochastic optimal control and formulate subject disentanglement as control over a trained sampler, giving the first theoretical framework with a principled, optimizable objective for multi-subject fidelity. The same formulation unifies prior attention heuristics and extends to diffusion models. It yields two architecture-agnostic algorithms: a training-free test-time controller that perturbs the base velocity with a single-pass update, and Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. Evaluated on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and controllers fine-tuned on limited prompts generalize to unseen prompts. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
📝 Abstract
Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
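To make the idea of a test-time controller that perturbs the base velocity more concrete, the following is a minimal toy sketch, not the paper's algorithm. It assumes a hypothetical 2-D velocity field (`base_velocity`), a hypothetical quadratic terminal cost with gradient `cost_grad`, and a hypothetical `strength` parameter. At each Euler step, the sampler extrapolates the current state to the end of the trajectory, evaluates the cost gradient once, and adds a correction to the velocity.

```python
import numpy as np

def base_velocity(x, t):
    """Toy stand-in for a trained flow-matching velocity field (assumption)."""
    return -x

def cost_grad(x1_hat, target):
    """Gradient of a hypothetical terminal cost g(x) = 0.5 * ||x - target||^2."""
    return x1_hat - target

def sample(x0, target, steps=50, strength=0.0):
    """Euler sampler whose velocity is perturbed by one cost gradient per step."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = base_velocity(x, t)
        # One-step extrapolation of the current state to t = 1.
        x1_hat = x + (1.0 - t) * v
        # Control term: steer against the cost gradient at the extrapolated endpoint.
        u = -strength * cost_grad(x1_hat, target)
        x = x + dt * (v + u)
    return x

def mean_cost(x, target):
    """Average terminal cost over the batch."""
    return 0.5 * np.sum((x - target) ** 2, axis=1).mean()

rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 2))   # initial noise samples
target = np.array([2.0, -1.0])       # hypothetical low-cost region

uncontrolled = sample(x0, target, strength=0.0)
controlled = sample(x0, target, strength=2.0)

base_cost = mean_cost(uncontrolled, target)
ctrl_cost = mean_cost(controlled, target)
print(f"mean cost without control: {base_cost:.3f}")
print(f"mean cost with control:    {ctrl_cost:.3f}")
```

In a real T2I setting, the cost would presumably be a multi-subject fidelity objective computed from model internals rather than a distance to a fixed point, and the velocity would come from the pretrained network. Setting `strength=0.0` recovers the unmodified sampler, which reflects the training-free nature of this kind of control.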