Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

📅 2025-10-02
🤖 AI Summary
Text-to-image (T2I) models often exhibit attribute leakage, identity entanglement, and subject omission on multi-subject prompts, which degrades generation fidelity. To address this, we view flow matching through stochastic optimal control and formulate subject disentanglement as control over a trained sampler, giving the first theoretical framework with a principled, optimizable objective for multi-subject fidelity. The same formulation unifies prior attention heuristics and extends to diffusion models. Two architecture-agnostic algorithms follow: a training-free test-time controller that perturbs the base velocity with a single-pass update, and Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. Evaluated on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and controllers fine-tuned on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.

📝 Abstract
Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
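
For orientation, a generic stochastic optimal control problem over a pretrained velocity field can be written as below. This is the textbook form, given here as an assumption about the general shape of such objectives and not as the paper's exact formulation: the symbols $v$ (base velocity), $u$ (control), $\sigma$ (noise schedule), and $g$ (terminal cost that would penalize multi-subject failures) are illustrative names.

```latex
\min_{u}\; \mathbb{E}\left[ \int_0^1 \tfrac{1}{2}\,\lVert u(X_t, t) \rVert^2 \, dt \;+\; g(X_1) \right]
\quad \text{subject to} \quad
dX_t = \bigl( v(X_t, t) + \sigma(t)\, u(X_t, t) \bigr)\, dt + \sigma(t)\, dB_t .
```

The quadratic term keeps the controlled sampler close to the base model, while the terminal cost steers samples toward the desired property.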
Problem

Research questions and friction points this paper is trying to address.

Addressing multi-subject fidelity issues in text-to-image generation
Preventing attribute leakage and identity entanglement in image synthesis
Developing control algorithms for subject disentanglement in flow-matching and diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

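Casts sampling from a trained flow-matching model as a stochastic optimal control problem
Develops a training-free test-time controller that perturbs the base velocity with a single-pass update
Introduces Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal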
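
The idea of perturbing a base velocity at test time can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `base_velocity` stands in for a trained flow-matching model, the quadratic cost behind `cost_gradient` stands in for a multi-subject fidelity objective, and `strength` is a hypothetical step-size parameter. It is not the paper's controller or the FOCUS objective.

```python
from typing import List

Vector = List[float]


def base_velocity(x: Vector, t: float) -> Vector:
    """Toy stand-in for a trained velocity field: drifts toward (1, 1)."""
    return [1.0 - xi for xi in x]


def cost_gradient(x: Vector, goal: Vector) -> Vector:
    """Gradient of the hypothetical cost 0.5 * ||x - goal||^2."""
    return [xi - gi for xi, gi in zip(x, goal)]


def sample(x0: Vector, goal: Vector, strength: float, steps: int = 50) -> Vector:
    """Euler-integrate from t=0 to t=1 with a perturbed velocity.

    At each step the base velocity is shifted by -strength * grad(cost),
    using one gradient evaluation per step. strength=0 recovers the
    uncontrolled base sampler.
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = base_velocity(x, t)
        g = cost_gradient(x, goal)
        x = [xi + dt * (vi - strength * gi) for xi, vi, gi in zip(x, v, g)]
    return x


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


if __name__ == "__main__":
    goal = [2.0, 0.0]
    free = sample([0.0, 0.0], goal, strength=0.0)
    steered = sample([0.0, 0.0], goal, strength=2.0)
    print("uncontrolled distance to goal:", distance(free, goal))
    print("controlled distance to goal:  ", distance(steered, goal))
```

In this toy setting the controlled trajectory ends closer to the goal than the uncontrolled one, while the base drift still shapes the path. A larger `strength` trades fidelity to the base dynamics for a lower cost, which mirrors the balance a control objective has to strike.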
Eric Tillmann Bill
ETH Zurich
Enis Simsar
ETH Zurich
Computer Vision
Thomas Hofmann
ETH Zurich