🤖 AI Summary
This work addresses a limitation of existing unified diffusion models, which typically support only a single conditioning input and struggle to flexibly integrate heterogeneous visual conditions. To overcome this, we propose a post-training framework that introduces multi-attribute tokens (representing style, content, subject, and identity) into a diffusion transformer, together with a novel selective interleaving mechanism. This enables compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Built on the Bagel unified backbone, our approach post-trains on 700K interleaved text–image sequences to learn efficient multi-attribute embedding and conditional fusion modules. Experiments demonstrate that our method significantly outperforms Bagel on compositional generation tasks, with notable gains in controllability, cross-condition consistency, and visual quality.
📝 Abstract
Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
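To make the interleaving idea concrete, the following is a minimal sketch of how per-condition attribute tokens might be woven into a text–image sequence before it reaches the diffusion transformer. The token names (`<style>`, `<content>`, `<subject>`, `<identity>`), the `Segment` structure, and the function name are illustrative assumptions, not SIGMA's actual implementation or API.

```python
# Hypothetical sketch: interleaving attribute tokens with conditioning images.
# All names and structures here are assumptions for illustration only.
from dataclasses import dataclass

ATTRIBUTE_TOKENS = ("<style>", "<content>", "<subject>", "<identity>")

@dataclass
class Segment:
    kind: str      # "text", "image", or "attribute"
    payload: str   # text span, image identifier, or attribute token

def build_interleaved_sequence(conditions):
    """Flatten (attribute, image) condition pairs into one interleaved
    sequence, tagging each conditioning image with its attribute token
    so the model can selectively attend to the attribute it transfers."""
    seq = []
    for attr, image_id in conditions:
        if attr not in ATTRIBUTE_TOKENS:
            raise ValueError(f"unknown attribute token: {attr}")
        seq.append(Segment("attribute", attr))
        seq.append(Segment("image", image_id))
    return seq

# Example: a prompt plus a style reference and an identity reference.
prompt = Segment("text", "a portrait in the reference style")
conds = [("<style>", "img_style_01"), ("<identity>", "img_face_02")]
sequence = [prompt] + build_interleaved_sequence(conds)
```

In this sketch each conditioning image is preceded by its attribute token, so a selective mechanism could mask or weight attention per attribute; how SIGMA actually routes these tokens inside the transformer is not specified here.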