SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing unified diffusion models, which typically support only a single conditioning input and struggle to flexibly integrate heterogeneous visual conditions. To overcome this, we propose a post-training framework that introduces multi-attribute tokens—representing style, content, subject, and identity—into a diffusion Transformer, along with a novel selective interleaving mechanism. This enables compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Built upon the Bagel unified backbone, our approach leverages 700K interleaved text–image sequences for post-training to construct efficient multi-attribute embedding and conditional fusion modules. Experiments demonstrate that our method significantly outperforms Bagel in compositional generation tasks, achieving notable improvements in controllability, cross-condition consistency, and visual quality.

📝 Abstract
Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
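To make the token layout concrete, the sketch below shows one plausible way to assemble the interleaved text-image sequence the abstract describes: each condition image contributes a stream of attribute tokens tagged by type (style, content, subject, or identity), and a selection set picks which streams enter the sequence, mirroring selective attribute transfer. All names and the marker scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of SIGMA-style selective interleaving.
# The begin/end markers and function names are assumptions for
# illustration; the paper's real token layout is not specified here.

ATTRIBUTE_TYPES = ("style", "content", "subject", "identity")

def build_interleaved_sequence(text_tokens, conditions, selected):
    """Interleave prompt tokens with the selected attribute streams.

    text_tokens : list of str, the prompt tokens
    conditions  : list of (attr_type, tokens) pairs, one per condition image
    selected    : set of attribute types to keep (selective transfer)
    """
    seq = []
    for attr_type, tokens in conditions:
        if attr_type not in ATTRIBUTE_TYPES:
            raise ValueError(f"unknown attribute type: {attr_type}")
        if attr_type in selected:
            # Wrap each stream in markers so the transformer can
            # associate the tokens with their source condition.
            seq.append(f"<{attr_type}>")
            seq.extend(tokens)
            seq.append(f"</{attr_type}>")
    seq.extend(text_tokens)
    return seq

# Example: transfer only the style of the first condition image.
seq = build_interleaved_sequence(
    text_tokens=["a", "cat"],
    conditions=[("style", ["s1", "s2"]), ("identity", ["i1"])],
    selected={"style"},
)
# The identity stream is dropped; the style stream precedes the prompt.
```

In a real diffusion transformer the streams would be embedding vectors rather than strings, but the selection logic would be analogous.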
Problem

Research questions and friction points this paper is trying to address.

multi-condition generation
diffusion transformer
visual task alignment
heterogeneous sources
unified model
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-condition generation
selective multi-attribute tokens
diffusion transformer
compositional editing
multimodal alignment