Loom: Diffusion-Transformer for Interleaved Generation

📅 2025-12-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of interleaved text-image generation, aiming to synthesize semantically aligned and temporally coherent multimodal sequences for applications including style transfer, compositional generation, and procedural tutorial modeling. We propose the first interleaved diffusion-Transformer architecture that jointly models language-guided planning and local frame conditioning, enabling alternating token-level generation of text and images within a unified sequence. Our method employs full-parameter fine-tuning of the Bagel model, incorporating alternating token embeddings, sliding-window historical frame sampling, and global text joint conditioning. Evaluated on a newly curated 50K tutorial dataset and multiple benchmarks, our approach significantly outperforms the Anole baseline: semantic and temporal consistency scores improve by an average of 2.6 points (on a 5-point scale), demonstrating enhanced long-range temporal controllability and cross-modal alignment capability.

Technology Category

Application Category

📝 Abstract

Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.

Problem

Research questions and friction points this paper is trying to address.

Generates interleaved text-image sequences coherently

Enhances temporal consistency and text-image alignment

Improves long-horizon generation efficiency and controllability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion-transformer framework for interleaved text-image generation

Language planning decomposes instructions into stepwise prompts and embeddings

Conditions on sampled prior frames for controllable long-horizon generation

🔎 Similar Papers

No similar papers found.

Authors to Follow