LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current unified multimodal pretraining incurs high computational cost, underperforms task-specific models on downstream tasks, and suffers from slow image generation, hindering real-time deployment. To address these limitations, the paper proposes LaTtE-Flow, a Layerwise Timestep-Expert flow-based Transformer built on a pretrained vision-language model (VLM). LaTtE-Flow integrates flow matching with a hierarchical Transformer and introduces two components: a layerwise timestep-expert mechanism and timestep-conditioned residual attention. Together these decouple the flow-matching process across groups of layers and enable cross-layer feature reuse. The framework jointly models visual understanding and generation: it retains strong performance on understanding benchmarks while achieving generation quality competitive with advanced diffusion models, and inference is roughly 6× faster than baseline diffusion approaches. Overall, LaTtE-Flow significantly advances the efficiency and practicality of unified multimodal models.

📝 Abstract
Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to match the performance of models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
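The layerwise timestep-expert idea described in the abstract can be sketched in a few lines of PyTorch: the Transformer's blocks are partitioned into groups, each group is assigned a contiguous range of flow-matching timesteps, and each Euler sampling step runs only the group responsible for the current timestep. This is a minimal illustrative sketch under assumed dimensions and a simple uniform timestep-to-group routing rule, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class LayerwiseTimestepExpertFlow(nn.Module):
    """Sketch: n_layers blocks split into n_groups expert groups; each group
    handles a contiguous slice of timesteps, so only n_layers / n_groups
    layers run per sampling step (a hypothetical simplification)."""

    def __init__(self, dim=64, n_layers=8, n_groups=4):
        super().__init__()
        assert n_layers % n_groups == 0
        self.n_groups = n_groups
        self.group_size = n_layers // n_groups
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.velocity_head = nn.Linear(dim, dim)

    def expert_group(self, t):
        # Map t in [0, 1) to a group index: early timesteps use the first
        # group of layers, late timesteps the last (assumed routing rule).
        g = min(int(t * self.n_groups), self.n_groups - 1)
        return self.blocks[g * self.group_size:(g + 1) * self.group_size]

    def forward(self, x, t):
        # Only the expert group for this timestep is activated.
        for block in self.expert_group(t):
            x = block(x)
        return self.velocity_head(x)

@torch.no_grad()
def sample(model, x, n_steps=4):
    """Euler integration of the predicted velocity field from t=0 to t=1."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * model(x, t)  # one layer group active per step
    return x
```

With 8 layers in 4 groups, each of the 4 sampling steps above touches only 2 blocks, which is the source of the sampling-efficiency gain the abstract describes: compute per step scales with the group size, not the full depth.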
Problem

Research questions and friction points this paper is trying to address.

Unifying image understanding and generation efficiently within a single model
Slow image generation in unified multimodal models limits real-time deployment
Unified models underperform task-specific models without extensive pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layerwise Timestep Experts flow-based architecture
Flow-matching across specialized Transformer layers
Timestep-Conditioned Residual Attention mechanism
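The abstract describes the Timestep-Conditioned Residual Attention mechanism only as "efficient information reuse across layers." One plausible reading is a residual path that re-injects features cached from an earlier layer group, scaled by a gate computed from a timestep embedding. The sketch below is a guess at that shape; the class name, gating form, and dimensions are all assumptions, not the paper's formulation:

```python
import torch
import torch.nn as nn

class TimestepConditionedResidual(nn.Module):
    """Hypothetical sketch: re-inject cached features from an earlier layer,
    weighted by a gate predicted from a timestep embedding."""

    def __init__(self, dim=64):
        super().__init__()
        # Embed the scalar timestep, then predict a scalar gate from it.
        self.t_embed = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.gate = nn.Linear(dim, 1)

    def forward(self, x, cached, t):
        # x, cached: (batch, seq, dim); t: python float in [0, 1).
        te = self.t_embed(torch.full((x.shape[0], 1), t))
        g = torch.sigmoid(self.gate(te)).unsqueeze(1)  # (batch, 1, 1) in (0, 1)
        return x + g * cached
```

Under this reading, the gate lets each timestep decide how strongly to reuse earlier-layer features, which would let the specialized layer groups share information without rerunning the earlier layers.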