LaVin-DiT: Large Vision Diffusion Transformer

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Existing large vision models inherit autoregressive architectures from NLP, which are computationally inefficient and disrupt the intrinsic spatial structure of images. Method: This paper proposes a scalable, unified generative vision foundation model. It introduces a Spatial-Temporal Variational Autoencoder (ST-VAE) paired with a diffusion transformer to enable conditional generation in the latent space, along with an in-context-learning-driven, multi-task unified training paradigm that supports zero-shot cross-task generalization. Contribution/Results: The model scales from 0.1B to 3.4B parameters and achieves state-of-the-art performance across 20+ diverse vision tasks, significantly improving scalability and generalization and enabling zero-shot transfer without task-specific fine-tuning.

๐Ÿ“ Abstract
This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models are available.
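The inference procedure described above (encode context pairs and the query into a continuous latent space, iteratively denoise conditioned on that context, then decode) can be sketched as follows. This is a hypothetical toy illustration, not the paper's released code: the function names (`st_vae_encode`, `st_vae_decode`, `denoise_step`, `generate`), latent size, and the trivial linear "denoiser" are all placeholder assumptions standing in for the trained ST-VAE and diffusion transformer.

```python
# Toy sketch of LaVin-DiT-style in-context latent-diffusion inference.
# All names, shapes, and the linear "denoiser" below are illustrative
# placeholders, not the paper's actual architecture.
import numpy as np

LATENT_DIM = 16  # toy latent size standing in for the ST-VAE latent space


def st_vae_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for the spatial-temporal VAE encoder: map an image into a
    continuous latent vector (here, a fixed random projection)."""
    rng = np.random.default_rng(0)  # fixed seed so the projection is reusable
    proj = rng.standard_normal((LATENT_DIM, image.size))
    return proj @ image.ravel() / np.sqrt(image.size)


def st_vae_decode(latent: np.ndarray, shape=(8, 8)) -> np.ndarray:
    """Stand-in for the ST-VAE decoder (another fixed projection)."""
    rng = np.random.default_rng(1)
    proj = rng.standard_normal((int(np.prod(shape)), LATENT_DIM))
    return (proj @ latent).reshape(shape)


def denoise_step(z: np.ndarray, t: int, context: np.ndarray) -> np.ndarray:
    """Toy reverse-diffusion step: nudge the noisy latent toward the mean of
    the context latents. A real model would run a diffusion transformer
    conditioned on the task context here."""
    return z + 0.1 * (context.mean(axis=0) - z)


def generate(context_pairs, query_image, steps=50):
    """Condition on (input, target) context pairs, then progressively denoise
    a random latent for the query image and decode the result."""
    # Encode every context input/target plus the query into the latent space.
    context = np.stack(
        [st_vae_encode(x) for pair in context_pairs for x in pair]
        + [st_vae_encode(query_image)]
    )
    z = np.random.default_rng(2).standard_normal(LATENT_DIM)  # start from noise
    for t in reversed(range(steps)):
        z = denoise_step(z, t, context)
    return st_vae_decode(z)


# Example: two (input, target) pairs define the task; the model then produces
# an output for the query without any task-specific fine-tuning.
pairs = [(np.ones((8, 8)), np.zeros((8, 8))) for _ in range(2)]
out = generate(pairs, query_image=np.full((8, 8), 0.5))
print(out.shape)  # (8, 8)
```

The key design point the sketch mirrors is that the task is specified purely by the context set at inference time, so switching tasks means swapping the context pairs, not retraining the model.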
Problem

Research questions and friction points this paper is trying to address.

Addresses the high dimensionality of visual data with a spatial-temporal variational autoencoder.
Develops a joint diffusion transformer that progressively generates vision outputs.
Implements in-context learning for unified multi-task training without task-specific fine-tuning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-temporal variational autoencoder for encoding
Joint diffusion transformer for generative modeling
In-context learning for unified multi-task training