Vision Bridge Transformer at Scale

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low sampling efficiency of conventional diffusion models, this work proposes the Vision Bridge Transformer (ViBT), a direct trajectory-modeling approach grounded in Brownian bridges for efficient data-to-data translation. ViBT is the first bridge model scaled to 1.3B and 20B parameters, and it incorporates a variance-stabilized velocity-matching training objective that substantially improves training stability and convergence at scale. The architecture is entirely transformer-based, enabling instruction-driven image editing and complex video translation. Experiments show that ViBT achieves state-of-the-art performance across multiple image and video generation benchmarks, validating the effectiveness, scalability, and practicality of large-scale bridge models for multimodal conditional generation.

📝 Abstract
We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
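The abstract's core idea is that a bridge model pins its stochastic process at the input at t=0 and the output at t=1, rather than starting from pure noise. A minimal sketch of that Brownian-bridge interpolant follows; the function name, the `sigma` parameter, and the exact marginal used here are assumptions based on the standard Brownian-bridge formulation, not details from the paper.

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample x_t on a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Standard marginal: x_t ~ N((1-t)*x0 + t*x1, sigma^2 * t*(1-t) * I).
    Noise vanishes at both endpoints, so the process maps data to data.
    """
    rng = rng or np.random.default_rng(0)
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(np.shape(x0))

# Endpoints are pinned: no noise at t = 0 or t = 1.
x0, x1 = np.zeros(4), np.ones(4)
assert np.allclose(brownian_bridge_sample(x0, x1, 0.0), x0)
assert np.allclose(brownian_bridge_sample(x0, x1, 1.0), x1)
```

Because the bridge starts at the conditioning input instead of Gaussian noise, sampling only has to traverse the gap between input and output, which is the source of the efficiency claim.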
Problem

Research questions and friction points this paper is trying to address.

Conventional diffusion models transform noise into data, making sampling inefficient for conditional tasks
Data-to-data tasks such as image editing and video translation need a direct input-to-output mapping
Bridge models had not previously been trained stably at large parameter scales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Brownian bridge formulation directly models trajectories from inputs to outputs
Fully transformer-based architecture scaled to 1.3B and 20B parameters
Variance-stabilized velocity-matching objective for robust large-scale training
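The stabilization problem the objective addresses can be sketched concretely: the conditional drift of a Brownian bridge toward its endpoint is (x1 - xt) / (1 - t), whose magnitude diverges as t approaches 1, making naive velocity regression unstable. The sketch below shows one plausible fix, rescaling the target by (1 - t) so it stays bounded; the function names are hypothetical and the paper's exact stabilization may differ.

```python
import numpy as np

def bridge_velocity_target(x1, xt, t):
    """Raw bridge drift toward the endpoint; blows up as t -> 1."""
    return (x1 - xt) / (1.0 - t)

def stabilized_velocity_target(x1, xt, t):
    """Rescale the drift by (1 - t) so regression targets stay bounded.

    One plausible form of variance stabilization (assumption, not the
    paper's stated objective); note the rescaled target equals x1 - xt.
    """
    return (1.0 - t) * bridge_velocity_target(x1, xt, t)

def velocity_matching_loss(pred, x1, xt, t):
    """Mean-squared error against the stabilized velocity target."""
    return float(np.mean((pred - stabilized_velocity_target(x1, xt, t)) ** 2))
```

Near t = 1 the raw target can be arbitrarily large while the stabilized target remains on the scale of the data gap, which is the kind of bounded regression target that makes training stable at scale.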