🤖 AI Summary
To address the low sampling efficiency of conventional diffusion models, this work proposes the Vision Bridge Transformer (ViBT), a direct trajectory modeling approach grounded in Brownian bridges for efficient data-to-data translation. ViBT is the first bridge model scaled to 20B and 1.3B parameters, incorporating a variance-stabilized velocity-matching training objective that substantially improves training stability and convergence at scale. The architecture is entirely transformer-based, enabling instruction-driven image editing and complex video translation. Experiments demonstrate that ViBT achieves state-of-the-art performance across multiple image and video generation benchmarks, validating the effectiveness, scalability, and practicality of large-scale bridge models for multimodal conditional generation.
📝 Abstract
We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
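To make the data-to-data paradigm concrete, the following is a minimal NumPy sketch of the core idea behind bridge models as described in the abstract: a Brownian bridge pins the stochastic trajectory at source data `x0` (at t=0) and target data `x1` (at t=1), and the network is trained to match a velocity along that trajectory. The function names, the `sigma` parameter, and the simple drift target `x1 - x0` are illustrative assumptions; the paper's actual variance-stabilized objective and architecture are not specified in this abstract.

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample x_t on a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    The mean interpolates linearly between the two data points, and the
    noise scale sigma * sqrt(t * (1 - t)) vanishes at both endpoints, so
    the trajectory connects data to data rather than noise to data.
    """
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(np.shape(x0))
    return (1.0 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1.0 - t)) * eps

def velocity_matching_loss(model, x0, x1, t, sigma=1.0, rng=None):
    """Toy velocity-matching objective (an assumption, not the paper's
    exact loss): regress the model's prediction at (x_t, t) onto the
    deterministic drift of the bridge mean, x1 - x0."""
    x_t = brownian_bridge_sample(x0, x1, t, sigma, rng)
    target = x1 - x0  # time-derivative of the bridge mean
    pred = model(x_t, t)
    return float(np.mean((pred - target) ** 2))
```

As a sanity check, an oracle model that always outputs `x1 - x0` incurs zero loss, and the sampled state reduces exactly to `x0` at t=0 and `x1` at t=1 because the bridge variance vanishes at the endpoints.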