OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

📅 2026-01-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes a unified vision encoder capable of jointly supporting both image understanding and generation tasks. The method feeds VAE-compressed image latents into a Vision Transformer (ViT) encoder and, for the first time, co-trains generative and discriminative capabilities within a single encoder through joint optimization of reconstruction loss and image–text semantic alignment. By integrating contrastive learning, a Regularized Autoencoder (RAE) generative architecture, and multimodal alignment objectives, the framework achieves competitive performance: with the encoder frozen, it matches CLIP in understanding accuracy (62.4 vs. 62.2 on SeedBench) while significantly outperforming it in generation quality (ImageNet gFID of 1.89 vs. 2.54). This demonstrates the effectiveness of the approach in learning task-agnostic, efficient visual representations.

πŸ“ Abstract
This paper presents a family of advanced vision encoders, named OpenVision 3, that learn a single, unified visual representation serving both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
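The joint objective described in the abstract can be sketched as a single training loss combining a reconstruction term on VAE latents with an image-text contrastive term on the shared representation. The sketch below is illustrative only: module names, dimensions, the temperature value, and the loss weights are assumptions, not details from the paper, and tiny linear layers stand in for the actual ViT encoder, ViT-VAE decoder, and text encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUnifiedEncoder(nn.Module):
    """Toy stand-in for OpenVision 3's encoder/decoder pair (hypothetical sizes)."""
    def __init__(self, latent_dim=16, embed_dim=32):
        super().__init__()
        self.encoder = nn.Linear(latent_dim, embed_dim)   # stands in for the ViT encoder
        self.decoder = nn.Linear(embed_dim, latent_dim)   # stands in for the ViT-VAE decoder
        self.proj = nn.Linear(embed_dim, embed_dim)       # projection head for alignment

def joint_loss(model, latents, text_embeds, recon_weight=1.0, align_weight=1.0):
    """Reconstruction + image-text contrastive objective on a shared representation."""
    z = model.encoder(latents)                  # shared latent-space representation

    # Generative signal: reconstruct the VAE latents from the representation.
    recon = model.decoder(z)
    recon_loss = F.mse_loss(recon, latents)

    # Semantic signal: CLIP-style symmetric contrastive loss against text embeddings.
    img = F.normalize(model.proj(z), dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    logits = img @ txt.t() / 0.07               # temperature is an assumed value
    targets = torch.arange(latents.size(0))
    align_loss = (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets)) / 2

    return recon_weight * recon_loss + align_weight * align_loss
```

Both terms backpropagate into the same encoder, which is the key design point: a single set of weights is pushed toward generative structure and semantic alignment simultaneously (the paper additionally uses an image-captioning objective, omitted here for brevity).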
Problem

Research questions and friction points this paper is trying to address.

unified visual encoder
image understanding
image generation
visual representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified visual representation
joint optimization
vision encoder
image understanding and generation
shared latent space