🤖 AI Summary
The fundamental question of whether generation facilitates understanding remains insufficiently validated at the scale of modern multimodal data. This paper introduces UniHetero, a unified architecture that, for the first time, empirically demonstrates on over 200 million samples that semantic-level image generation, rather than pixel-level reconstruction, significantly enhances visual understanding. Methodologically, we propose an autoregressive decoding paradigm in the input embedding space to enable vision-language joint pretraining. Our key contributions are threefold: (1) we uncover the intrinsic mechanism by which semantic generation boosts understanding; (2) we establish its superior data scalability and data utilization efficiency; and (3) we show that our approach outperforms pure understanding models on downstream understanding tasks (e.g., image classification), that understanding performance improves consistently as the training data scale grows, and that the approach strengthens fine-grained visual representation learning.
📝 Abstract
Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding remains under-explored at large data scale. In this work, we analyze UniHetero, a unified model with a concise structure, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective in capturing visual details.
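To make observation (3) concrete, the sketch below illustrates one plausible form of autoregression on input embeddings: the decoder's hidden state at step t is regressed onto the visual input embedding at step t+1, so the generation target is a semantic feature rather than raw pixels. This is a minimal illustration under our own assumptions; the module names, tensor shapes, and the MSE objective are placeholders and are not taken from the UniHetero implementation.

```python
# Minimal sketch (assumption, not the paper's code): semantic-level autoregression
# on visual input embeddings instead of pixel-level reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NextEmbeddingHead(nn.Module):
    """Projects decoder hidden states into the visual input-embedding space."""

    def __init__(self, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the unified decoder
        return self.proj(hidden_states)


def semantic_ar_loss(hidden_states: torch.Tensor,
                     input_embeddings: torch.Tensor,
                     head: NextEmbeddingHead) -> torch.Tensor:
    """Regress the hidden state at step t onto the input embedding at step t+1,
    so the generation target is a semantic feature rather than raw pixels."""
    pred = head(hidden_states[:, :-1])          # predictions for steps 1..T-1
    target = input_embeddings[:, 1:].detach()   # next-step input embeddings as targets
    return F.mse_loss(pred, target)             # cosine or smooth-L1 are also plausible


if __name__ == "__main__":
    batch, seq_len, hidden_dim, embed_dim = 2, 16, 1024, 768
    head = NextEmbeddingHead(hidden_dim, embed_dim)
    hidden = torch.randn(batch, seq_len, hidden_dim)    # dummy decoder outputs
    vis_emb = torch.randn(batch, seq_len, embed_dim)    # dummy visual input embeddings
    loss = semantic_ar_loss(hidden, vis_emb, head)
    loss.backward()
    print(f"semantic AR loss: {loss.item():.4f}")
```

In a unified model trained for vision-language joint pretraining, such a loss would presumably be optimized alongside the standard next-token language-modeling loss on text tokens; that combination is an assumption here, not a detail stated in this section.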