UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
The fundamental question—“Does generation facilitate understanding?”—remains insufficiently validated under large-scale multimodal data. This paper introduces UniHetero, a unified architecture that, for the first time, empirically demonstrates on over 200 million samples that semantic-level image generation—not pixel-level reconstruction—significantly enhances visual understanding. Methodologically, we propose an autoregressive decoding paradigm in the input embedding space to enable vision-language joint pretraining. Our key contributions are threefold: (1) We uncover the intrinsic mechanism by which semantic generation boosts understanding; (2) We establish its superior data scalability and utilization efficiency; and (3) We show that our approach outperforms pure understanding models on downstream understanding tasks (e.g., image classification), with understanding performance consistently improving as training data scale increases, while also strengthening fine-grained visual representation learning.

📝 Abstract
Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze a unified model with a concise structure, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if semantics are generated, not pixels. (2) Generation reveals a superior data-scaling trend and higher data utilization. (3) Autoregression on input embeddings is effective for capturing visual details.
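The "autoregression on input embeddings" idea can be sketched as a next-embedding regression: rather than decoding pixels, a causal transformer predicts the next visual token's input embedding, so the regression target stays at the semantic level. This is a minimal illustrative sketch, not the paper's implementation; all module names, sizes, and the MSE objective are assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingAutoregressor(nn.Module):
    """Causal transformer that regresses the next input embedding."""
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, dim)  # project hidden state to an embedding prediction

    def forward(self, embeds):
        # embeds: (batch, seq, dim) visual token embeddings from the vision encoder
        seq = embeds.size(1)
        # causal mask: position t may only attend to positions <= t
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        hidden = self.decoder(embeds, mask=mask)
        return self.head(hidden)

def next_embedding_loss(model, embeds):
    # shift-by-one targets: predict embedding t+1 from embeddings <= t
    pred = model(embeds[:, :-1])
    target = embeds[:, 1:].detach()  # targets are the input embeddings themselves
    return nn.functional.mse_loss(pred, target)

model = EmbeddingAutoregressor()
x = torch.randn(2, 8, 64)  # 2 images, 8 visual tokens each, 64-dim embeddings
loss = next_embedding_loss(model, x)
```

The key design point the abstract hints at is the target space: the supervision signal is the encoder's own input embeddings (semantic level) rather than reconstructed pixels.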
Problem

Research questions and friction points this paper is trying to address.

Explores whether generation tasks enhance vision-language model understanding
Analyzes the unified model UniHetero under large-scale pretraining (>200M samples)
Investigates semantic generation's impact on data scaling and data utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating semantics, not pixels, improves understanding.
Generation shows superior data scaling and utilization.
Autoregression on input embeddings captures visual details.
Fengjiao Chen
Meituan, Beijing, China.
Minhao Jing
Meituan, Beijing, China.
Weitao Lu
Meituan, Beijing, China.
Yan Feng
Hangzhou Institute of Advanced Study, UCAS
Raman lasers, fiber lasers, nonlinear photonics, laser guide star, optical magnetometry
Xiaoyu Li
Meituan, Beijing, China.
Xuezhi Cao
Meituan
Data Mining, Knowledge Graph, LLMs