🤖 AI Summary
The fundamental question of whether generation facilitates understanding remains insufficiently validated at the scale of modern multimodal data. This paper introduces UniHetero, a unified architecture that, for the first time, empirically demonstrates on over 200 million samples that semantic-level image generation, rather than pixel-level reconstruction, significantly enhances visual understanding. Methodologically, we propose an autoregressive decoding paradigm in the input embedding space to enable vision-language joint pretraining. Our key contributions are threefold: (1) we uncover the intrinsic mechanism by which semantic generation boosts understanding; (2) we establish its superior data scalability and data utilization efficiency; and (3) we show that our approach outperforms pure understanding models on downstream understanding tasks (e.g., image classification), that understanding performance improves consistently as the training data scale grows, and that the approach strengthens fine-grained visual representation learning.
📝 Abstract
Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding remains under-explored at large data scale. In this work, we analyze UniHetero, a unified model with a concise structure, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective in capturing visual details.
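To make observation (3) concrete, the sketch below illustrates one plausible form of autoregression on input embeddings: the decoder's hidden state at step t is regressed onto the visual input embedding at step t+1, so the generation target is a semantic feature rather than raw pixels. This is a minimal illustration under our own assumptions; the module names, tensor shapes, and the MSE objective are placeholders and are not taken from the UniHetero implementation.

```python
# Minimal sketch (assumption, not the paper's code): semantic-level autoregression
# on visual input embeddings instead of pixel-level reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NextEmbeddingHead(nn.Module):
    """Projects decoder hidden states into the visual input-embedding space."""

    def __init__(self, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the unified decoder
        return self.proj(hidden_states)


def semantic_ar_loss(hidden_states: torch.Tensor,
                     input_embeddings: torch.Tensor,
                     head: NextEmbeddingHead) -> torch.Tensor:
    """Regress the hidden state at step t onto the input embedding at step t+1,
    so the generation target is a semantic feature rather than raw pixels."""
    pred = head(hidden_states[:, :-1])          # predictions for steps 1..T-1
    target = input_embeddings[:, 1:].detach()   # next-step input embeddings as targets
    return F.mse_loss(pred, target)             # cosine or smooth-L1 are also plausible


if __name__ == "__main__":
    batch, seq_len, hidden_dim, embed_dim = 2, 16, 1024, 768
    head = NextEmbeddingHead(hidden_dim, embed_dim)
    hidden = torch.randn(batch, seq_len, hidden_dim)    # dummy decoder outputs
    vis_emb = torch.randn(batch, seq_len, embed_dim)    # dummy visual input embeddings
    loss = semantic_ar_loss(hidden, vis_emb, head)
    loss.backward()
    print(f"semantic AR loss: {loss.item():.4f}")
```

In a unified model trained for vision-language joint pretraining, such a loss would presumably be optimized alongside the standard next-token language-modeling loss on text tokens; that combination is an assumption here, not a detail stated in this section.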