🤖 AI Summary
Existing open-source unified multimodal large language models exhibit a significant trade-off between image understanding and generation capabilities. To address this, we propose a scalable unified framework featuring a hybrid visual tokenization mechanism: a shared vision encoder coupled with two lightweight adapters enables simultaneous continuous-feature-based understanding and discrete-token-based generation. Built upon a unified autoregressive LLM architecture, the model processes image-to-text understanding via continuous embeddings and text-to-image generation via discrete image tokens; an auxiliary diffusion decoder then renders the predicted image tokens into pixels. Crucially, both tasks are co-optimized within a shared semantic space to mitigate cross-modal interference. Experiments demonstrate state-of-the-art results among unified models and competitive performance with specialized models, particularly in text-rich scenarios, along with consistent gains from scaling and minimal task interference.
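To make the hybrid tokenization mechanism concrete, here is a minimal PyTorch sketch, not the authors' implementation: one shared encoder feeds a continuous adapter (for understanding) and a VQ-style discrete adapter (for generation). All module names, dimensions, and the nearest-codebook lookup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridImageTokenizer(nn.Module):
    """Sketch: a single shared vision encoder feeds a continuous adapter
    (image-to-text understanding) and a discrete, VQ-style adapter
    (text-to-image generation). Names and sizes are illustrative only."""

    def __init__(self, patch_dim=768, llm_dim=2048, codebook_size=16384):
        super().__init__()
        # Stand-in for the shared vision encoder (a real system would use
        # a full ViT backbone over image patches).
        self.encoder = nn.Linear(patch_dim, patch_dim)
        # Continuous adapter: projects shared features into the LLM
        # embedding space for understanding.
        self.continuous_adapter = nn.Linear(patch_dim, llm_dim)
        # Discrete adapter: a learned codebook; nearest-entry lookup
        # yields image token ids for generation.
        self.codebook = nn.Embedding(codebook_size, patch_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim)
        feats = self.encoder(patches)             # shared semantic features
        cont = self.continuous_adapter(feats)     # continuous embeddings

        # VQ-style quantization: assign each patch feature to its
        # nearest codebook entry.
        flat = feats.reshape(-1, feats.size(-1))  # (batch*num_patches, D)
        dists = torch.cdist(flat, self.codebook.weight)
        ids = dists.argmin(dim=-1).view(feats.shape[:-1])
        return cont, ids                          # both views, one encoder


tok = HybridImageTokenizer()
patches = torch.randn(2, 256, 768)
cont_emb, image_tokens = tok(patches)
print(cont_emb.shape, image_tokens.shape)  # (2, 256, 2048) and (2, 256)
```

Because both adapters read the same encoder features, the two token views live in a common semantic space, which is the property the framework relies on to keep understanding and generation from interfering.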
📝 Abstract
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models and is competitive with specialist models, particularly on text-rich evaluations. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
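As a rough illustration of the unified autoregressive design, the hypothetical sketch below shows a single transformer predicting over a joint text-plus-image-token vocabulary: the understanding path consumes continuous image embeddings alongside text, while the generation path autoregressively samples discrete image tokens that a diffusion decoder (not shown) would render into pixels. Every name, size, and sampling choice here is an assumption for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32000    # illustrative vocabulary sizes
IMAGE_VOCAB = 16384   # size of the discrete image-token codebook
LLM_DIM = 256         # tiny for the example

class UnifiedLM(nn.Module):
    """Sketch of a unified autoregressive LLM: one transformer with a
    joint output head over text tokens and discrete image tokens."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, LLM_DIM)
        layer = nn.TransformerEncoderLayer(LLM_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in depth
        self.head = nn.Linear(LLM_DIM, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, token_ids=None, soft_embeds=None):
        # Understanding path: continuous image embeddings (from the
        # continuous adapter) are prepended to embedded text tokens.
        parts = []
        if soft_embeds is not None:
            parts.append(soft_embeds)
        if token_ids is not None:
            parts.append(self.embed(token_ids))
        x = torch.cat(parts, dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.backbone(x, mask=causal))

@torch.no_grad()
def generate_image_tokens(model, prompt_ids, num_tokens=16):
    """Generation path: autoregressively sample discrete image tokens;
    an auxiliary diffusion decoder would then translate them to pixels."""
    seq = prompt_ids
    for _ in range(num_tokens):
        logits = model(token_ids=seq)[:, -1]
        # Restrict decoding to the image-token slice of the joint vocab.
        next_id = logits[:, TEXT_VOCAB:].argmax(dim=-1) + TEXT_VOCAB
        seq = torch.cat([seq, next_id[:, None]], dim=1)
    return seq[:, prompt_ids.size(1):] - TEXT_VOCAB  # codebook indices

model = UnifiedLM()
prompt = torch.randint(0, TEXT_VOCAB, (1, 8))  # tokenized text prompt
img_tokens = generate_image_tokens(model, prompt)
print(img_tokens.shape)  # torch.Size([1, 16])
```

Sharing one backbone and one output head across both token types is what lets understanding and generation data be mixed in a single training recipe; the trade-off the paper targets shows up when the two objectives pull a single tokenizer in different directions, which the hybrid design avoids.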