🤖 AI Summary
Existing open-source unified multimodal large language models exhibit a significant trade-off between image understanding and generation capabilities. To address this, we propose a scalable unified framework featuring a hybrid visual tokenization mechanism: a shared vision encoder coupled with two lightweight adapters enables simultaneous continuous-feature-based understanding and discrete-token-based generation. Built upon a unified autoregressive LLM architecture, the model processes image-to-text understanding via continuous embeddings and text-to-image generation via discrete image tokens; an auxiliary diffusion decoder then renders the predicted image tokens into pixels. Crucially, both tasks are co-optimized within a shared semantic space to mitigate cross-modal interference. Experiments demonstrate state-of-the-art results among unified models and competitive performance with specialized models, particularly in text-rich scenarios, along with consistent gains from scaling and minimal task interference.
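To make the hybrid tokenization mechanism concrete, here is a minimal PyTorch sketch, not the authors' implementation: one shared encoder feeds a continuous adapter (for understanding) and a VQ-style discrete adapter (for generation). All module names, dimensions, and the nearest-codebook lookup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridImageTokenizer(nn.Module):
    """Sketch: a single shared vision encoder feeds a continuous adapter
    (image-to-text understanding) and a discrete, VQ-style adapter
    (text-to-image generation). Names and sizes are illustrative only."""

    def __init__(self, patch_dim=768, llm_dim=2048, codebook_size=16384):
        super().__init__()
        # Stand-in for the shared vision encoder (a real system would use
        # a full ViT backbone over image patches).
        self.encoder = nn.Linear(patch_dim, patch_dim)
        # Continuous adapter: projects shared features into the LLM
        # embedding space for understanding.
        self.continuous_adapter = nn.Linear(patch_dim, llm_dim)
        # Discrete adapter: a learned codebook; nearest-entry lookup
        # yields image token ids for generation.
        self.codebook = nn.Embedding(codebook_size, patch_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim)
        feats = self.encoder(patches)             # shared semantic features
        cont = self.continuous_adapter(feats)     # continuous embeddings

        # VQ-style quantization: assign each patch feature to its
        # nearest codebook entry.
        flat = feats.reshape(-1, feats.size(-1))  # (batch*num_patches, D)
        dists = torch.cdist(flat, self.codebook.weight)
        ids = dists.argmin(dim=-1).view(feats.shape[:-1])
        return cont, ids                          # both views, one encoder


tok = HybridImageTokenizer()
patches = torch.randn(2, 256, 768)
cont_emb, image_tokens = tok(patches)
print(cont_emb.shape, image_tokens.shape)  # (2, 256, 2048) and (2, 256)
```

Because both adapters read the same encoder features, the two token views live in a common semantic space, which is the property the framework relies on to keep understanding and generation from interfering.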
📝 Abstract
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models and is competitive with specialist models, particularly on text-rich evaluations. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
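As a rough illustration of the unified autoregressive design, the hypothetical sketch below shows a single transformer predicting over a joint text-plus-image-token vocabulary: the understanding path consumes continuous image embeddings alongside text, while the generation path autoregressively samples discrete image tokens that a diffusion decoder (not shown) would render into pixels. Every name, size, and sampling choice here is an assumption for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32000    # illustrative vocabulary sizes
IMAGE_VOCAB = 16384   # size of the discrete image-token codebook
LLM_DIM = 256         # tiny for the example

class UnifiedLM(nn.Module):
    """Sketch of a unified autoregressive LLM: one transformer with a
    joint output head over text tokens and discrete image tokens."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, LLM_DIM)
        layer = nn.TransformerEncoderLayer(LLM_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in depth
        self.head = nn.Linear(LLM_DIM, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, token_ids=None, soft_embeds=None):
        # Understanding path: continuous image embeddings (from the
        # continuous adapter) are prepended to embedded text tokens.
        parts = []
        if soft_embeds is not None:
            parts.append(soft_embeds)
        if token_ids is not None:
            parts.append(self.embed(token_ids))
        x = torch.cat(parts, dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.backbone(x, mask=causal))

@torch.no_grad()
def generate_image_tokens(model, prompt_ids, num_tokens=16):
    """Generation path: autoregressively sample discrete image tokens;
    an auxiliary diffusion decoder would then translate them to pixels."""
    seq = prompt_ids
    for _ in range(num_tokens):
        logits = model(token_ids=seq)[:, -1]
        # Restrict decoding to the image-token slice of the joint vocab.
        next_id = logits[:, TEXT_VOCAB:].argmax(dim=-1) + TEXT_VOCAB
        seq = torch.cat([seq, next_id[:, None]], dim=1)
    return seq[:, prompt_ids.size(1):] - TEXT_VOCAB  # codebook indices

model = UnifiedLM()
prompt = torch.randint(0, TEXT_VOCAB, (1, 8))  # tokenized text prompt
img_tokens = generate_image_tokens(model, prompt)
print(img_tokens.shape)  # torch.Size([1, 16])
```

Sharing one backbone and one output head across both token types is what lets understanding and generation data be mixed in a single training recipe; the trade-off the paper targets shows up when the two objectives pull a single tokenizer in different directions, which the hybrid design avoids.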