MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

📅 2025-07-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current text-to-image models face challenges in fine-grained visual control, balanced multimodal input integration, and complex scene generation, often relying on computationally expensive cross-attention mechanisms or auxiliary adapters. This work proposes MENTOR: an efficient, autoregressive, multimodal image generation framework that requires no auxiliary modules. MENTOR achieves end-to-end controllable generation via a two-stage training strategyโ€”first jointly optimizing pixel-level and semantic-level cross-modal alignment, then performing multimodal instruction tuning. This design significantly improves concept preservation, prompt adherence, image reconstruction fidelity, and training efficiency. On DreamBench++, MENTOR consistently outperforms all baselines across diverse evaluation tasks and demonstrates strong generalization in cross-task transfer. By eliminating architectural overhead while enhancing controllability and multimodal synergy, MENTOR establishes a novel paradigm for lightweight, controllable, and cooperative multimodal generation.

📝 Abstract
Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR
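The abstract's two-stage paradigm can be illustrated with a minimal scheduling sketch. This is a hypothetical outline, not the authors' code: stage names, objective labels, and the epoch-based switch are assumptions made for illustration; only the stage ordering (alignment first, then instruction tuning) comes from the paper.

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    name: str
    objectives: tuple  # training objectives active in this stage (hypothetical labels)

# Hypothetical schedule mirroring the paper's description: stage 1 establishes
# pixel- and semantic-level alignment between multimodal inputs and image
# tokens; stage 2 performs multimodal instruction tuning for controllability.
SCHEDULE = (
    StageConfig(
        name="multimodal_alignment",
        objectives=("pixel_level_alignment", "semantic_level_alignment"),
    ),
    StageConfig(
        name="multimodal_instruction_tuning",
        objectives=("instruction_following",),
    ),
)

def stage_for_epoch(epoch: int, stage1_epochs: int) -> StageConfig:
    """Select the active training stage from an assumed epoch budget."""
    return SCHEDULE[0] if epoch < stage1_epochs else SCHEDULE[1]
```

The point of the sketch is that both stages train the same end-to-end model; only the active objectives change between stages.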
Problem

Research questions and friction points this paper is trying to address.

Achieving precise visual control in image generation
Balancing multimodal inputs for well-aligned outputs
Reducing the training cost of complex multimodal generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

AR framework for multimodal image generation
Two-stage training for fine-grained alignment
No auxiliary adapters or cross-attention modules
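The third point, dropping auxiliary adapters and cross-attention, can be sketched as plain prefix conditioning: a decoder-only AR model conditions on multimodal inputs simply by prepending their tokens to the image-token sequence, so causal self-attention alone carries the conditioning. The control tokens and sequence layout below are illustrative assumptions, not the paper's actual tokenization.

```python
def build_ar_sequence(text_tokens, image_cond_tokens, image_gen_tokens):
    """Assemble one decoder-only sequence for conditional image generation.

    With all conditions in the prefix, every generated image token attends
    to them through ordinary causal self-attention, with no cross-attention
    module or adapter. BOS/SEP are hypothetical control-token IDs.
    """
    BOS, SEP = -1, -2
    return [BOS, *text_tokens, SEP, *image_cond_tokens, SEP, *image_gen_tokens]
```

At inference, the model would autoregressively extend the sequence past the last separator, emitting image tokens that a decoder maps back to pixels.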