Semantic Generative Tuning for Unified Multimodal Models

2026-05-18
Citations: 0
Influential: 0

AI Summary
This work addresses the misalignment between visual understanding and visual generation in unified multimodal models, which arises from disjoint training objectives and prevents the two capabilities from reinforcing each other. To bridge this gap, the authors propose Semantic Generative Tuning (SGT), a generative post-training paradigm that uses image segmentation, a high-level semantic task, as a proxy to jointly align and optimize both capabilities within a single architecture. By using structured semantic signals to guide vision-language attention allocation, SGT improves the linear separability of features and raises both visual understanding accuracy and generative layout fidelity across mainstream multimodal benchmarks.
Abstract
Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.
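The abstract does not say how feature linear separability is measured. One common way to quantify it is a linear probe: freeze the features, fit a linear classifier on them, and report held-out accuracy. The sketch below is a minimal illustration under that assumption, not the authors' evaluation code. The function names (`linear_probe_accuracy`, `make_toy_features`) are hypothetical, and the two-class Gaussian data is synthetic and stands in for real model features.

```python
import numpy as np


def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularised linear classifier on frozen features.

    Returns held-out accuracy; higher accuracy means the classes are
    easier to separate with a linear boundary.
    """
    def add_bias(x):
        # Append a constant column so the probe can learn an offset.
        return np.hstack([x, np.ones((x.shape[0], 1))])

    xb = add_bias(train_x)
    targets = np.eye(num_classes)[train_y]  # one-hot regression targets
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    weights = np.linalg.solve(gram, xb.T @ targets)  # closed-form ridge solution
    preds = (add_bias(test_x) @ weights).argmax(axis=1)
    return float((preds == test_y).mean())


def make_toy_features(rng, n, class_gap):
    """Synthetic 2-D 'features' (assumption: a stand-in for real model activations).

    Two unit-variance Gaussian clusters whose centres are 2 * class_gap apart.
    """
    labels = rng.integers(0, 2, size=n)
    centers = np.array([[-class_gap, 0.0], [class_gap, 0.0]])
    feats = centers[labels] + rng.normal(size=(n, 2))
    return feats, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for name, gap in [("well separated", 3.0), ("entangled", 0.1)]:
        train_x, train_y = make_toy_features(rng, 2000, gap)
        test_x, test_y = make_toy_features(rng, 500, gap)
        acc = linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes=2)
        print(f"{name}: probe accuracy = {acc:.3f}")
```

In practice the same probe would be run on features extracted from a model before and after post-training, using identical data splits, so that any accuracy difference reflects the features and not the probe.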
Problem

Research questions and friction points this paper is trying to address.

unified multimodal models
visual understanding
visual generation
representation alignment
generative post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Generative Tuning
Unified Multimodal Models
Generative Post-Training
Image Segmentation as Proxy
Visual-Textual Alignment