Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses insufficient cross-modal alignment in unified multimodal generative models, a problem often caused by sparse textual prompts and redundant supervision signals. To this end, the authors propose the Semantically-Grounded Supervision (SeGroS) framework, which introduces a dual-path supervision mechanism driven by visual grounding maps. One path generates semantic visual hints to enrich the semantic representation of textually sparse regions; the other constructs semantically anchored corrupted inputs that steer the masked-reconstruction loss toward text-aligned critical areas. Together, the two paths mitigate both semantic sparsity and supervisory redundancy, improving generation fidelity and cross-modal alignment. Extensive experiments on GenEval, DPGBench, and CompBench demonstrate the method's generality and superiority across diverse unified multimodal architectures.
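The paper does not include code here, so the following is a minimal sketch of the first path only, under the assumption that the visual grounding map is a per-pixel score in [0, 1] indicating how strongly a pixel is tied to the text prompt. The function name, threshold, and bounding-box cropping strategy are illustrative, not the authors' implementation.

```python
import numpy as np

def extract_visual_hint(image, grounding_map, threshold=0.5):
    """Crop the tightest box around text-grounded pixels as a visual hint.

    image:         (H, W, C) array.
    grounding_map: (H, W) array of scores in [0, 1] (assumed format).
    Returns the cropped region, or None if no pixel clears the threshold.
    """
    ys, xs = np.where(grounding_map >= threshold)
    if ys.size == 0:
        return None  # nothing grounded; no hint to provide
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return image[y0:y1, x0:x1]
```

A real system would feed crops like this back to the model as auxiliary visual prompts alongside the sparse text; the sketch only shows how a grounding map can localize the hint.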

📝 Abstract
Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve the granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
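To make the second supervision signal concrete, here is a minimal sketch, not the authors' implementation, of restricting a masked-reconstruction loss to text-aligned regions. It assumes the grounding map is a per-pixel [0, 1] score; the thresholding, L2 loss, and empty-mask fallback are illustrative choices.

```python
import numpy as np

def grounded_reconstruction_loss(pred, target, grounding_map, threshold=0.5):
    """Mean squared reconstruction error over text-grounded pixels only.

    pred, target:  (H, W, C) arrays.
    grounding_map: (H, W) array of scores in [0, 1] (assumed format).
    """
    mask = (grounding_map >= threshold).astype(pred.dtype)  # 1 on core regions
    per_pixel = ((pred - target) ** 2).mean(axis=-1)        # L2 over channels
    denom = mask.sum()
    if denom == 0:
        return per_pixel.mean()  # no grounded pixels: plain reconstruction loss
    return (per_pixel * mask).sum() / denom                 # grounded pixels only
```

The point of the restriction is that reconstruction error on background pixels (redundant supervision, in the paper's terms) no longer contributes gradient, so training focuses on the regions the text actually describes.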
Problem

Research questions and friction points this paper is trying to address.

Unified Multimodal Models
cross-modal alignment
supervisory redundancy
granularity mismatch
generation fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantically-Grounded Supervision
Visual Grounding Map
Unified Multimodal Models
Visual Hints
Corrupted Input