🤖 AI Summary
This work addresses the challenge of insufficient cross-modal alignment in unified multimodal generative models, often caused by sparse textual prompts and redundant supervision signals. To this end, we propose the Semantically-Grounded Supervision (SeGroS) framework, which introduces, for the first time, a dual-path supervision mechanism driven by visual grounding maps. One path generates semantic visual prompts to enrich semantic representation in textually sparse regions, while the other constructs semantically anchored corrupted inputs to steer the masked reconstruction loss toward text-aligned critical regions. This approach mitigates both semantic sparsity and supervisory redundancy, substantially improving generation fidelity and cross-modal alignment. Extensive experiments on GenEval, DPGBench, and CompBench demonstrate the method's generality and superiority across diverse unified multimodal architectures.
📝 Abstract
Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a single modeling framework. However, current generative training paradigms suffer from two inherent limitations: a granularity mismatch between sparse text prompts and dense visual targets, and redundant supervision over semantically unimportant regions. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve this granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map used to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input that explicitly strengthens the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
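To make the second supervision signal concrete, the core idea of restricting a reconstruction loss to text-aligned regions can be sketched as a grounding-weighted masked loss. The function and variable names below (`grounded_reconstruction_loss`, `grounding_map`, the 0.5 threshold) are illustrative assumptions, not the paper's actual implementation:

```python
def grounded_reconstruction_loss(pred, target, grounding_map, threshold=0.5):
    """Illustrative sketch (not the paper's code): average a squared-error
    reconstruction loss only over positions whose grounding-map score marks
    them as text-aligned (>= threshold), ignoring redundant background."""
    total, count = 0.0, 0
    for p, t, g in zip(pred, target, grounding_map):
        if g >= threshold:           # keep only text-aligned positions
            total += (p - t) ** 2    # per-position reconstruction error
            count += 1
    return total / count if count else 0.0
```

In this toy form, positions with a low grounding score contribute nothing to the loss, so gradient signal concentrates on the regions the grounding map ties to the prompt; a real implementation would operate on token or pixel tensors rather than flat lists.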