🤖 AI Summary
This work addresses semantic drift in unified multimodal models during iterative image-to-text (I2T) and text-to-image (T2I) cross-modal reasoning. We propose UCF-UM, a systematic evaluation framework for this phenomenon. UCF-UM employs multi-round cross-modal generation cycles and introduces three metrics—Mean Cumulative Drift (MCD), Semantic Drift Rate (SDR), and Multi-Generation GenEval (MGG)—to quantify semantic consistency. To enable robust assessment beyond standard training distributions, we construct ND400, a non-COCO benchmark sampled from NoCaps and DOCCI. Methodologically, UCF-UM combines embedding-space similarity computation with an object-level fidelity extension of GenEval. Experiments reveal substantial disparities across models: BAGEL exhibits strong cycle stability, whereas models that excel only unidirectionally (e.g., Vila-u) suffer rapid semantic drift. Our findings establish cyclic consistency as a critical axis for evaluating unified multimodal models, offering a new paradigm for assessing their reliability and trustworthiness.
📝 Abstract
Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair, T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF-UM formulates three metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), which summarizes the rate of semantic decay; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark, ND400, sampled from NoCaps and DOCCI, and evaluate seven recent models. UCF-UM reveals substantial variation in cross-modal stability: some models, like BAGEL, maintain semantics over many alternations, whereas others, like Vila-u, drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics for consistently assessing unified models' cross-modal stability and the strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
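The cyclic protocol can be sketched in a few lines: embed the starting reference and every artifact produced along the I2T/T2I alternation chain, then score how far each generation has moved from the start. The snippet below is a minimal illustration of that idea only; the helper names and the exact formulas for `mean_cumulative_drift` and `semantic_drift_rate` are our assumptions for exposition, not the paper's official implementation (which operates on real image/text encoder embeddings).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cumulative_drift(ref_embedding, cycle_embeddings):
    """Assumed form of MCD: average semantic distance (1 - cosine)
    of each generation in the cycle from the starting reference."""
    drifts = [1.0 - cosine_similarity(ref_embedding, e)
              for e in cycle_embeddings]
    return sum(drifts) / len(drifts)

def semantic_drift_rate(ref_embedding, cycle_embeddings):
    """Assumed proxy for SDR: drift of the final generation,
    normalized by the number of alternation steps."""
    final_drift = 1.0 - cosine_similarity(ref_embedding,
                                          cycle_embeddings[-1])
    return final_drift / len(cycle_embeddings)

# Toy example: a reference embedding and three generations that
# drift progressively away from it.
ref = [1.0, 0.0]
cycle = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(mean_cumulative_drift(ref, cycle))  # 0.4
print(semantic_drift_rate(ref, cycle))    # ~0.333
```

In a real run, `cycle_embeddings` would come from encoding each caption and regenerated image (e.g., with a shared vision-language encoder) as the model alternates I2T and T2I; a stable model keeps the drift curve flat, while a drifting model shows it growing with each cycle.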