🤖 AI Summary
This work addresses semantic drift in unified multimodal models during iterative image-to-text (I2T) and text-to-image (T2I) cross-modal reasoning. We propose UCF-UM, a systematic evaluation framework for this phenomenon. UCF-UM employs multi-round cross-modal generation cycles and introduces three metrics—Mean Cumulative Drift (MCD), Semantic Drift Rate (SDR), and Multi-Generation GenEval (MGG)—to quantify semantic consistency. To enable robust assessment beyond standard training distributions, we construct ND400, a non-COCO benchmark sampled from NoCaps and DOCCI. Methodologically, UCF-UM combines embedding-space similarity computation with an object-level fidelity extension of GenEval. Experiments reveal substantial disparities across models: BAGEL exhibits strong cycle stability, whereas models that excel only unidirectionally (e.g., Vila-u) suffer rapid semantic drift. Our findings establish cyclic consistency as a critical axis for evaluating unified multimodal models, offering a new paradigm for assessing their reliability and trustworthiness.
📝 Abstract
Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair, T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF-UM formulates three metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), which summarizes the rate of semantic decay; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark, ND400, sampled from NoCaps and DOCCI, and evaluate seven recent models. UCF-UM reveals substantial variation in cross-modal stability: some models, like BAGEL, maintain semantics over many alternations, whereas others, like Vila-u, drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics for consistently assessing unified models' cross-modal stability and the strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
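The cyclic protocol can be sketched in a few lines: embed the starting reference and every artifact produced along the I2T/T2I alternation chain, then score how far each generation has moved from the start. The snippet below is a minimal illustration of that idea only; the helper names and the exact formulas for `mean_cumulative_drift` and `semantic_drift_rate` are our assumptions for exposition, not the paper's official implementation (which operates on real image/text encoder embeddings).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cumulative_drift(ref_embedding, cycle_embeddings):
    """Assumed form of MCD: average semantic distance (1 - cosine)
    of each generation in the cycle from the starting reference."""
    drifts = [1.0 - cosine_similarity(ref_embedding, e)
              for e in cycle_embeddings]
    return sum(drifts) / len(drifts)

def semantic_drift_rate(ref_embedding, cycle_embeddings):
    """Assumed proxy for SDR: drift of the final generation,
    normalized by the number of alternation steps."""
    final_drift = 1.0 - cosine_similarity(ref_embedding,
                                          cycle_embeddings[-1])
    return final_drift / len(cycle_embeddings)

# Toy example: a reference embedding and three generations that
# drift progressively away from it.
ref = [1.0, 0.0]
cycle = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(mean_cumulative_drift(ref, cycle))  # 0.4
print(semantic_drift_rate(ref, cycle))    # ~0.333
```

In a real run, `cycle_embeddings` would come from encoding each caption and regenerated image (e.g., with a shared vision-language encoder) as the model alternates I2T and T2I; a stable model keeps the drift curve flat, while a drifting model shows it growing with each cycle.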