🤖 AI Summary
Existing generative cross-modal retrieval methods rely on handcrafted IDs, clustering labels, or atomic identifiers that require vocabulary expansion, and they struggle to achieve semantic alignment and scalability at the same time. This paper proposes the Structured Semantic Identifier (SSI) framework, reformulating cross-modal retrieval as a generation task driven by a multimodal large language model (MLLM). It employs prompt learning to generate concept-level, structured textual identifiers and introduces Rationale-Guided Supervision (RGS), which uses one-sentence explanations as auxiliary supervision signals to explicitly model image-text semantic correspondences and mitigate hallucination, all without vocabulary expansion. Evaluated on multiple standard benchmarks, the method significantly improves accuracy and robustness in text-to-image retrieval, demonstrating the effectiveness and scalability of the generative paradigm for cross-modal semantic alignment.
📝 Abstract
Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability. To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision strategy that prompts the model to produce a one-sentence explanation alongside each identifier; this explanation serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
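To make the identifier format concrete, the sketch below shows one plausible way concept-level tokens could be assembled into a structured textual identifier. This is an illustration only: the paper prompts an MLLM to produce the concepts, whereas here the extraction step is stubbed with a fixed dictionary, and the field names (`objects`, `actions`, `attributes`) and the `|` separator are assumptions, not the paper's actual format.

```python
# Hypothetical sketch: forming a Structured Semantic Identifier (SSI)
# from concept-level tokens. In the proposed framework an MLLM is
# prompted to extract these concepts from an image-caption pair; the
# fixed dict below stands in for that step.

def build_ssi(concepts: dict) -> str:
    """Join concept-level tokens into a structured textual identifier.

    Because the identifier is plain natural-language tokens, it lies
    inside the model's existing generation space and needs no
    tokenizer or vocabulary changes.
    """
    fields = ["objects", "actions", "attributes"]  # assumed field order
    parts = []
    for field in fields:
        tokens = concepts.get(field, [])
        if tokens:
            parts.append(" ".join(tokens))
    return " | ".join(parts)

# Stand-in for MLLM concept extraction on one image-caption pair.
concepts = {
    "objects": ["dog", "frisbee"],
    "actions": ["jumping"],
    "attributes": ["brown", "grassy field"],
}

print(build_ssi(concepts))  # -> "dog frisbee | jumping | brown grassy field"
```

At retrieval time, the model would generate such a string in response to a text query, and the generated identifier is matched against the identifiers indexed for the image collection.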