🤖 AI Summary
Existing generative cross-modal retrieval methods rely on handcrafted IDs, clustering labels, or atomic identifiers that require vocabulary expansion, and they struggle to achieve semantic alignment and scalability at the same time. This paper proposes the Structured Semantic Identifier (SSI) framework, reformulating cross-modal retrieval as a generation task driven by a multimodal large language model (MLLM). It employs prompt learning to generate concept-level, structured textual identifiers and introduces Rationale-Guided Supervision (RGS), which uses one-sentence explanations as auxiliary supervision signals to explicitly model image-text semantic correspondences and mitigate hallucination, all without vocabulary expansion. Evaluated on multiple standard benchmarks, the method significantly improves accuracy and robustness in text-to-image retrieval, demonstrating the effectiveness and scalability of the generative paradigm for cross-modal semantic alignment.
📝 Abstract
Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability. To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision strategy that prompts the model to produce a one-sentence explanation alongside each identifier; this explanation serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
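To make the identifier format concrete, the sketch below shows one plausible way concept-level tokens could be assembled into a structured textual identifier. This is an illustration only: the paper prompts an MLLM to produce the concepts, whereas here the extraction step is stubbed with a fixed dictionary, and the field names (`objects`, `actions`, `attributes`) and the `|` separator are assumptions, not the paper's actual format.

```python
# Hypothetical sketch: forming a Structured Semantic Identifier (SSI)
# from concept-level tokens. In the proposed framework an MLLM is
# prompted to extract these concepts from an image-caption pair; the
# fixed dict below stands in for that step.

def build_ssi(concepts: dict) -> str:
    """Join concept-level tokens into a structured textual identifier.

    Because the identifier is plain natural-language tokens, it lies
    inside the model's existing generation space and needs no
    tokenizer or vocabulary changes.
    """
    fields = ["objects", "actions", "attributes"]  # assumed field order
    parts = []
    for field in fields:
        tokens = concepts.get(field, [])
        if tokens:
            parts.append(" ".join(tokens))
    return " | ".join(parts)

# Stand-in for MLLM concept extraction on one image-caption pair.
concepts = {
    "objects": ["dog", "frisbee"],
    "actions": ["jumping"],
    "attributes": ["brown", "grassy field"],
}

print(build_ssi(concepts))  # -> "dog frisbee | jumping | brown grassy field"
```

At retrieval time, the model would generate such a string in response to a text query, and the generated identifier is matched against the identifiers indexed for the image collection.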