MLLM-Driven Semantic Identifier Generation for Generative Cross-Modal Retrieval

📅 2025-09-22
🤖 AI Summary
Existing generative cross-modal retrieval methods rely on handcrafted IDs, clustering labels, or atomic identifiers that require vocabulary expansion, and they struggle to achieve semantic alignment and scalability at the same time. This paper proposes a Structured Semantic Identifier (SSI) framework that reformulates cross-modal retrieval as a generation task driven by a multimodal large language model (MLLM). The MLLM is prompted to generate concept-level, structured textual identifiers, and a Rationale-Guided Supervision (RGS) strategy uses one-sentence explanations as an auxiliary supervision signal to model image-text semantic correspondences explicitly and mitigate hallucination, all without vocabulary expansion. Evaluated on multiple standard benchmarks, the method is reported to improve accuracy and robustness in text-to-image retrieval, supporting the effectiveness and scalability of the generative paradigm for cross-modal semantic alignment.

📝 Abstract
Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability. To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy that prompts the model to produce a one-sentence explanation alongside each identifier; this explanation serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
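To make the idea of a concept-level identifier with an accompanying rationale concrete, the following is a minimal sketch and not the authors' implementation. The prompt wording, the `Identifier:` / `Rationale:` response format, the `|` separator, and every name in the code (`PROMPT_TEMPLATE`, `LabeledImage`, `build_prompt`, `parse_response`, `identifier_string`) are assumptions introduced for illustration; the abstract does not specify any of them. No model is called here: the sketch only shows how a response in the assumed format could be turned into a training target.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical prompt; the paper's actual prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "Caption: {caption}\n"
    "Name the key concepts (objects, actions) shown in the image, "
    "then justify them in one sentence.\n"
    "Identifier: <concept> | <concept> | ...\n"
    "Rationale: <one sentence>"
)


@dataclass(frozen=True)
class LabeledImage:
    """Identifier concepts plus the one-sentence rationale (assumed layout)."""
    identifier: Tuple[str, ...]
    rationale: str


def build_prompt(caption: str) -> str:
    """Fill the hypothetical template with an image caption."""
    return PROMPT_TEMPLATE.format(caption=caption)


def parse_response(text: str) -> LabeledImage:
    """Parse a response written in the assumed 'Identifier:/Rationale:' format."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # keep only 'key: value' lines
            fields[key.strip().lower()] = value.strip()
    concepts = tuple(
        part.strip().lower()
        for part in fields.get("identifier", "").split("|")
        if part.strip()
    )
    if not concepts:
        raise ValueError("response contains no identifier")
    return LabeledImage(concepts, fields.get("rationale", ""))


def identifier_string(item: LabeledImage) -> str:
    """Plain-text identifier built from ordinary words, so no new tokens are needed."""
    return " | ".join(item.identifier)
```

In this sketch the identifier is ordinary text, which is one way to read the abstract's point that concept-level tokens fit the model's existing generation space without changes to the tokenizer. The rationale would be kept alongside the identifier as an extra training target rather than used at retrieval time.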
Problem

Research questions and friction points this paper is trying to address.

Improving semantic alignment in generative cross-modal retrieval systems
Addressing scalability limitations of existing identifier generation methods
Reducing hallucinations during training through better supervision strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates structured semantic identifiers using MLLMs
Uses concept-level tokens for natural alignment
Introduces rationale-guided supervision to reduce hallucinations
Tianyuan Li
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011 and University of Chinese Academy of Sciences, No. 19(A) Yuquan Road, Shijingshan, Beijing, China, 100049, and Xinjiang Laboratory of Minority Speech and Language Information Processing, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011
Lei Wang
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011 and University of Chinese Academy of Sciences, No. 19(A) Yuquan Road, Shijingshan, Beijing, China, 100049, and Xinjiang Laboratory of Minority Speech and Language Information Processing, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011
Ahtamjan Ahmat
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011 and University of Chinese Academy of Sciences, No. 19(A) Yuquan Road, Shijingshan, Beijing, China, 100049, and Xinjiang Laboratory of Minority Speech and Language Information Processing, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011
Yating Yang
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011 and University of Chinese Academy of Sciences, No. 19(A) Yuquan Road, Shijingshan, Beijing, China, 100049, and Xinjiang Laboratory of Minority Speech and Language Information Processing, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011
Bo Ma
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011 and University of Chinese Academy of Sciences, No. 19(A) Yuquan Road, Shijingshan, Beijing, China, 100049, and Xinjiang Laboratory of Minority Speech and Language Information Processing, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011
Rui Dong
Ph.D. candidate, University of Michigan
program synthesis, formal methods, program verification
Bangju Han
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011 and University of Chinese Academy of Sciences, No. 19(A) Yuquan Road, Shijingshan, Beijing, China, 100049, and Xinjiang Laboratory of Minority Speech and Language Information Processing, 40-1 Beijing Road, Urumqi, Xinjiang, China, 830011