🤖 AI Summary
This paper addresses two key challenges in multilingual generative retrieval: cross-lingual identifier misalignment and identifier inflation. It proposes a cross-lingual semantic compression framework that aligns semantically equivalent multilingual keywords by mapping them to shared atoms, and it introduces a dynamic multi-step constrained decoding strategy so that semantic consistency and decoding efficiency are handled within a single compressed identifier space. According to the summary, this work is the first to deeply integrate cross-lingual semantic alignment with identifier-space compression, which is intended to improve retrieval robustness and generation controllability. On the mMARCO-100k and mNQ-320k benchmarks, the method improves retrieval accuracy by 6.83% and 4.77%, respectively, while reducing document identifier length by 74.51% and 78.2%. These results indicate gains in accuracy, efficiency, and compactness.
📝 Abstract
Generative Information Retrieval is an emerging retrieval paradigm that performs remarkably well in monolingual scenarios. However, applying these methods to multilingual retrieval still faces two primary challenges: cross-lingual identifier misalignment and identifier inflation. To address these limitations, we propose Multilingual Generative Retrieval via Cross-lingual Semantic Compression (MGR-CSC), a novel framework that unifies semantically equivalent multilingual keywords into shared atoms, which aligns semantics and compresses the identifier space, together with a dynamic multi-step constrained decoding strategy used during retrieval. MGR-CSC improves cross-lingual alignment by assigning consistent identifiers and enhances decoding efficiency by reducing redundancy. Experiments demonstrate that MGR-CSC achieves outstanding retrieval accuracy, improving by 6.83% on mMarco100k and 4.77% on mNQ320k, while reducing document identifier length by 74.51% and 78.2%, respectively.
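The two ideas in the abstract, shared atoms and constrained decoding, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the keyword clusters are hand-written, the helper names (`build_atom_vocab`, `to_identifier`, `build_trie`, `allowed_next`) are hypothetical, and the decoding constraint is a plain prefix trie rather than the paper's dynamic multi-step strategy, whose details the abstract does not give.

```python
# Toy illustration only: clusters are hand-written, not learned as in MGR-CSC.
CLUSTERS = [
    {"cat", "gato", "chat", "Katze"},
    {"food", "comida", "nourriture", "Futter"},
    {"health", "salud", "santé", "Gesundheit"},
]

def build_atom_vocab(clusters):
    """Map every keyword to the index of its cluster (its shared atom)."""
    return {kw: atom_id for atom_id, cluster in enumerate(clusters) for kw in cluster}

def to_identifier(keywords, vocab):
    """Turn a document's keywords into a tuple of atom ids."""
    return tuple(vocab[kw] for kw in keywords)

def build_trie(identifiers):
    """Prefix trie over valid identifiers, used to restrict each decoding step."""
    trie = {}
    for ident in identifiers:
        node = trie
        for atom in ident:
            node = node.setdefault(atom, {})
    return trie

def allowed_next(trie, prefix):
    """Atoms that may follow `prefix` while staying on a valid identifier path."""
    node = trie
    for atom in prefix:
        node = node.get(atom)
        if node is None:
            return set()
    return set(node)

vocab = build_atom_vocab(CLUSTERS)
doc_en = to_identifier(["cat", "food"], vocab)
doc_es = to_identifier(["gato", "comida"], vocab)
doc_de = to_identifier(["Katze", "Gesundheit"], vocab)
trie = build_trie([doc_en, doc_es, doc_de])
```

In this sketch the English and Spanish documents receive the same identifier because their keywords fall into the same clusters, which is the alignment effect described above. The vocabulary also shrinks from twelve keywords to three atoms, a toy analogue of identifier-space compression. At generation time, a decoder would only be allowed to emit atoms returned by `allowed_next` for the prefix produced so far.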