Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM-based embedding methods predominantly adopt encoder-only architectures, treating models as static feature extractors and thus struggling to capture deep semantic structure. Method: This paper proposes GIRCSE, the first framework to integrate autoregressive generation with iterative contrastive optimization for text embedding. It generates sequences of soft tokens and progressively refines the semantic representation across iterations, addressing the limits of purely implicit semantic modeling. The method introduces an iterative contrastive learning objective and identifies an emergent test-time scaling phenomenon: generating more tokens at inference consistently improves embedding quality. Contribution/Results: GIRCSE achieves significant gains over state-of-the-art encoder-based baselines on the MTEB benchmark and on instruction-following tasks, demonstrating the effectiveness, scalability, and generalization ability of the generative embedding paradigm.

📝 Abstract
Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under a contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.
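The refinement loop the abstract describes can be illustrated with a small toy sketch in plain Python. This is not the GIRCSE implementation: here "soft tokens" are just vectors blended into a running embedding, and the per-step contrastive term only mirrors the idea that each refinement step should score the positive higher. The names `refine`, `embed_trajectory`, `icr_loss`, and the mixing weight `alpha` are assumptions of this sketch, not the paper's API.

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors given as lists of floats
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def refine(embedding, soft_token, alpha=0.5):
    # one refinement step: blend the newly generated soft token
    # into the running representation (alpha is a toy mixing weight)
    return [(1 - alpha) * e + alpha * t for e, t in zip(embedding, soft_token)]

def embed_trajectory(initial, soft_tokens):
    # run the iterative refinement loop, keeping the embedding after each
    # step; generating more soft tokens means more refinement steps,
    # which is the test-time scaling knob the paper highlights
    emb, trajectory = initial, [initial]
    for tok in soft_tokens:
        emb = refine(emb, tok)
        trajectory.append(emb)
    return trajectory

def icr_loss(trajectory, positive, negatives, tau=0.1):
    # toy iterative contrastive objective: an InfoNCE-style term applied
    # at every refinement step, so each step is pushed toward the positive
    total = 0.0
    for emb in trajectory:
        pos = math.exp(cosine(emb, positive) / tau)
        denom = pos + sum(math.exp(cosine(emb, n) / tau) for n in negatives)
        total += -math.log(pos / denom)
    return total / len(trajectory)
```

With an anchor that starts far from its positive and soft tokens that point toward it, similarity to the positive rises monotonically as more tokens are consumed, loosely mimicking the test-time scaling behavior the paper reports.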
Problem

Research questions and friction points this paper is trying to address.

Generating text embeddings via iterative contrastive refinement
Capturing latent concepts missed by encoder-only LLM methods
Improving embedding quality through autoregressive soft-token generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative iterative refinement for contrastive sentence embeddings
Autoregressive generation of soft token sequences
Iterative contrastive refinement objective for representation learning
Yu-Che Tsai
Department of Computer Science, National Taiwan University, Taipei, Taiwan
Kuan-Yu Chen
Department of Computer Science, National Taiwan University, Taipei, Taiwan
Yuan-Chi Li
Department of Computer Science, National Taiwan University, Taipei, Taiwan
Yuan-Hao Chen
Department of Computer Science, National Taiwan University, Taipei, Taiwan
Ching-Yu Tsai
Department of Computer Science, National Taiwan University, Taipei, Taiwan
Shou-De Lin
National Taiwan University
AI · machine learning · natural language processing