🤖 AI Summary
This work addresses the inefficiency and poor transferability of existing large language model (LLM) text embedding methods, which require full retraining whenever the backbone architecture changes. The authors propose PromptEmbedder, a dual-LLM architecture that decouples embedding knowledge from backbone weights: task-specific knowledge is encoded in a dedicated Prompting LLM, which generates soft prompts through a differentiable process with continuous relaxation. Because the Embedding LLM stays frozen, adapting to a new backbone requires retraining only a lightweight linear alignment matrix. On the MTEB benchmark, PromptEmbedder performs comparably to LoRA fine-tuning while reducing GPU memory usage by 40% and accelerating training by 3.7×, which improves scalability across architectures.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder uses a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. Because task-specific knowledge is localized within the Prompting LLM, adapting to a new architecture requires retraining only a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves performance comparable to LoRA fine-tuning while reducing GPU memory usage by 40% and accelerating training by 3.7×. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.
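The abstract does not spell out the relaxation itself, so the following is only a minimal sketch of one common way such a step could work, not the authors' implementation. It assumes the relaxation is a temperature-scaled softmax over the Prompting LLM's next-token logits, that the soft token is the probability-weighted mix of vocabulary embeddings, and that the alignment matrix is a plain linear map into the Embedding LLM's input space. All names (`relaxed_token`, `align`, `vocab_embeddings`, `alignment`) and the toy sizes are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_token(logits, vocab_embeddings, temperature=1.0):
    """Assumed continuous relaxation: rather than picking the argmax token
    (which blocks gradients), return the probability-weighted average of
    the vocabulary embeddings."""
    probs = softmax(logits, temperature)
    dim = len(vocab_embeddings[0])
    return [
        sum(p * row[d] for p, row in zip(probs, vocab_embeddings))
        for d in range(dim)
    ]


def align(vector, alignment):
    """Assumed linear alignment: map a Prompting-LLM-space vector into the
    Embedding LLM's input space. `alignment` has shape (d_prompt, d_embed)."""
    d_embed = len(alignment[0])
    return [
        sum(v * alignment[i][j] for i, v in enumerate(vector))
        for j in range(d_embed)
    ]


# Toy sizes (hypothetical): 3-token vocabulary, d_prompt = 2, d_embed = 3.
vocab_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alignment = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
step_logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]  # two generation steps

# One soft-prompt vector per generation step, expressed in the Embedding
# LLM's input space and ready to be prepended to its input embeddings.
soft_prompt = [
    align(relaxed_token(logits, vocab_embeddings), alignment)
    for logits in step_logits
]
```

Under these assumptions, every operation is a smooth function of the logits, so in an autodiff framework a contrastive loss computed on the frozen Embedding LLM's output could backpropagate into the Prompting LLM. Swapping backbones would then change only the shape and values of `alignment`.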