🤖 AI Summary
This work addresses the high computational cost and sequence-length inflation incurred when using large language models (LLMs) as text encoders with conventional in-context learning, which relies on extensive textual demonstrations. The authors propose EPIC, a novel approach that, for the first time, converts discrete textual demonstrations into learnable continuous embeddings and introduces an embedding-based contextual prompt training strategy. By leveraging contrastive learning, EPIC aligns semantically related text pairs and enables the model to interpret prompts represented as embeddings. This method substantially reduces both training and inference overhead while allowing the model to generate high-quality embeddings even without explicit contextual prompts. Evaluated on the MTEB benchmark under a setting that uses only publicly available retrieval data for training, EPIC achieves state-of-the-art performance.
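To make the sequence-length argument concrete, here is a minimal arithmetic sketch comparing a text-demonstration prompt with an embedding-based one. It assumes each demonstration occupies a single embedding slot in the input, which the summary does not state explicitly. The function name, the `slots_per_demo` parameter, and the token counts are hypothetical and chosen only for illustration.

```python
def sequence_lengths(query_tokens, demo_token_counts, slots_per_demo=1):
    """Compare input lengths for text-based vs. embedding-based demonstrations.

    Assumption (not from the paper): each demonstration is replaced by
    `slots_per_demo` continuous vectors in the model input.
    """
    # Conventional ICL: every demonstration token is part of the sequence.
    text_icl = query_tokens + sum(demo_token_counts)
    # Embedding-based prompt: each demonstration costs a fixed number of slots.
    embedding_icl = query_tokens + slots_per_demo * len(demo_token_counts)
    return text_icl, embedding_icl


# Hypothetical example: a 32-token query with three demonstrations.
text_len, emb_len = sequence_lengths(32, [120, 95, 140])
print(text_len, emb_len)  # 387 35
```

Under these assumed numbers the input shrinks from 387 positions to 35. The actual saving depends on demonstration length and on how many slots each demonstration really uses.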
📝 Abstract
Large language models (LLMs) have been widely explored for embedding generation. While recent studies show that in-context learning (ICL) effectively enhances the representational capability of LLMs by prepending a few task-related demonstrations, it causes substantial token overhead due to the increased sequence length. In this work, we propose EPIC, a novel embedding-based in-context prompt training strategy that leverages ICL to generate high-quality embeddings while reducing the computational burden during both training and inference. This approach replaces discrete text demonstrations with their corresponding continuous embeddings, which not only encourages the LLM to align semantically related text pairs during contrastive learning, but also requires the model to interpret demonstration embeddings as part of the in-context prompt. Consequently, EPIC-trained models achieve excellent embedding performance both with and without in-context prompts at inference time. Comprehensive experiments demonstrate that our method establishes new state-of-the-art results on the MTEB benchmark, surpassing frontier models trained solely on publicly available retrieval data. Extensive ablation studies further validate the effectiveness and necessity of our mechanism.
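The two ingredients named in the abstract, a prompt built from demonstration embeddings and a contrastive objective over text pairs, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. It assumes a standard in-batch InfoNCE loss with cosine similarity, and the temperature of 0.05 is an assumed value; the abstract specifies neither. The helper names `build_input` and `info_nce` are hypothetical, and the LLM encoder that would turn the assembled input into a final embedding is deliberately left out.

```python
import math


def build_input(demo_embeddings, token_embeddings):
    """Prepend demonstration embeddings to the token embeddings of the text.

    The result is the vector sequence an encoder would consume in place of
    a long textual prompt (the encoder itself is outside this sketch).
    """
    return list(demo_embeddings) + list(token_embeddings)


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """In-batch contrastive loss: doc_vecs[i] is the positive for query_vecs[i].

    All other documents in the batch act as negatives. The temperature
    value is an assumption made for this sketch.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy final embeddings: query i is close to document i.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(queries, docs)
shuffled_loss = info_nce(queries, docs[::-1])  # positives deliberately mismatched
print(aligned_loss < shuffled_loss)  # True
```

The loss is small when each query embedding sits closest to its paired document and large when the pairs are mismatched, which is the alignment pressure the abstract describes. In this sketch the demonstration embeddings only change how the input is assembled; how EPIC trains the model to interpret them is specified in the paper, not here.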