🤖 AI Summary
To address the low efficiency and poor training stability of large language models (LLMs) in domain adaptation under low-resource settings, this paper proposes MHA-RAG, an architecture that encodes retrieved exemplars as learnable soft prompts and uses multi-head attention to aggregate them in an order-invariant way, producing a unified prompt that does not depend on exemplar ordering. By combining retrieval-augmented generation (RAG) with soft-prompt optimization, MHA-RAG improves generalization while remaining parameter-efficient. Experimentally, MHA-RAG gains roughly 20 points over standard RAG across multiple question-answering benchmarks while cutting inference GFLOPs by an order of magnitude, delivering better efficiency, accuracy, and training stability, which is particularly valuable in resource-constrained domain-adaptation scenarios.
📝 Abstract
Adapting Foundation Models to new domains with limited training data is challenging and computationally expensive. While prior work has demonstrated the effectiveness of using domain-specific exemplars as in-context demonstrations, we investigate whether representing exemplars purely as text is the most efficient, effective, and stable approach. We explore an alternative: representing exemplars as soft prompts with an exemplar-order-invariant model architecture. To this end, we introduce Multi-Head Attention Retrieval-Augmented Generation (MHA-RAG), a framework in which the number of attention heads serves as a simple hyperparameter to control soft-prompt generation across different tasks. Across multiple question-answering benchmarks and model scales, MHA-RAG achieves a 20-point performance gain over standard RAG while cutting inference GFLOPs by 10x, delivering both higher accuracy and greater efficiency, invariant to exemplar order.
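The order-invariance claim follows from a general property of attention pooling: with fixed learned queries, attention over a set of exemplar embeddings is a softmax-weighted sum, so permuting the exemplars permutes scores and values together and leaves the output unchanged. The sketch below illustrates this with a toy, stdlib-only model; the query vectors, dimensions, and function names are illustrative assumptions, not the paper's actual MHA-RAG implementation.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(query, exemplars):
    # Weighted sum of exemplar vectors; weights come from a softmax
    # over query-exemplar dot products, so the result is a set
    # operation: exemplar order does not matter.
    weights = softmax([dot(query, e) for e in exemplars])
    dim = len(exemplars[0])
    return [sum(w * e[i] for w, e in zip(weights, exemplars)) for i in range(dim)]

def mha_soft_prompt(queries, exemplars):
    # One pooled vector per head (hypothetical learned query);
    # their concatenation would form the soft prompt.
    return [attention_pool(q, exemplars) for q in queries]

random.seed(0)
dim, n_exemplars, n_heads = 4, 3, 2
exemplars = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_exemplars)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_heads)]

prompt = mha_soft_prompt(queries, exemplars)
prompt_shuffled = mha_soft_prompt(queries, list(reversed(exemplars)))

# The pooled soft prompt is identical under any exemplar ordering.
assert all(
    abs(a - b) < 1e-12
    for head_a, head_b in zip(prompt, prompt_shuffled)
    for a, b in zip(head_a, head_b)
)
```

In this toy setup the number of heads (queries) directly controls how many pooled vectors make up the soft prompt, mirroring the abstract's description of head count as the main hyperparameter.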