Making Large Language Models Efficient Dense Retrievers

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of large language models (LLMs) in dense retrieval. We first uncover task-specific inter-layer redundancy: MLP layers exhibit substantial pruning potential, whereas attention layers are critical for retrieval performance. Leveraging this insight, we propose EffiR—a retrieval-oriented two-stage compression framework comprising coarse-grained depth pruning (removing redundant MLP layers) and fine-grained width compression (structured neuron pruning within the remaining MLPs). EffiR integrates layer-wise redundancy analysis, joint pruning, and retrieval-specific fine-tuning. Evaluated on the BEIR benchmark, EffiR achieves up to 72% parameter reduction and significant inference speedup while preserving the retrieval accuracy of the full-sized LLM. Our approach establishes a new paradigm for efficient LLM-based dense retrieval, demonstrating that task-aware architectural compression can substantially reduce computational cost without sacrificing effectiveness.
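The layer-wise redundancy analysis described above can be illustrated with a common proxy metric (the paper's exact criterion is not specified here): scoring each layer by the cosine similarity between the hidden states entering and leaving it. A layer whose output is nearly identical to its input contributes little transformation and is a pruning candidate. The sketch below uses synthetic hidden states; the function name and data are illustrative, not from the paper.

```python
import numpy as np

def layer_redundancy(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """Redundancy score for one layer: mean cosine similarity between the
    hidden states entering (h_in) and leaving (h_out) the layer, each of
    shape (num_tokens, hidden_dim). A score near 1.0 means the layer barely
    transforms its input, marking it as a candidate for depth pruning."""
    num = np.sum(h_in * h_out, axis=-1)
    denom = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return float(np.mean(num / (denom + 1e-9)))

# Toy comparison: a near-identity layer vs. one that heavily transforms input.
rng = np.random.default_rng(0)
h = rng.normal(size=(32, 64))  # 32 token states, hidden dim 64
redundant = layer_redundancy(h, h + 0.01 * rng.normal(size=h.shape))
useful = layer_redundancy(h, rng.normal(size=h.shape))
# `redundant` is close to 1.0; `useful` is close to 0.0
```

Under such a metric, coarse-grained depth pruning would remove whole MLP layers whose scores exceed a threshold, before the fine-grained width stage operates inside the survivors.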

📝 Abstract
Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs for generative tasks, it remains unclear whether similar redundancy exists when these models are adapted for retrieval tasks, which require encoding entire sequences into fixed representations rather than generating tokens iteratively. To this end, we conduct a comprehensive analysis of layer redundancy in LLM-based dense retrievers. We find that, in contrast to generative settings, MLP layers are substantially more prunable, while attention layers remain critical for semantic aggregation. Building on this insight, we propose EffiR, a framework for developing efficient retrievers that performs large-scale MLP compression through a coarse-to-fine strategy (coarse-grained depth reduction followed by fine-grained width reduction), combined with retrieval-specific fine-tuning. Across diverse BEIR datasets and LLM backbones, EffiR achieves substantial reductions in model size and inference cost while preserving the performance of full-size models.
Problem

Research questions and friction points this paper is trying to address.

Analyzes layer redundancy in LLM-based dense retrievers
Proposes a framework to compress MLP layers for efficiency
Reduces model size and cost while maintaining retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLP layer pruning for efficiency
Coarse-to-fine compression strategy
Retrieval-specific fine-tuning preserves performance
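The fine-grained width compression mentioned above can be sketched as structured neuron pruning inside a two-matrix MLP. The ranking rule here (product of each intermediate neuron's incoming and outgoing weight norms) and the helper name are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def prune_mlp_width(w_up: np.ndarray, w_down: np.ndarray, keep_ratio: float):
    """Structured width pruning of a two-layer MLP (hypothetical helper).
    w_up: (d_model, d_ff) up-projection; w_down: (d_ff, d_model) down-projection.
    Scores each intermediate neuron by the product of its incoming and outgoing
    weight norms, then keeps the top `keep_ratio` fraction, shrinking d_ff."""
    scores = np.linalg.norm(w_up, axis=0) * np.linalg.norm(w_down, axis=1)
    k = max(1, int(w_up.shape[1] * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # indices of surviving neurons
    return w_up[:, keep], w_down[keep, :]

rng = np.random.default_rng(1)
w_up, w_down = rng.normal(size=(16, 64)), rng.normal(size=(64, 16))
w_up_s, w_down_s = prune_mlp_width(w_up, w_down, keep_ratio=0.25)
# Intermediate dimension shrinks from 64 to 16; d_model is unchanged
```

Because whole neurons (matrix columns and rows) are removed rather than individual weights, the compressed MLP stays dense and yields real inference speedups without sparse kernels; retrieval-specific fine-tuning would then recover any lost accuracy.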