Matryoshka Re-Ranker: A Flexible Re-Ranking Architecture With Configurable Depth and Width

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the computational bandwidth constraints and scenario diversity challenges in deploying LLM-based re-rankers, this paper proposes a dynamically configurable re-ranking architecture that supports runtime adaptation of both layer count and sequence length. Our method introduces two key innovations: (1) a novel cascaded self-distillation mechanism enabling multi-granularity knowledge transfer from larger to smaller models; and (2) a vertical–horizontal dual-path decoupled LoRA compensation framework, jointly mitigating accuracy degradation under arbitrary compression configurations (e.g., layer pruning, sequence truncation, or their combinations). Evaluated on MSMARCO and the full BEIR benchmark suite, our approach consistently outperforms existing state-of-the-art methods across diverse compression settings—achieving robust high accuracy while significantly improving inference efficiency and architectural flexibility.
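The cascaded self-distillation idea described above — each compressed sub-architecture learns from the predictions of its immediate super-architecture rather than only from ground-truth labels — can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the function names, the temperature-free softmax, and the representation of each depth's output as a flat list of ranking scores are all assumptions made for clarity.

```python
import math

def softmax(scores):
    # Convert raw ranking scores into a probability distribution
    # over candidate passages (numerically stable form).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_div(p, q):
    # KL(p || q): teacher distribution p, student distribution q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cascaded_distillation_loss(scores_by_depth):
    # scores_by_depth[0] holds the full model's scores for one query's
    # candidates; each later entry comes from a progressively shallower
    # sub-architecture. Every sub-model is distilled from its immediate
    # super-architecture, so teacher signals stay smooth and informative.
    loss = 0.0
    for teacher, student in zip(scores_by_depth, scores_by_depth[1:]):
        loss += kl_div(softmax(teacher), softmax(student))
    return loss
```

In this sketch the loss is zero when every sub-architecture already matches its teacher's score distribution, and grows as compression degrades agreement, which is the signal the distillation objective minimizes.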

📝 Abstract
Large language models (LLMs) provide powerful foundations for fine-grained text re-ranking. However, they are often prohibitively expensive in practice due to constraints on computation bandwidth. In this work, we propose a flexible architecture called Matryoshka Re-Ranker, which is designed to facilitate runtime customization of model layers and the sequence length at each layer based on users' configurations. Consequently, LLM-based re-rankers can be made applicable across various real-world situations. The increased flexibility may come at the cost of precision loss. To address this problem, we introduce a suite of techniques to optimize performance. First, we propose cascaded self-distillation, where each sub-architecture learns to preserve precise re-ranking performance from its super components, whose predictions can be exploited as smooth and informative teacher signals. Second, we design a factorized compensation mechanism, where two collaborative Low-Rank Adaptation modules, vertical and horizontal, are jointly employed to compensate for the precision loss resulting from arbitrary combinations of layer and sequence compression. We perform comprehensive experiments based on the passage and document retrieval datasets from MSMARCO, along with all public datasets from the BEIR benchmark. In our experiments, Matryoshka Re-Ranker substantially outperforms existing methods, while effectively preserving its superior performance across various forms of compression and different application scenarios.
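The factorized compensation mechanism pairs two LoRA modules with the frozen base weights: a vertical module compensating for layer (depth) pruning and a horizontal module compensating for sequence (width) truncation. A minimal sketch of how such a decomposition could combine, assuming standard LoRA initialization (A random, B zero) and using hypothetical names — the paper's actual parameterization may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (illustrative values)

W = rng.normal(size=(d, d))  # frozen base weight matrix

# Vertical LoRA: trained to offset precision loss from layer pruning.
A_v, B_v = rng.normal(size=(r, d)), np.zeros((d, r))
# Horizontal LoRA: trained to offset precision loss from sequence truncation.
A_h, B_h = rng.normal(size=(r, d)), np.zeros((d, r))

def effective_weight(W, layers_pruned, sequence_truncated):
    # The two low-rank updates are additive, so any combination of
    # depth and width compression gets a matching compensation term.
    delta = np.zeros_like(W)
    if layers_pruned:
        delta += B_v @ A_v
    if sequence_truncated:
        delta += B_h @ A_h
    return W + delta
```

Because the two adapters are decoupled, a single pair can serve arbitrary runtime configurations: toggling each term on or off covers layer pruning alone, sequence truncation alone, or both together.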
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Ranking Accuracy
Flexibility and Computational Capacity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible Re-Ranking Architecture
Model Complexity Adjustment
Accuracy Compensation Mechanism