A Survey of Long-Document Retrieval in the PLM and LLM Era

📅 2025-09-09
🤖 AI Summary
Long-document retrieval faces three core challenges: excessive length, scattered evidence, and complex structural organization. This paper surveys how long-document retrieval techniques evolved across the pre-trained language model (PLM) and large language model (LLM) eras and, for the first time, brings three developmental stages into a single account: classical passage retrieval, hierarchical encoding with efficient attention mechanisms, and LLM-driven re-ranking. It proposes a unified technical taxonomy covering key paradigms such as passage aggregation, structure-aware encoding, and retrieval-augmented generation, and it reviews domain-specific evaluation resources. It also identifies open problems in the foundation model era: the efficiency–effectiveness trade-off, cross-modal semantic alignment, and result interpretability. The result is a structured survey framework and a roadmap for long-document information retrieval research.

📝 Abstract
The proliferation of long-form documents presents a fundamental challenge to information retrieval (IR), as their length, dispersed evidence, and complex structures demand specialized methods beyond standard passage-level techniques. This survey provides the first comprehensive treatment of long-document retrieval (LDR), consolidating methods, challenges, and applications across three major eras. We systematize the evolution from classical lexical and early neural models to modern pre-trained language models (PLMs) and large language models (LLMs), covering key paradigms such as passage aggregation, hierarchical encoding, efficient attention, and the latest LLM-driven re-ranking and retrieval techniques. Beyond the models, we review domain-specific applications and specialized evaluation resources, and we outline critical open challenges such as efficiency trade-offs, multimodal alignment, and faithfulness. This survey aims to provide both a consolidated reference and a forward-looking agenda for advancing long-document retrieval in the era of foundation models.
Problem

Research questions and friction points this paper is trying to address.

Addressing information retrieval challenges from long documents with dispersed evidence
Developing specialized methods beyond standard passage-level retrieval techniques
Overcoming efficiency and structural complexity issues in long-document retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Passage aggregation and hierarchical encoding
Efficient attention mechanisms for long documents
LLM-driven re-ranking and retrieval techniques
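
To make the first item concrete, the following is a minimal sketch of passage aggregation, not the survey's own implementation. It assumes a document is split into overlapping fixed-size token windows, each window is scored against the query, and the passage scores are combined into one document score. The function names, the window parameters, and the toy term-overlap scorer are all hypothetical stand-ins; in practice the scorer would typically be a neural ranker.

```python
from typing import Callable, List


def split_into_passages(tokens: List[str], size: int, stride: int) -> List[List[str]]:
    """Slide a fixed-size window over the tokens; windows overlap when stride < size."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    passages = []
    start = 0
    while True:
        passages.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the document
        start += stride
    return passages


def overlap_score(query: List[str], passage: List[str]) -> float:
    """Toy scorer (an assumption for illustration): fraction of query terms in the passage."""
    if not query:
        return 0.0
    passage_terms = set(passage)
    return sum(1 for term in query if term in passage_terms) / len(query)


def score_document(
    query: List[str],
    doc_tokens: List[str],
    size: int = 4,
    stride: int = 2,
    aggregate: str = "max",
    scorer: Callable[[List[str], List[str]], float] = overlap_score,
) -> float:
    """Score each passage, then combine passage scores into one document score."""
    passages = split_into_passages(doc_tokens, size, stride)
    scores = [scorer(query, passage) for passage in passages]
    if aggregate == "max":    # best single passage decides the document score
        return max(scores)
    if aggregate == "sum":    # evidence accumulates across passages
        return sum(scores)
    if aggregate == "first":  # only the opening passage is considered
        return scores[0]
    raise ValueError(f"unknown aggregation strategy: {aggregate}")


if __name__ == "__main__":
    doc = "a b c d e f g h".split()
    query = ["e", "h"]
    for strategy in ("max", "sum", "first"):
        print(strategy, score_document(query, doc, aggregate=strategy))
```

The choice of aggregation matters when evidence is dispersed: taking the maximum rewards a single strong passage, summing rewards documents whose relevant content is spread out, and using only the first passage ignores everything after the opening window.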
Minghan Li
School of Computer Science and Technology, Soochow University, China
Miyang Luo
School of Computer Science and Technology, Soochow University, China
Tianrui Lv
School of Computer Science and Technology, Soochow University, China
Yishuai Zhang
School of Computer Science and Technology, Soochow University, China
Siqi Zhao
School of Computer Science and Technology, Soochow University, China
Ercong Nie
LMU Munich, MCML
Computational Linguistics, Natural Language Processing
Guodong Zhou
Soochow University, China
Natural Language Processing, Artificial Intelligence