A Survey of Long-Document Retrieval in the PLM and LLM Era

📅 2025-09-09

🤖 AI Summary

Long-document retrieval faces three core challenges: excessive length, scattered evidence, and complex structural organization. This paper surveys how long-document retrieval techniques evolved across the pre-trained language model (PLM) and large language model (LLM) eras and, for the first time, unifies three developmental stages: classical passage retrieval, hierarchical encoding with efficient attention mechanisms, and LLM-driven re-ranking. It proposes a unified taxonomy covering key paradigms such as passage aggregation, structure-aware encoding, and retrieval-augmented generation, and it reviews domain-specific evaluation resources. It also identifies open problems in the foundation model era: the trade-off between efficiency and effectiveness, cross-modal semantic alignment, and result interpretability. The result is a structured reference and a roadmap for long-document retrieval research.

📝 Abstract

The proliferation of long-form documents presents a fundamental challenge to information retrieval (IR), as their length, dispersed evidence, and complex structures demand specialized methods beyond standard passage-level techniques. This survey provides the first comprehensive treatment of long-document retrieval (LDR), consolidating methods, challenges, and applications across three major eras. We systematize the evolution from classical lexical and early neural models to modern pre-trained language models (PLMs) and large language models (LLMs), covering key paradigms such as passage aggregation, hierarchical encoding, efficient attention, and the latest LLM-driven re-ranking and retrieval techniques. Beyond the models, we review domain-specific applications and specialized evaluation resources, and we outline critical open challenges such as efficiency trade-offs, multimodal alignment, and faithfulness. This survey aims to provide both a consolidated reference and a forward-looking agenda for advancing long-document retrieval in the era of foundation models.

Problem

Research questions and friction points this paper is trying to address.

Addressing the information retrieval challenges posed by long documents with dispersed evidence

Developing specialized methods beyond standard passage-level retrieval techniques

Overcoming efficiency and structural complexity issues in long-document retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Passage aggregation and hierarchical encoding

Efficient attention mechanisms for long documents

LLM-driven re-ranking and retrieval techniques
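
To make the first item above concrete, here is a minimal sketch of passage aggregation: a long document is cut into overlapping windows, each window is scored against the query, and the per-passage scores are combined into one document score. This is an illustration only and is not taken from the survey. The function names, the window and stride sizes, and the term-overlap scorer are all assumptions; a real system would likely replace the scorer with a PLM or LLM relevance model.

```python
from typing import List


def split_into_passages(tokens: List[str], window: int = 8, stride: int = 4) -> List[List[str]]:
    """Slice a token list into overlapping fixed-size windows (sizes are illustrative)."""
    if len(tokens) <= window:
        return [tokens]
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
    return passages


def overlap_score(query: List[str], passage: List[str]) -> float:
    """Toy stand-in for a neural scorer: fraction of distinct query terms found in the passage."""
    terms = set(query)
    if not terms:
        return 0.0
    return len(terms & set(passage)) / len(terms)


def score_document(query: List[str], tokens: List[str], aggregate: str = "max",
                   window: int = 8, stride: int = 4) -> float:
    """Score every passage, then aggregate the passage scores into one document score."""
    passages = split_into_passages(tokens, window, stride)
    scores = [overlap_score(query, p) for p in passages]
    if aggregate == "max":    # best single passage decides
        return max(scores)
    if aggregate == "first":  # only the opening passage counts
        return scores[0]
    if aggregate == "mean":   # average evidence across passages
        return sum(scores) / len(scores)
    raise ValueError(f"unknown aggregation: {aggregate}")


# Hypothetical document whose only relevant evidence sits at the very end.
query = ["retrieval", "evidence"]
tokens = ["filler"] * 12 + ["retrieval", "evidence"]
for mode in ("max", "first", "mean"):
    print(mode, score_document(query, tokens, aggregate=mode))
```

In this toy document the relevant terms appear only in the final window, so scoring the first passage alone misses them while taking the maximum finds them. That gap is one small instance of the dispersed-evidence problem described above.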

🔎 Similar Papers

Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval

2024-03-27 · arXiv.org · Citations: 9
