Markov-Enhanced Clustering for Long Document Summarization: Tackling the 'Lost in the Middle' Challenge with Large Language Models

📅 2025-06-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the “lost-in-the-middle” problem, where large language models (LLMs) fail to retain critical intermediate information in long-document summarization, this paper proposes an extractive-generative hybrid framework. The document is first partitioned into chunks, and the chunk embeddings are clustered to identify core thematic units. A Markov chain graph model is then constructed to capture semantic ordering constraints and guide the logical sequencing of the clustered key segments. Finally, an LLM generates a coherent, semantically ordered summary from the structured input. The key contribution is the integration of Markov chains into hybrid summarization to drive structure-aware ordering of salient passages. Experiments on multiple long-document benchmarks demonstrate statistically significant improvements in ROUGE scores over both pure generative and conventional extractive-generative baselines, confirming enhanced factual completeness and semantic coherence.

📝 Abstract

The rapid expansion of information from diverse sources has heightened the need for effective automatic text summarization, which condenses documents into shorter, coherent texts. Summarization methods generally fall into two categories: extractive, which selects key segments from the original text, and abstractive, which generates summaries by rephrasing the content coherently. Large language models have advanced the field of abstractive summarization, but they are resource-intensive and struggle to retain key information across lengthy documents, a problem we refer to as being "lost in the middle". To address these issues, we propose a hybrid summarization approach that combines extractive and abstractive techniques. Our method splits the document into smaller text chunks, clusters their vector embeddings, generates a summary for each cluster that represents a key idea in the document, and constructs the final summary by relying on a Markov chain graph when selecting the semantic order of ideas.
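The chunk-and-cluster step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the helper names (`split_into_chunks`, `embed`, `kmeans`), the fixed two-sentence chunk size, the bag-of-words vectors standing in for real embeddings, and the choice of k-means as the clustering algorithm.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, sentences_per_chunk=2):
    """Split text into chunks of a fixed number of sentences (assumed chunking rule)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def embed(chunk, vocabulary):
    """Toy stand-in for a learned embedding: L2-normalised word counts."""
    counts = Counter(re.findall(r"[a-z]+", chunk.lower()))
    vec = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; centroids are seeded with the first k vectors for determinism."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


document = "Cats purr. Cats sleep. Stocks rise. Stocks fall. Cats purr. Cats sleep."
chunks = split_into_chunks(document)
vocabulary = sorted(set(re.findall(r"[a-z]+", document.lower())))
vectors = [embed(chunk, vocabulary) for chunk in chunks]
labels = kmeans(vectors, k=2)
print(labels)  # [0, 1, 0]: the two cat chunks share a cluster
```

In a real pipeline the `embed` function would be replaced by a sentence-embedding model, and each cluster's chunks would be passed to an LLM to produce the per-cluster summary.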

Problem

Research questions and friction points this paper is trying to address.

Addresses 'lost in the middle' issue in long document summarization

Combines extractive and abstractive techniques for better summaries

Uses clustering and Markov chains to organize key ideas

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid extractive and abstractive summarization approach

Clusters vector embeddings of text chunks

Uses Markov chain for semantic order selection
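One way a Markov chain could drive the ordering of cluster summaries is sketched below. The summary above does not specify how the graph is built or traversed, so this is an assumed construction: transition probabilities are estimated from how often a chunk of one cluster is directly followed by a chunk of another in the source document, and the order is chosen by a greedy walk. The function names `transition_matrix` and `order_clusters` are hypothetical.

```python
def transition_matrix(labels, k):
    """Row-normalised counts of cluster-to-cluster transitions between adjacent chunks."""
    counts = [[0.0] * k for _ in range(k)]
    for a, b in zip(labels, labels[1:]):
        if a != b:  # self-transitions carry no ordering information
            counts[a][b] += 1.0
    for row in counts:
        total = sum(row)
        if total:
            for j in range(k):
                row[j] /= total
    return counts


def order_clusters(labels, k):
    """Greedy walk over the chain.

    Start at the cluster of the first chunk, then repeatedly move to the
    unvisited cluster with the highest transition probability. Ties are
    broken in favour of the lowest cluster index.
    """
    probs = transition_matrix(labels, k)
    current = labels[0]
    order = [current]
    remaining = set(range(k)) - {current}
    while remaining:
        current = max(sorted(remaining), key=lambda c: probs[current][c])
        order.append(current)
        remaining.remove(current)
    return order


# Cluster label of each chunk, in document order (toy input).
chunk_labels = [0, 2, 2, 1, 0, 2, 1]
order = order_clusters(chunk_labels, k=3)
print(order)  # [0, 2, 1]
```

The per-cluster summaries would then be concatenated in this order to form the final summary. A greedy walk is only one option; a highest-probability path search over the same matrix would be another reasonable choice.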
