🤖 AI Summary
Long-context query-focused summarization suffers from low credibility because large language models (LLMs) extract evidence incompletely and exhibit positional bias, notably "middle omission". To address this, the paper introduces **unstructured evidence attribution**, a novel task. Methodologically, it constructs SUnsET, the first domain-agnostic, synthetically controllable dataset for this task, and proposes an LLM-based synthetic annotation paradigm integrating supervised fine-tuning, evidence span extraction and alignment, and multi-scale attention analysis. Experiments across five LLMs and four heterogeneous datasets show improved evidence relevance and factual consistency, more uniform coverage of evidence positions (mitigating middle omission), and more relevant and consistent summaries.
📝 Abstract
Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query. Extracting and properly citing evidence spans could help improve the transparency and reliability of these summaries. At the same time, LLMs suffer from positional biases in terms of which information they understand and attend to, which could affect evidence citation. Whereas previous work has focused on evidence citation with predefined levels of granularity (e.g., sentence, paragraph, or document), we propose the task of long-context query-focused summarization with unstructured evidence citation. We show that existing systems struggle to generate and properly cite unstructured evidence from their context, and that evidence tends to be "lost-in-the-middle". To help mitigate this, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel domain-agnostic pipeline, which can be used as supervision to adapt LLMs to this task. We demonstrate across 5 LLMs of different sizes and 4 datasets with varying document types and lengths that LLMs adapted with SUnsET data generate more relevant and factually consistent evidence than their base models, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries.