Multi$^2$: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance limitations of large language models (LLMs) in multi-document summarization (MDS). We propose a test-time scalable multi-agent reasoning framework: diverse prompts generate candidate summaries, which are then fused by a consistency-aware aggregator to produce high-quality outputs. Our contributions include (i) the first test-time multi-prompt ensembling paradigm tailored for natural language generation (NLG); and (ii) two novel evaluation metrics—Consistency-Aware Preference (CAP) scoring and Atomic Content Unit (ACU) assessment—which respectively model summary consistency and mitigate positional bias. Experiments demonstrate that our method significantly outperforms strong baselines across multiple MDS benchmarks. Moreover, we provide the first systematic analysis of test-time scaling in summarization, empirically identifying its effectiveness boundary and the point of diminishing returns.

📝 Abstract
Recent advances in test-time scaling have shown promising results in improving Large Language Model (LLM) performance through strategic computation allocation during inference. While this approach has demonstrated strong performance improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), especially summarization, has yet to be explored. Multi-Document Summarization (MDS) is a challenging task that focuses on extracting and synthesizing useful information from multiple lengthy documents. Unlike reasoning tasks, MDS requires a more nuanced approach to prompt design and ensembling, as there is no "best" prompt that satisfies diverse summarization requirements. To address this, we propose a novel framework that leverages inference-time scaling for this task. Precisely, we take a prompt ensemble approach, leveraging various prompts to first generate candidate summaries and then ensembling them with an aggregator to produce a refined summary. We also introduce two new evaluation metrics: the Consistency-Aware Preference (CAP) score and the LLM Atom-Content-Unit (ACU) score, to enhance the LLM's contextual understanding while mitigating its positional bias. Extensive experiments demonstrate the effectiveness of our approach in improving summary quality while identifying and analyzing the scaling boundaries in summarization tasks.
Problem

Research questions and friction points this paper is trying to address.

Explores test-time scaling for multi-document summarization tasks.
Proposes a prompt ensemble approach to generate refined summaries.
Introduces new metrics to enhance contextual understanding and reduce bias.
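On the bias point: the paper states that its metrics mitigate the LLM judge's positional bias, though the mechanism is not detailed on this page. A common way to do this is to query the judge with both presentation orders and average, which the following illustration-only sketch demonstrates (the `judge` function, its bias term, and `debiased_preference` are assumptions for the demo, not the paper's CAP scorer):

```python
def judge(first: str, second: str) -> float:
    # Hypothetical judge returning P(first summary is better). A real system
    # would prompt an LLM; this toy judge prefers longer summaries but leaks
    # a positional bias toward whichever summary is shown first.
    bias = 0.1
    total = max(len(first) + len(second), 1)
    base = 0.5 + 0.5 * (len(first) - len(second)) / total
    return min(1.0, max(0.0, base + bias))

def debiased_preference(a: str, b: str) -> float:
    # Average over both presentation orders so the order-dependent bias
    # term cancels out of the final preference score.
    return 0.5 * (judge(a, b) + (1.0 - judge(b, a)))
```

With this construction, `debiased_preference(a, b)` and `debiased_preference(b, a)` sum to 1, whereas the raw `judge` scores do not, because the first-position bias is applied twice.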
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages inference-time scaling for summarization
Uses prompt ensemble to generate refined summaries
Introduces CAP and ACU metrics for evaluation
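The page gives no pseudocode for the generate-then-aggregate pipeline, but its shape can be sketched as follows. This is an illustration only: the function names are invented, the toy "model" just echoes leading sentences, and the majority-vote fusion is a crude stand-in for the paper's consistency-aware aggregator (which would itself be an LLM call):

```python
from collections import Counter

def generate_summary(prompt: str, documents: list[str]) -> str:
    # Stand-in for an LLM call; a real implementation would condition the
    # model on `prompt` so that different prompts yield different candidates.
    return " ".join(doc.split(".")[0] + "." for doc in documents)

def aggregate(candidates: list[str]) -> str:
    # Toy consistency-aware fusion: keep only sentences that appear in a
    # majority of the candidate summaries.
    counts = Counter()
    for cand in candidates:
        for sent in set(s.strip() for s in cand.split(".") if s.strip()):
            counts[sent] += 1
    kept = sorted(s for s, c in counts.items() if c > len(candidates) / 2)
    return ". ".join(kept) + ("." if kept else "")

def multi2_pipeline(documents: list[str], prompts: list[str]) -> str:
    # Step 1: diverse prompts produce candidate summaries (test-time scaling).
    candidates = [generate_summary(p, documents) for p in prompts]
    # Step 2: an aggregator fuses the candidates into one refined summary.
    return aggregate(candidates)
```

Scaling test-time compute here means adding prompts: each extra prompt adds one candidate generation, while the aggregation step stays a single fusion pass, which is where the paper's diminishing-returns analysis applies.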