An Empirical Study of Many-to-Many Summarization with Large Language Models

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the multilingual many-to-many summarization (M2MS) task: generating abstractive summaries in any target language from documents in any source language. We introduce the first unified M2MS benchmark, covering five domains, six languages, and 47.8K samples, and show that instruction tuning enhances open-source LLMs' cross-lingual summarization capability without compromising their general-purpose abilities, enabling them to surpass zero-shot GPT-4 on automatic metrics. We further find that zero-shot open-source LLMs already achieve results competitive with fine-tuned traditional models (e.g., mBART). Crucially, human evaluation reveals that, despite substantial gains in fluency and relevance, factual consistency remains the primary bottleneck for current LLMs in cross-lingual summarization, and instruction tuning may even intensify factual errors. This work thus establishes a new benchmark, demonstrates effective task-specific tuning, and provides critical insights into the factual limitations of LLMs in multilingual abstractive summarization.

📝 Abstract
Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries in any language. Recently, large language models (LLMs) have shown strong multilingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study of LLMs' M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which can be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also evaluated for comparison. Our experiments reveal that zero-shot LLMs achieve results competitive with fine-tuned traditional models. After instruction tuning, open-source LLMs can significantly improve their M2MS ability and outperform zero-shot LLMs (including GPT-4) in terms of automatic evaluations. In addition, we demonstrate that this task-specific improvement does not sacrifice the LLMs' general task-solving abilities. However, as revealed by our human evaluation, LLMs still face the factuality issue, and instruction tuning might intensify the issue. Thus, controlling factual errors becomes key when building LLM summarizers for real applications, and is worth noting in future research.
Problem

Research questions and friction points this paper is trying to address.

Study LLMs' ability on the many-to-many summarization (M2MS) task across languages.
Evaluate 18 LLMs under both zero-shot and instruction-tuning settings.
Examine factuality issues in LLM-generated summaries, which instruction tuning may intensify.
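The zero-shot setting above amounts to prompting an LLM with an explicit source and target language. A minimal sketch of such a prompt builder is shown below; the exact prompt wording used in the paper is not given here, so the template and function name are illustrative assumptions.

```python
def build_m2ms_prompt(document: str, src_lang: str, tgt_lang: str) -> str:
    """Compose a zero-shot M2MS prompt: summarize a document written
    in src_lang into a summary written in tgt_lang.

    Hypothetical template; the paper's actual prompts may differ.
    """
    return (
        f"The following document is written in {src_lang}.\n"
        f"Write a concise abstractive summary of it in {tgt_lang}.\n\n"
        f"Document:\n{document}\n\n"
        f"Summary:"
    )

# Example: German document, English summary.
prompt = build_m2ms_prompt("Ein kurzer Beispieltext.", "German", "English")
```

The key point the cross-lingual setting requires is that both languages are stated explicitly, since source and target may differ for every sample.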
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reorganizes M2MS data from eight domain-specific datasets into a unified benchmark (47.8K samples, five domains, six languages).
Benchmarks 18 LLMs in zero-shot and instruction-tuning settings, with fine-tuned traditional models (e.g., mBART) as baselines.
Demonstrates that instruction tuning improves M2MS ability without sacrificing general task-solving abilities.
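For the instruction-tuning setting, each reorganized sample can be cast as an instruction/input/output record, the common format for supervised fine-tuning of LLMs. The sketch below is a hypothetical conversion; the field names and instruction wording are assumptions, not the paper's actual schema.

```python
def to_instruction_sample(doc: str, summary: str,
                          src_lang: str, tgt_lang: str) -> dict:
    """Convert one M2MS example into an instruction-tuning record.

    Hypothetical schema: the instruction names both languages so a
    single tuned model can handle any source/target language pair.
    """
    return {
        "instruction": f"Summarize the following {src_lang} document in {tgt_lang}.",
        "input": doc,
        "output": summary,
    }

# Example: French document paired with an English reference summary.
sample = to_instruction_sample("Un texte court.", "A short text.",
                               "French", "English")
```

Mixing language pairs in one tuning set is what lets a single model cover the full many-to-many matrix rather than one model per direction.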