CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels

📅 2024-12-03

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Problem: The scarcity of high-quality long-text summarization datasets hinders progress in long-context summarization research. Method: The authors introduce CNNSum, a multi-scale benchmark for Chinese novel summarization (lengths from 16k to 128k) comprising 695 human-annotated samples across four subsets. They evaluate numerous LLMs, analyze cases in detail, and run fine-tuning experiments that vary prompt templates, model versions, and RoPE base scaling to test how well training on short-context data carries over to long-context summarization. Contributions/Results: Advanced models such as GPT-4o may still insert subjective commentary, producing vague summaries; small LLMs are the most cost-effective, because the advantages of larger models are hard to exploit; Base models perform better after fine-tuning; and CNNSum yields more reliable and insightful evaluation results than other benchmarks. The benchmark is publicly released to advance research on Chinese long-text summarization.

📝 Abstract

Large Language Models (LLMs) have been well researched in various long-context tasks. However, the scarcity of high-quality long-context summarization datasets has hindered further advancements in this area. To address this, we introduce CNNSum, a multi-scale long-context summarization benchmark based on Chinese novels, featuring human-driven annotations, which comprises four subsets totaling 695 samples, with lengths ranging from 16k to 128k. We evaluate numerous LLMs and conduct detailed case analyses. Furthermore, we conduct extensive fine-tuning experiments to explore and improve long-context summarization. In our study: (1) Advanced LLMs like GPT-4o may still generate subjective commentary, leading to vague summaries. (2) Currently, long-context summarization mainly relies on the memory ability afforded by longer context lengths. The advantages of large LLMs are hard to utilize, so small LLMs are the most cost-effective. (3) Different prompt templates paired with different model versions may cause large performance gaps. With further fine-tuning, these gaps can be mitigated, and the Base version models perform better. (4) LLMs with a scaled RoPE base exhibit strong extrapolation potential; using short-context data can significantly improve long-context summarization performance. However, further applying other interpolation methods requires careful selection. (5) CNNSum provides more reliable and insightful evaluation results than other benchmarks. We release CNNSum to advance future research in this field. https://github.com/CxsGhost/CNNSum
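
Finding (4) refers to scaling the RoPE base. As a rough illustration only, and not the authors' implementation, the sketch below uses the standard RoPE frequency formula theta_i = base^(-2i/d) to show why a larger base stretches the slowest rotation over more positions. The function names, the head dimension of 128, and the base values 1e4 and 1e6 are assumptions chosen for the example; the paper's actual settings are not stated in this summary.

```python
import math


def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/d), i = 0..d/2-1."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Number of positions needed for the slowest dimension pair to rotate once."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]


def rotate_pair(x0: float, x1: float, position: int, inv_freq: float) -> tuple[float, float]:
    """Rotate one (x0, x1) feature pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return (x0 * c - x1 * s, x0 * s + x1 * c)


if __name__ == "__main__":
    HEAD_DIM = 128  # hypothetical head dimension, for illustration only
    for base in (1e4, 1e6):  # illustrative bases, not values taken from the paper
        print(f"base={base:.0e}  longest wavelength ~ {longest_wavelength(HEAD_DIM, base):,.0f} positions")
```

Raising the base lowers every frequency except the first, so positions far beyond the training length still map to rotation angles in a range the model has seen. This is one common intuition for why base scaling helps extrapolation.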

Problem

Research questions and friction points this paper is trying to address.

Lack of long-context summarization datasets for Chinese novels

Difficulty in exploiting the advantages of large LLMs for summarization

Performance gaps caused by differing prompt templates and model versions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CNNSum for Chinese novel summarization

Explores RoPE-base scaling for long-context tasks

Benchmarks LLMs with human-driven annotations

🔎 Similar Papers

LaMSUM: Amplifying Voices Against Harassment through LLM Guided Extractive Summarization of User Incident Reports