CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels

Date: 2024-12-03
Venue: arXiv.org
Citations: 1
Influential: 0

AI Summary
Problem: The scarcity of high-quality long-text summarization datasets for Chinese novels hinders progress in long-context summarization research. Method: We introduce CNNSum, the first dedicated multi-scale benchmark for Chinese novel summarization (lengths from 16k to 128k), comprising 695 human-annotated samples. We explore RoPE base scaling, combined with instruction tuning and multiple prompt templates, to improve generalization from short-context training to long-context summarization. We further establish a fine-grained, cross-scale evaluation framework. Contributions/Results: Experiments show that models such as GPT-4o can add subjective commentary that makes summaries vague; lightweight base models, after targeted fine-tuning, outperform most commercial large language models on CNNSum; and CNNSum yields more discriminative and reliable evaluation outcomes than existing benchmarks. All data and code are publicly released to advance research on Chinese long-text summarization.

Abstract
Large Language Models (LLMs) have been well researched in various long-context tasks. However, the scarcity of high-quality long-context summarization datasets has hindered further advances in this area. To address this, we introduce CNNSum, a multi-scale long-context summarization benchmark based on Chinese novels with human-driven annotations. It comprises four subsets totaling 695 samples, with lengths ranging from 16k to 128k. We evaluate numerous LLMs and conduct detailed case analyses. Furthermore, we conduct extensive fine-tuning experiments to explore and improve long-context summarization. In our study: (1) Advanced LLMs like GPT-4o may still generate subjective commentary, leading to vague summaries. (2) Currently, long-context summarization mainly relies on the memory ability afforded by longer context lengths. The advantages of large LLMs are hard to exploit, so small LLMs are the most cost-effective. (3) Different prompt templates paired with different model versions may cause large performance gaps. Further fine-tuning can mitigate these gaps, and the Base version models perform better. (4) LLMs with a scaled RoPE base exhibit strong extrapolation potential; using short-context data can significantly improve long-context summarization performance. However, further applying other interpolation methods requires careful selection. (5) CNNSum provides more reliable and insightful evaluation results than other benchmarks. We release CNNSum to advance future research in this field: https://github.com/CxsGhost/CNNSum
Problem

Research questions and friction points this paper is trying to address.

Lack of long-context summarization datasets for Chinese novels
Difficulty exploiting the advantages of large LLMs in long-context summarization
Performance gaps caused by different prompt templates and model versions
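
The last point, about prompt templates and model versions, can be made concrete with a small sketch. It is only an illustration: the instruction wording, the two template layouts (instruction before or after the text), and the model-version labels are hypothetical and are not taken from the paper or its repository.

```python
from itertools import product

# Hypothetical instruction wording, for illustration only.
INSTRUCTION = "Summarize the following novel excerpt."


def instruction_first(text: str, instruction: str = INSTRUCTION) -> str:
    """Template A (assumed layout): instruction placed before the long text."""
    return f"{instruction}\n\n{text}"


def instruction_last(text: str, instruction: str = INSTRUCTION) -> str:
    """Template B (assumed layout): instruction placed after the long text."""
    return f"{text}\n\n{instruction}"


TEMPLATES = {
    "instruction_first": instruction_first,
    "instruction_last": instruction_last,
}


def evaluation_grid(model_versions, templates=TEMPLATES):
    """List every (model version, template name) pair to be scored."""
    return list(product(model_versions, templates))


if __name__ == "__main__":
    # "base" and "chat" are placeholder labels, not real model names.
    for version, template_name in evaluation_grid(["base", "chat"]):
        prompt = TEMPLATES[template_name]("<novel text>")
        print(version, template_name, len(prompt))
```

Scoring every pair in such a grid, rather than a single template per model, is one way to expose the kind of gap described above.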
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CNNSum for Chinese novel summarization
Explores RoPE-base scaling for long-context tasks
Benchmarks LLMs with human-driven annotations
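
As a rough illustration of the RoPE-base scaling item, the sketch below assumes the term refers to raising the base constant used in rotary position embeddings, so that each dimension pair rotates more slowly and the longest wavelength covers more positions. The function names, the interleaved pairing of dimensions, and the example values (head dimension 128, bases 10,000 and 1,000,000) are assumptions for illustration, not the paper's settings.

```python
import math


def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    """Rotation frequency per dimension pair: base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Wavelength, in token positions, of the slowest-rotating pair."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]


def apply_rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position * frequency.

    Assumes an even-length vector and interleaved pairing; some
    implementations pair the first half with the second half instead.
    """
    out = []
    for i, freq in enumerate(rope_inv_freq(len(vec), base)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


if __name__ == "__main__":
    # Illustrative values only: compare a default base with a scaled one.
    for base in (10_000.0, 1_000_000.0):
        print(base, longest_wavelength(128, base))
```

With a larger base the slowest pair completes a full rotation over many more positions, which is the usual intuition for why a scaled base helps a model handle positions beyond its training length.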