CLIPPER: Compression enables long-context synthetic data generation

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-quality synthetic data for long-text narrative claim verification remains scarce. Method: This paper proposes a hierarchical, compression-guided paradigm for synthetic data generation: (1) two-level compression of books into chapter outlines and global summaries; (2) two-stage generation of complex claims and their corresponding chain-of-thought (CoT) rationales, conditioned on the compressed representations—thereby mitigating the factual hallucination inherent in generating directly from raw text. Contribution/Results: The authors construct a large-scale (19K-instance) synthetic dataset for narrative claim verification, pairing claims with source texts and explicit CoTs. Combining hierarchical compression, multi-stage LLM prompting, and fine-tuning, the approach significantly boosts the performance of small-to-medium open-weight models (<10B parameters): verification accuracy rises from 28% to 76%, establishing a new SOTA for sub-10B models on the NoCha leaderboard, while also improving downstream performance on NarrativeQA and yielding more detailed, source-grounded reasoning chains.
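The two-phase pipeline described above (compress, then generate claims and rationales from the compressed views) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `call_llm`, the prompt wording, and the function names are all hypothetical, and the LLM call is stubbed out so the sketch runs standalone.

```python
# Hypothetical sketch of a compression-guided synthetic data pipeline
# in the style of CLIPPER. All names and prompts are illustrative.

def call_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion API call; stubbed so the
    # sketch runs without network access.
    return f"[LLM output for: {prompt[:40]}...]"

def compress_book(chapters: list[str]) -> dict:
    """Two-level compression: per-chapter outlines, then a global summary."""
    outlines = [call_llm(f"Outline this chapter:\n{ch}") for ch in chapters]
    summary = call_llm("Summarize the whole book from these outlines:\n"
                       + "\n".join(outlines))
    return {"outlines": outlines, "summary": summary}

def generate_claim_with_cot(compressed: dict, label: str) -> dict:
    """Two-stage generation: a claim, then its chain-of-thought rationale,
    both conditioned on the compressed representations rather than raw text."""
    claim = call_llm(f"Write a complex {label} claim about the book, given "
                     f"this summary and outlines:\n{compressed['summary']}")
    cot = call_llm(f"Explain step by step why this claim is {label}:\n{claim}")
    return {"claim": claim, "label": label, "cot": cot}

# Example: build one synthetic training instance from a two-chapter "book".
book = ["Chapter 1 text ...", "Chapter 2 text ..."]
compressed = compress_book(book)
instance = generate_claim_with_cot(compressed, "TRUE")
```

In the paper's setting, many such instances (with TRUE and FALSE labels) are generated and used to fine-tune open-weight models; conditioning on outlines and summaries instead of raw book text is what reduces artifact-riddled claims.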

📝 Abstract
LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
Problem

Research questions and friction points this paper is trying to address.

High-quality synthetic data for long-context reasoning is scarce
Claims generated directly from raw book text are artifact-riddled
Narrative claim verification requires reasoning over an entire book
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compression-based synthetic data generation (chapter outlines + book summaries)
Intermediate representations yield more valid, grounded, and complex claims
Paired chain-of-thought rationales improve verification accuracy