LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

📅 2025-10-28

🤖 AI Summary

Problem: Existing benchmarks for long-text generation either lack validation against real-world scenarios or oversimplify tasks, and so fail to reflect practical complexity. Method: We propose Constraint-Verifier Evaluation (CoV-Eval), the first evaluation framework to adopt a "goal-driven, constraint-verification" paradigm. It defines verifiable objectives grounded in authentic use cases, then reverse-engineers the queries, source materials, and multi-dimensional constraints from those objectives, balancing realism with assessability. It supports customized evaluation with inputs up to 64K tokens and outputs up to 8K tokens. Contribution/Results: Systematic evaluation of 23 state-of-the-art LLMs reveals persistent bottlenecks in factual consistency, coherence, and constraint adherence as real-world constraints tighten and output length increases. LongWeave, built on CoV-Eval, establishes the first benchmark for long-text generation that simultaneously ensures authenticity, scalability, and reproducibility.

📝 Abstract

Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
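The target-first construction described in the abstract can be illustrated with a toy sketch. This is not LongWeave's actual pipeline: the `Task` class, the `build_task` and `verify` functions, the substring check, and the word-count limit are all hypothetical names and simplifications chosen for illustration. The sketch only shows the ordering of steps, with targets fixed first and the query, material, and constraints derived from them, so that a model output can later be scored against the same targets.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not LongWeave's schema."""
    query: str
    material: str
    constraints: tuple[str, ...]
    targets: tuple[str, ...]
    max_words: int


def build_task(targets: list[str], max_words: int) -> Task:
    """Work backward from verifiable targets to query, material, and constraints."""
    # The source material is written so that it contains every target.
    material = " ".join(f"Note: {t}." for t in targets)
    # Each target becomes one constraint, plus one length constraint.
    constraints = tuple(f"mention '{t}'" for t in targets) + (
        f"use at most {max_words} words",
    )
    query = "Write a report covering every point in the material."
    return Task(query, material, constraints, tuple(targets), max_words)


def verify(task: Task, output: str) -> float:
    """Return the fraction of constraints satisfied (toy substring-based verifier)."""
    text = output.lower()
    checks = [t.lower() in text for t in task.targets]
    checks.append(len(output.split()) <= task.max_words)
    return sum(checks) / len(checks)


task = build_task(["revenue rose 12%", "two stores closed"], max_words=20)
partial = verify(task, "Revenue rose 12% this year.")
full = verify(task, "Revenue rose 12% and two stores closed.")
```

Here `partial` satisfies two of the three checks (one target and the length limit), while `full` satisfies all three. A real verifier would need far more robust matching than substring comparison; the point of the sketch is only that every score traces back to a target defined before the task was written.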

Problem

Research questions and friction points this paper is trying to address.

Bridging real-world relevance and verifiability in long-form generation benchmarks

Assessing LLMs' capability to meet complex real-world constraints objectively

Evaluating model performance as output length and complexity increase

Innovation

Methods, ideas, or system contributions that make the work stand out.

Balances real-world relevance with verifiable assessment

Uses Constraint-Verifier Evaluation for objective task design

Supports customizable input/output lengths across seven tasks
