🤖 AI Summary
Large language models (e.g., GPT-4o) suffer from substantial length bias and marked degradation in content quality during long-text generation. To address this, we propose a process-supervised, critique-augmented stepwise preference modeling framework. Departing from conventional outcome-based supervision, our method employs Monte Carlo Tree Search to construct stepwise preference pairs, integrating external critique feedback and a global memory pool to ensure coherence and consistency throughout generation—enabling fine-grained, controllable long-text synthesis. Experiments demonstrate significant improvements in length accuracy and content quality on long-text benchmarks, while preserving near-lossless performance on general-purpose benchmarks. The framework is model-agnostic and compatible with diverse backbone architectures.
📝 Abstract
Long-form generation is crucial for tasks such as academic paper writing and repository-level code generation. Despite this, current models, including GPT-4o, still exhibit unsatisfactory performance. Existing methods that apply preference learning with outcome supervision often fail to provide detailed feedback over extended contexts. This shortcoming can lead to content that does not fully satisfy query requirements, resulting in issues such as length deviations and diminished quality. In this paper, we propose enhancing long-form generation by incorporating process supervision. We employ Monte Carlo Tree Search to gather stepwise preference pairs, using a global memory pool to maintain consistency. To address suboptimal candidate selection, we integrate external critiques to refine and improve the quality of the preference pairs. Finally, we apply step-level DPO on the collected stepwise preference pairs. Experimental results show that our method improves length-following accuracy and content quality on long-form generation benchmarks, with almost lossless performance on general benchmarks across various model backbones.
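For intuition, the step-level DPO objective applied to one stepwise preference pair can be sketched as below. This is a minimal illustration, not the paper's implementation: the function name and scalar interface are assumptions, and each argument stands for the summed token log-probability of a single generation step (rather than the whole response) under the policy or a frozen reference model.

```python
import math

def step_dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """Step-level DPO loss for one preference pair (illustrative sketch).

    Inputs are per-step log-probabilities: the preferred (chosen) and
    dispreferred (rejected) candidate steps scored by the trainable
    policy and by the frozen reference model. `beta` scales how strongly
    the policy is pushed away from the reference.
    """
    # Implicit reward margin between chosen and rejected steps.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Loss is -log(sigmoid(margin)) = softplus(-margin),
    # computed in a numerically stable branch-wise form.
    x = -margin
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))
```

When the policy assigns the chosen step a higher relative log-probability than the rejected step, the margin is positive and the loss drops below log(2); with no preference the loss is exactly log(2). Averaging this quantity over all collected stepwise pairs gives the training objective.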