Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization

📅 2026-04-18

🤖 AI Summary

This work addresses a limitation of existing dialogue summarization methods: they rely heavily on surface-level metrics such as ROUGE and struggle to balance factual consistency with human preferences. The authors propose a framework that couples cognitive-style reasoning with preference alignment. First, a summarization model is initialized via staged supervised fine-tuning on structured reasoning traces distilled from a large teacher model. It is then optimized with GRPO-based reinforcement learning under a dual-principle reward that blends automatic metrics with human-aligned criteria: coverage of key information, implicit inference, factual faithfulness, and conciseness. On multilingual multi-role dialogue benchmarks, the method matches strong baselines on ROUGE and BERTScore; results on CSDS indicate stable semantic consistency, and analysis on SAMSum shows gains in factual faithfulness and model-based preference alignment.

📝 Abstract

Multi-role dialogue summarization requires modeling complex interactions among multiple speakers while preserving role-specific information and factual consistency. However, most existing methods optimize for automatic metrics such as ROUGE and BERTScore, which favor surface-level imitation of references rather than genuine gains in faithfulness or alignment with human preferences. We propose a novel framework that couples explicit cognitive-style reasoning with reward-based optimization for multi-role dialogue summarization. Our method first distills structured reasoning traces (e.g., step-by-step inferences and intermediate reflections) from a large teacher model and uses them as auxiliary supervision to initialize a reasoning-aware summarizer via staged supervised fine-tuning. It then applies GRPO with a dual-principle reward that blends metric-based signals with human-aligned criteria targeting key information coverage, implicit inference, factual faithfulness, and conciseness. Experiments on multilingual multi-role dialogue benchmarks show that our method matches strong baselines on ROUGE and BERTScore. Specifically, results on CSDS confirm the framework's stability in semantic consistency, while in-depth analysis on SAMSum demonstrates clear gains in factual faithfulness and model-based preference alignment. These findings underscore the value of reasoning-aware and preference-aware training for reliable dialogue summarization. Checkpoints and datasets are available at https://huggingface.co/collections/NebulaPixel/summorchestra-multirole-summary.
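
The dual-principle reward and the GRPO step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the blending formula, so the linear mix, the weight `alpha`, the criterion names, and the assumption that all scores lie in [0, 1] are hypothetical choices made here for illustration. The only part taken from the standard GRPO formulation is the group-relative advantage, which normalizes each sampled summary's reward by the mean and standard deviation of its sampling group (population standard deviation is used here; implementations differ on this detail).

```python
from statistics import mean, pstdev

# Hypothetical criterion names, borrowed from the abstract's wording.
CRITERIA = ("coverage", "inference", "faithfulness", "conciseness")


def blended_reward(metric_score, criterion_scores, alpha=0.5):
    """Linearly mix a metric-based score with the mean of the criterion scores.

    All inputs are assumed to be in [0, 1]; `alpha` is an assumed weight,
    not a value reported by the paper.
    """
    preference_score = mean(criterion_scores[name] for name in CRITERIA)
    return alpha * metric_score + (1.0 - alpha) * preference_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three candidate summaries sampled for the same dialogue (made-up scores).
group_rewards = [
    blended_reward(0.42, {"coverage": 0.9, "inference": 0.7,
                          "faithfulness": 1.0, "conciseness": 0.8}),
    blended_reward(0.45, {"coverage": 0.6, "inference": 0.5,
                          "faithfulness": 0.4, "conciseness": 0.9}),
    blended_reward(0.30, {"coverage": 0.7, "inference": 0.6,
                          "faithfulness": 0.9, "conciseness": 0.5}),
]
advantages = group_relative_advantages(group_rewards)
```

In this toy group the second candidate has the highest metric score but the lowest faithfulness score, so its blended reward and advantage fall below the first candidate's. That is the behavior the abstract motivates: overlap with the reference alone should not decide which summary is reinforced.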

Problem

Research questions and friction points this paper is trying to address.

multi-role dialogue summarization

faithfulness

human preferences

automatic metrics

factual consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning-aware summarization

reward-based optimization

multi-role dialogue summarization