Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Small language models suffer from limited performance, poor robustness, and insufficient adaptability, while existing knowledge distillation methods—such as static distillation, RLHF, and introspection—struggle to enable sustained improvement. To address this, we propose a multi-agent debate-driven lightweight knowledge distillation framework. Our approach introduces two key innovations: (1) a student–teacher multi-round collaborative debate mechanism that generates actionable error analyses and strategy-correction feedback; and (2) tree-structured direct preference optimization (T-DPO), which models debate logs as hierarchical decision paths to enhance feedback utilization efficiency and generalization. Evaluated across multiple NLP benchmarks, our method significantly improves accuracy, robustness, and generalization of small models—consistently outperforming static distillation, RLHF, and introspection baselines. These results demonstrate the effectiveness of efficient knowledge transfer under resource-constrained settings.
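The debate mechanism described above can be sketched as a simple control loop: the student answers, the teacher critiques, and the student revises conditioned on the feedback. The helpers below (`student_answer`, `teacher_critique`) are hypothetical stand-ins for real model calls, stubbed so the flow runs end to end; the transcript the loop returns is the "debate log" that later training consumes.

```python
def student_answer(question, feedback=None):
    # Hypothetical stub: a real implementation would prompt the student
    # model, optionally conditioning on the teacher's feedback.
    return "revised answer" if feedback else "initial answer"

def teacher_critique(question, answer):
    # Hypothetical stub: a real implementation would prompt the stronger
    # teacher model for error analysis and a corrective strategy.
    return {
        "error_analysis": "missed a premise",
        "corrective_strategy": "re-check each premise before concluding",
    }

def debate(question, rounds=3):
    """Run a multi-round student-teacher debate; return the transcript."""
    log = []
    answer = student_answer(question)
    for _ in range(rounds):
        feedback = teacher_critique(question, answer)
        log.append({"answer": answer, "feedback": feedback})
        answer = student_answer(question, feedback)
    log.append({"answer": answer, "feedback": None})  # final revision
    return log

transcript = debate("Is 17 prime?")
```

The exact prompting format and number of rounds are assumptions; the paper's framework would plug real model inference into the two stubs.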

📝 Abstract
Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks, yet their high computational demands limit widespread adoption. While distilling large models into smaller ones offers a sustainable solution, current techniques, such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection, struggle to yield substantial and lasting performance gains. In this paper, we present a novel Debate and Reflect (D&R) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback (e.g., error analysis, corrective strategies) to guide student models. Further, we introduce Tree-structured Direct Preference Optimization (T-DPO) to efficiently leverage these debate logs, organizing interactions into a hierarchical format for effective training. Empirical evaluations across diverse NLP benchmarks demonstrate that our approach significantly improves smaller-model accuracy, robustness, and generalization, outperforming conventional baselines by a large margin.
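For context, T-DPO extends standard Direct Preference Optimization. The abstract does not give the tree-structured loss, but the underlying DPO objective it builds on is, for a preferred response $y_w$ and a rejected response $y_l$ to prompt $x$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Here $\pi_\theta$ is the student policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference policy, and $\beta$ a temperature controlling deviation from the reference. In T-DPO, the pairs $(y_w, y_l)$ would presumably be drawn from the hierarchical debate logs rather than flat preference data.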
Problem

Research questions and friction points this paper is trying to address.

Reduce computational demands of large language models
Improve small model performance via feedback
Optimize training efficiency with hierarchical interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent debate framework for feedback
Tree-structured preference optimization technique
Hierarchical interaction organization for training
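One way to picture the hierarchical organization above: debate interactions form a tree of candidate responses, and each branch point yields preference pairs (chosen vs. rejected) for DPO-style training. The sketch below is an illustrative assumption, not the paper's exact data format; the `score` field and dict layout are hypothetical.

```python
def collect_preference_pairs(node, pairs=None):
    """Walk a debate tree; at each branch point, pair the best-scored
    child (chosen) against each lower-scored sibling (rejected)."""
    if pairs is None:
        pairs = []
    children = node.get("children", [])
    if len(children) > 1:
        ranked = sorted(children, key=lambda c: c["score"], reverse=True)
        chosen = ranked[0]
        for rejected in ranked[1:]:
            pairs.append((chosen["text"], rejected["text"]))
    for child in children:
        collect_preference_pairs(child, pairs)
    return pairs

# Toy debate tree: a prompt with two candidate answers; the stronger
# answer branches again after a round of teacher feedback.
tree = {
    "text": "prompt",
    "children": [
        {"text": "good answer", "score": 0.9,
         "children": [
             {"text": "better revision", "score": 0.95, "children": []},
             {"text": "weaker revision", "score": 0.4, "children": []},
         ]},
        {"text": "bad answer", "score": 0.2, "children": []},
    ],
}
pairs = collect_preference_pairs(tree)
```

A flat log of the same interactions would produce only one pair per prompt; organizing the log as a tree extracts a preference pair at every decision point, which is the feedback-efficiency gain the summary describes.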