DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

In open-domain dialogue evaluation, large language models (LLMs) exhibit unstable discrimination under ambiguous scenarios, while small language models (SLMs) are vulnerable to adversarial inputs. To address this, we propose the Dual Refinement Evaluation (DRE) framework—the first approach enabling SLMs to fully guide and dynamically calibrate LLM-based evaluation. DRE operates in two stages: first, the SLM generates prompt-driven preliminary assessments to constrain the LLM’s output space; second, it adaptively recalibrates the LLM’s scores via a bias-aware mechanism. By synergistically integrating SLM robustness with LLM reasoning capability, DRE employs adaptive weighted ensemble and prompt engineering. Extensive experiments across multiple benchmarks demonstrate that DRE significantly improves alignment with human judgments, achieving an average 12.7% gain in Kendall’s τ over state-of-the-art methods. This work establishes a more reliable and interpretable collaborative evaluation paradigm for open-domain dialogue assessment.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) excel at many tasks but struggle with ambiguous scenarios where multiple valid responses exist, often yielding unreliable results. Conversely, Small Language Models (SLMs) demonstrate robustness in such scenarios but are susceptible to misleading or adversarial inputs. We observed that LLMs handle negative examples effectively, while SLMs excel with positive examples. To leverage their complementary strengths, we introduce SLIDE (Small and Large Integrated for Dialogue Evaluation), a method integrating SLMs and LLMs via adaptive weighting. Building on SLIDE, we further propose a Dual-Refinement Evaluation (DRE) method to enhance SLM-LLM integration: (1) SLM-generated insights guide the LLM to produce initial evaluations; (2) SLM-derived adjustments refine the LLM's scores for improved accuracy. Experiments demonstrate that DRE outperforms existing methods, showing stronger alignment with human judgment across diverse benchmarks. This work illustrates how combining small and large models can yield more reliable evaluation tools, particularly for open-ended tasks such as dialogue evaluation.

Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with ambiguous dialogue scenarios

SLMs are vulnerable to misleading inputs

Integrating SLMs and LLMs improves evaluation reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates SLMs and LLMs via adaptive weighting

Uses SLM insights to guide LLM evaluations

Refines LLM scores with SLM-derived adjustments

🔎 Similar Papers

A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems