Semantic Agreement Enables Efficient Open-Ended LLM Cascades

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In open-ended text generation, cascaded systems struggle to reliably assess output quality when tasks admit multiple valid solutions and quality lies on a continuous spectrum. Traditional token-level confidence scores fail to capture quality across diverse, semantically equivalent responses. Method: We propose a training-free reliability criterion grounded in semantic consistency, replacing token-level confidence with embedding-based similarity measures, and design a black-box-compatible cascade routing mechanism that dynamically dispatches inputs between small and large models based on semantic agreement among ensemble outputs. Contribution/Results: Evaluated across models ranging from 500M to 70B parameters, the approach matches target large-model quality at 40% of the computational cost and reduces end-to-end latency by up to 60%. It requires no fine-tuning, remains robust across models and model versions, and substantially improves the efficiency and practicality of open-domain cascaded generation systems.
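The routing mechanism summarized above can be sketched in a few lines: sample several outputs from the small model, measure their mean pairwise embedding similarity, and defer to the large model when they disagree. The sketch below is illustrative only, not the paper's implementation: `embed` is a hashed bag-of-words placeholder standing in for a real sentence-embedding model, and the 0.8 agreement threshold is an assumed value.

```python
import zlib

import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a sentence embedding (hashed bag-of-words).

    A real deployment would use a proper sentence-embedding model; this
    placeholder only exists to keep the sketch self-contained.
    """
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def semantic_agreement(outputs: list[str]) -> float:
    """Mean pairwise cosine similarity among ensemble outputs."""
    embs = [embed(o) for o in outputs]
    pairs = [float(embs[i] @ embs[j])
             for i in range(len(embs))
             for j in range(i + 1, len(embs))]
    return sum(pairs) / len(pairs)


def route(small_model_samples: list[str], threshold: float = 0.8) -> str:
    """Accept the small model's answer when its samples agree semantically;
    otherwise defer to the large model. Threshold is an assumed value."""
    if semantic_agreement(small_model_samples) >= threshold:
        return "accept_small"
    return "defer_to_large"
```

In a cascade, `small_model_samples` would be several diverse generations from the small model for the same prompt; high mean pairwise similarity signals a reliable answer, while disagreement triggers deferral to the large model. Because the signal needs only output text and embeddings, no access to model internals is required.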

📝 Abstract
Cascade systems route requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balancing cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement -- meaning-level consensus between ensemble outputs -- as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluating models from 500M to 70B parameters, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
Problem

Research questions and friction points this paper is trying to address.

Balancing cost and quality in LLM cascade systems
Determining output reliability in open-ended text generation
Whether semantic agreement can serve as a training-free deferral signal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses semantic agreement for deferral decisions
Matches target-model quality at 40% of the cost
Works across black-box APIs without internals