🤖 AI Summary
Large language model (LLM) deployment faces high computational overhead and low resource utilization, and existing cascading approaches struggle to jointly optimize small-model confidence and task capability. This paper proposes the Gatekeeper loss function, which jointly calibrates a small model's confidence and task performance, supporting dynamic trade-offs between accuracy and deferral rate without architectural modifications. The approach integrates confidence-aware loss optimization, a lightweight active routing mechanism for the small model, and cross-architecture adaptation (encoder-only, decoder-only, and encoder-decoder models). Evaluated on image classification, language modeling, and vision-language tasks, it substantially improves hard-example identification, reduces LLM invocation rates by 15–40%, and maintains end-to-end accuracy, achieving better inference efficiency without compromising overall performance.
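The summary does not give the exact formulation, but the described behavior (be confident where the small model is capable, signal low confidence elsewhere) can be sketched as a simple two-branch loss. Everything below is an illustrative assumption, not the paper's actual Gatekeeper loss: the function name, the uniform-distribution target for hard examples, and the `alpha` trade-off weight are all hypothetical.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def gatekeeper_style_loss(logits, label, model_can_solve, alpha=0.5):
    """Illustrative sketch (NOT the paper's exact loss):
    - examples the small model can solve: standard cross-entropy,
      encouraging a confident correct prediction;
    - examples it cannot solve: cross-entropy against a uniform target,
      pushing the output toward low confidence so the example is deferred.
    alpha (hypothetical) trades task performance against deferral calibration."""
    p = softmax(logits)
    if model_can_solve:
        return -math.log(p[label] + 1e-12)
    uniform = 1.0 / len(p)
    # minimized when p is uniform, i.e. maximally unconfident
    return alpha * sum(-uniform * math.log(pi + 1e-12) for pi in p)
```

Under this sketch, a confident prediction is penalized on hard examples and rewarded on easy ones, which is the joint confidence/capability calibration the summary describes.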
📝 Abstract
Large-scale machine learning models deliver strong performance across a wide range of tasks but incur significant computational and resource costs. To mitigate these costs, smaller local models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of the two models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work, we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. It also incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across tasks and domains without any architectural changes. We evaluate our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
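The routing-and-deferral setup the abstract describes can be sketched as a confidence-thresholded cascade. This is a generic illustration, not the paper's implementation: the function names, the max-probability confidence score, and the threshold `tau` are assumptions.

```python
def cascade_predict(x, small_model, large_model, tau=0.8):
    """Sketch of confidence-thresholded deferral (names illustrative):
    run the small model first; defer to the large model only when the
    small model's top-class probability falls below tau."""
    probs = small_model(x)          # small model returns class probabilities
    conf = max(probs)
    if conf >= tau:
        return probs.index(conf), False   # handled locally
    return large_model(x), True           # deferred to the large model
```

Raising `tau` defers more inputs (higher cost, typically higher accuracy), while lowering it keeps more inputs local; a calibrated small model makes this single threshold a reliable control for that trade-off.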