AI Summary
This work addresses a limitation of existing large language model (LLM) cascades: they route statically, before execution, and so cannot adapt when task difficulty shifts mid-trajectory because of tool failures, truncated observations, or accumulated errors, which leads to poor cost-reliability trade-offs. The authors propose R2V-Agent, a risk-calibrated framework in which a small language model (SLM) collaborates with an LLM. Its central component is a calibrated step-level router that estimates residual failure risk at each step and escalates to the LLM only when warranted. The router is trained with Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures. The SLM policy is trained beforehand with behavioral cloning followed by verifier-guided Direct Preference Optimization (DPO) with consistency regularization. Evaluated on HumanEval+, TextWorld, and TerminalBench, R2V-Agent improves the reliability-cost trade-off: it achieves 94.3% success on HumanEval+ with only 0.60% LLM escalation, raises TextWorld success from 64.6% (SLM only) to 98.2% at 41.7% escalation, and attains 93.3% success on TerminalBench with 33.9% LLM calls, roughly half the cost of a heuristic router.
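The step-level escalation rule described above can be illustrated with a minimal sketch. It assumes the router already outputs a per-step failure probability and that escalation is a simple threshold test. The names `StepDecision`, `route_steps`, `escalation_rate`, and the default threshold of 0.5 are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StepDecision:
    step: int        # index of the step within the trajectory
    risk: float      # estimated probability that the SLM fails from this step
    escalate: bool   # True if the step is handed to the LLM


def route_steps(risks: Sequence[float], threshold: float = 0.5) -> List[StepDecision]:
    """Escalate each step whose estimated residual risk exceeds the threshold.

    The threshold rule is an assumption made for illustration only.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [StepDecision(i, r, r > threshold) for i, r in enumerate(risks)]


def escalation_rate(decisions: Sequence[StepDecision]) -> float:
    """Fraction of steps sent to the LLM, i.e. the cost side of the trade-off."""
    if not decisions:
        return 0.0
    return sum(d.escalate for d in decisions) / len(decisions)
```

Lowering the threshold in this sketch escalates more steps, which trades higher LLM cost for fewer unassisted SLM decisions.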
Abstract
Efficient agentic systems should incur expensive frontier-model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts mid-trajectory (after flaky tool calls, truncated observations, or compounding local errors), making pre-execution routing brittle. We introduce R2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. R2V combines four components: a distilled small language model (SLM) policy, a stronger teacher LLM, a lightweight process verifier that scores candidate actions at each step, and a calibrated step-level router. The router is our central contribution: after the SLM is trained, it estimates residual failure risk at each step and escalates only when teacher intervention is warranted. To make the routing problem well-defined, we first train a stable local SLM using a standard offline pipeline: behavioral cloning (BC) on teacher trajectories, followed by verifier-guided Direct Preference Optimization (DPO) with consistency regularization. The router is then trained on this fixed policy's residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds. Across HumanEval+, TextWorld, and TerminalBench with four SLM backbones, R2V improves the reliability-cost frontier: it achieves 94.3% HumanEval+ success with 0.60% LLM escalation, recovers TextWorld from 64.6% SLM-only success to 98.2% at 41.7% escalation, and reaches 93.3% TerminalBench success at 33.9% LLM calls, roughly half the heuristic-router cost.
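The two quantities named in the router's training objective can be sketched with their standard textbook definitions. The abstract does not give the paper's exact loss, so the following is only an illustration: `brier_score` is the mean squared error between predicted failure probabilities and binary outcomes, and `empirical_cvar` averages the worst `1 - alpha` fraction of per-seed losses. Both function names and the `alpha` parameter are assumptions.

```python
import math
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def empirical_cvar(losses: Sequence[float], alpha: float = 0.9) -> float:
    """Average of the worst (1 - alpha) fraction of losses (standard empirical CVaR).

    Here each loss is assumed to come from one perturbation seed.
    """
    if len(losses) == 0:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    # Round before ceil to avoid floating-point overshoot such as 3.0000000000000004.
    k = max(1, math.ceil(round((1.0 - alpha) * len(losses), 9)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k
```

A lower Brier score indicates better-calibrated risk estimates, while constraining the CVaR term targets the tail of bad seeds rather than the average case.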