T-TAMER: Provably Taming Trade-offs in ML Serving

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-model machine learning serving faces a fundamental multi-objective trade-off among accuracy, latency, and resource consumption—particularly challenging in cascade early-exit scenarios, where existing approaches lack theoretical guarantees. Method: This paper proposes T-Tamer, a unified framework that models serving decisions as a multi-stage stochastic optimization problem. It establishes, for the first time, that “backtracking capability” is both necessary and sufficient for achieving provably Pareto-optimal solutions, and accordingly designs a polynomial-time algorithm. Results: Experiments on vision and NLP early-exit tasks demonstrate that T-Tamer significantly outperforms state-of-the-art heuristic strategies along the accuracy–latency Pareto frontier. Its convergence is theoretically guaranteed, making it the first general-purpose optimization framework for cascade model serving with rigorous theoretical foundations.

📝 Abstract
As machine learning models continue to grow in size and complexity, efficient serving faces increasingly broad trade-offs spanning accuracy, latency, resource usage, and other objectives. Multi-model serving further complicates these trade-offs; for example, in cascaded models, each early-exit decision balances latency reduction against potential accuracy loss. Despite the pervasiveness and importance of such trade-offs, current strategies remain largely heuristic and case-specific, limiting both their theoretical guarantees and general applicability. We present a general framework, T-Tamer, which formalizes this setting as a multi-stage decision process, where the objective is to determine both when to exit and which model to consult. Our main result shows that recall (i.e., the ability to revisit earlier models) is both necessary and sufficient for achieving provable performance guarantees. In particular, we prove that strategies without recall cannot obtain any constant-factor approximation to the optimal trade-off, whereas recall-based strategies provably attain the optimal trade-off in polynomial time. We validate our analysis through experiments on synthetic datasets and early-exit workloads for vision and NLP benchmarks. The results show that recall-based strategies consistently yield efficient accuracy-latency trade-offs. We hope this work provides a principled foundation for bridging heuristic practice with theoretical guarantees in the design of early-exit and cascaded models.
Problem

Research questions and friction points this paper is trying to address.

Formalizing ML serving trade-offs as multi-stage decision processes
Proving recall is necessary for optimal accuracy-latency guarantees
Providing theoretical foundations for early-exit cascaded model design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes multi-stage decision process for model serving
Proves recall enables provable performance guarantees
Achieves optimal accuracy-latency trade-offs in polynomial time
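To make the recall idea concrete, here is a minimal toy sketch (not the paper's actual algorithm) of a cascade where each stage returns a prediction, a confidence score, and a latency cost. A no-recall policy can only use the latest stage's output, while a recall policy may fall back to the best answer seen so far; all function names, thresholds, and stage outputs below are hypothetical illustrations.

```python
def run_cascade(stages, threshold, recall=True):
    """Query stages in order; exit once confidence reaches the threshold.

    With recall=True, the policy keeps the highest-confidence answer seen
    so far, i.e., it may 'revisit' an earlier model's output at exit time.
    With recall=False, only the most recent stage's output is usable.
    """
    total_latency = 0.0
    best = None  # (confidence, prediction)
    for stage in stages:
        pred, conf, latency = stage()
        total_latency += latency
        if recall:
            if best is None or conf > best[0]:
                best = (conf, pred)
        else:
            best = (conf, pred)  # no recall: committed to the latest output
        if best[0] >= threshold:
            break  # early exit: confident enough, skip remaining stages
    return best[1], best[0], total_latency

# Hypothetical example: a fast model that is confident on this input,
# followed by a slower model that happens to be less confident.
stages = [
    lambda: ("cat", 0.9, 5.0),   # small model: high confidence, low latency
    lambda: ("dog", 0.6, 50.0),  # large model: low confidence on this input
]

pred_r, conf_r, _ = run_cascade(stages, threshold=0.95, recall=True)
pred_n, conf_n, _ = run_cascade(stages, threshold=0.95, recall=False)
# With recall, the cascade returns the earlier, more confident answer;
# without recall, it is stuck with the final stage's weaker output.
```

The gap between the two policies in this toy run mirrors the paper's separation result: without the ability to revisit earlier models, a strategy can be forced into arbitrarily poor accuracy at the exit point.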
Yuanyuan Yang
Department of Computer Science & Engineering, University of Washington
Ruimin Zhang
Department of Computer Science, University of Chicago
Jamie Morgenstern
University of Washington
Algorithmic game theory, machine learning, privacy, approximation algorithms
Haifeng Xu
Department of Computer Science, University of Chicago