Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing

📅 2024-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Balancing accuracy and computational cost in large language model (LLM)-based code completion remains challenging because the two are inherently in tension at inference time. Method: This paper proposes a black-box cascaded inference framework that dynamically selects an optimal sequence of models and self-testing intensity at inference time, guided by self-generated unit tests and coordinated multi-model scheduling. It combines a threshold-driven cascaded decision mechanism with a budget-aware heuristic resource allocation strategy, both fully black-box and requiring no access to model internals. Contribution/Results: To the authors' knowledge, this is the first cost-accuracy Pareto optimization for self-testing code generation across multiple operating points. Evaluated across diverse LLM families and benchmark datasets, the approach reduces average computational cost by 26% (up to 70%) while matching or exceeding the generation accuracy of single-model self-testing, enabling elastic, real-time configuration of accuracy-cost trade-offs in production services.
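
The cascade rule itself is simple to state. Below is a minimal Python sketch of one plausible reading of the threshold-driven decision mechanism, not the authors' published implementation: a cheap model answers first, its candidates are scored against its own self-generated unit tests, and the request escalates to the next larger model only if the best pass rate falls below the threshold. The `Model` interface, `generate`, and `gen_tests` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Model:
    name: str
    cost_per_call: float
    # prompt, n -> n candidate programs (hypothetical interface)
    generate: Callable[[str, int], List[str]]
    # prompt, n -> n unit tests, each mapping a candidate program to pass/fail
    gen_tests: Callable[[str, int], List[Callable[[str], bool]]]

def cascade_complete(prompt: str, models: List[Model],
                     n_solutions: int, n_tests: int,
                     threshold: float) -> str:
    """Try models from cheapest to most expensive; accept the first
    candidate whose self-generated test pass rate clears the threshold."""
    best = ""
    for model in models:
        solutions = model.generate(prompt, n_solutions)
        tests = model.gen_tests(prompt, n_tests)
        # Score each candidate by the fraction of self-generated tests it passes.
        scored = [(sum(t(sol) for t in tests) / max(len(tests), 1), sol)
                  for sol in solutions]
        pass_rate, best = max(scored)
        if pass_rate >= threshold:
            return best  # confident enough: stop, saving the larger models' cost
    return best  # nothing cleared the threshold; keep the largest model's best answer
```

In this reading, the threshold is the single dial that trades cost for accuracy: lowering it accepts more cheap-model answers, raising it escalates more requests to expensive models.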

📝 Abstract
The rapid advancement of large language models (LLMs) has significantly improved code completion, yet the trade-off between accuracy and computational cost remains a critical challenge. Using larger models and incorporating inference-time self-testing algorithms can significantly improve output accuracy, but both incur substantial computational expense. Furthermore, servers in real-world scenarios usually have a dynamic preference over the cost-accuracy trade-off, depending on the budget, bandwidth, concurrent user volume, and users' sensitivity to wrong answers. In this work, we introduce a novel framework combining model cascading and inference-time self-feedback algorithms to find multiple near-optimal self-testing options along the cost-accuracy trade-off in LLM-based code generation. Our approach leverages self-generated tests both to enhance accuracy and to evaluate model cascading decisions. As a black-box inference-time method, it requires no access to internal model parameters. We further propose a threshold-based algorithm to determine when to deploy larger models, and a heuristic to optimize the number of solutions, test cases, and test lines generated per model under budget constraints. Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case, across various model families and datasets, while maintaining or improving accuracy in code generation tasks compared to both random and optimal single-model self-testing schemes. To our knowledge, this is the first work to provide a series of choices for optimizing the cost-accuracy trade-off in LLM code generation with self-testing.
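
The abstract describes the budget heuristic only at a high level. One simple way to realize "optimize the number of solutions and test cases per model under a budget" is a grid search over configurations against an expected-cost model. The sketch below is an illustrative assumption, not the authors' algorithm; `est_accuracy` stands in for whatever accuracy estimates (e.g., measured on a calibration set) such a heuristic would consume, and the token counts are made-up constants.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

def allocate_budget(models: List[str],
                    cost_per_token: Dict[str, float],
                    est_accuracy: Dict[Tuple[str, int, int], float],
                    budget: float,
                    tokens_per_solution: int = 200,   # assumed average lengths
                    tokens_per_test: int = 40) -> Optional[Tuple[str, int, int]]:
    """Grid-search a (model, #solutions, #tests) configuration that maximizes
    estimated accuracy subject to an expected-cost budget."""
    best_cfg, best_acc = None, -1.0
    for model, n_sol, n_test in product(models, range(1, 11), range(1, 11)):
        # Expected cost of sampling n_sol candidate programs and n_test unit tests.
        cost = cost_per_token[model] * (n_sol * tokens_per_solution
                                        + n_test * tokens_per_test)
        acc = est_accuracy.get((model, n_sol, n_test), 0.0)
        if cost <= budget and acc > best_acc:
            best_cfg, best_acc = (model, n_sol, n_test), acc
    return best_cfg  # None if no configuration fits the budget
```

Sweeping `budget` over a range of values is what yields the "series of choices" on the cost-accuracy curve that the abstract promises: each budget picks out one operating point.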
Problem

Research questions and friction points this paper is trying to address.

How to optimize the cost-accuracy trade-off in LLM-based code generation
How to combine model cascading with self-testing algorithms
How to reduce computational cost while maintaining accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model cascading for cost efficiency
Self-testing to enhance accuracy and score cascade decisions
Threshold-based algorithm to optimize model usage