Resource-efficient Inference with Foundation Model Programs

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language and multimodal models incur prohibitive computational overhead and deployment costs during inference. Method: This paper proposes the "Foundation Model Program" paradigm: compiling tasks into structured programs that invoke multiple foundation models at varying granularities, and learning input-adaptive, module-level model selection policies. The approach integrates program synthesis, reinforcement learning–driven scheduling optimization, and multimodal model orchestration, introducing a novel dynamic backend scheduling mechanism. Contribution/Results: It enables fine-grained, task-aware co-optimization of computational resources and model capabilities, overcoming the rigid inference constraints of monolithic models. Evaluated on a new streaming visual question answering (VQA) benchmark, the method reduces computational resource consumption by up to 98% compared to state-of-the-art monolithic multimodal models, with less than 1% accuracy degradation, demonstrating strong cost-effectiveness and scalability.

📝 Abstract
The inference-time resource costs of large language and vision models present a growing challenge in production deployments. We propose the use of foundation model programs, i.e., programs that can invoke foundation models with varying resource costs and performance, as an approach to this problem. Specifically, we present a method that translates a task into a program, then learns a policy for resource allocation that, on each input, selects foundation model "backends" for each program module. The policy uses smaller, cheaper backends to handle simpler subtasks, while allowing more complex subtasks to leverage larger, more capable models. We evaluate the method on two new "streaming" visual question-answering tasks in which a system answers a question on a sequence of inputs, receiving ground-truth feedback after each answer. Compared to monolithic multi-modal models, our implementation achieves up to 98% resource savings with minimal accuracy loss, demonstrating its potential for scalable and resource-efficient multi-modal inference.
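To make the abstract's idea concrete, here is a minimal, purely illustrative sketch of a foundation model program with per-module backend selection. All names (`Backend`, `make_backend`, `difficulty`, the module and model names) are hypothetical stand-ins, not the paper's actual API; the paper learns the selection policy, whereas this toy uses a fixed input-difficulty heuristic.

```python
# Hypothetical sketch: a "program" is a pipeline of modules, and each module
# can be served by a cheap or an expensive model "backend". A selection
# policy picks a backend per module, per input, trading cost for capability.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Backend:
    name: str
    cost: float                   # relative resource cost of one call
    run: Callable[[str], str]     # stand-in for an actual model invocation

def make_backend(name: str, cost: float) -> Backend:
    # Placeholder "model": just tags the input with the backend name.
    return Backend(name, cost, lambda x, n=name: f"{n}({x})")

def difficulty(x: str) -> float:
    # Toy difficulty proxy; in the paper this role is played by a
    # learned, input-adaptive policy, not a hand-written heuristic.
    return min(len(x) / 100.0, 1.0)

def select_backend(backends: List[Backend], x: str) -> Backend:
    # Easy inputs go to the cheapest backend, hard ones to the largest.
    return backends[-1] if difficulty(x) > 0.5 else backends[0]

def run_program(modules: Dict[str, List[Backend]], x: str) -> Tuple[str, float]:
    total_cost, out = 0.0, x
    for _name, backends in modules.items():
        b = select_backend(backends, out)
        out = b.run(out)
        total_cost += b.cost
    return out, total_cost

# Two-module streaming-VQA-style pipeline: perceive, then answer.
modules = {
    "caption": [make_backend("small-vlm", 1.0), make_backend("large-vlm", 20.0)],
    "answer":  [make_backend("small-llm", 1.0), make_backend("large-llm", 50.0)],
}
out, cost = run_program(modules, "short query")  # easy input -> cheap backends
```

The cost savings the paper reports come from exactly this kind of routing: when most inputs are easy, most calls hit the cheap backends, and the expensive models are reserved for the minority of hard cases.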
Problem

Research questions and friction points this paper is trying to address.

Reducing inference-time resource costs for large models
Optimizing resource allocation in foundation model programs
Balancing performance and efficiency in multi-modal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foundation model programs for resource-efficient inference
Dynamic policy selects optimal model backends per subtask
Achieves 98% resource savings with minimal accuracy loss
Lunyiu Nie
University of Texas at Austin
Agent Orchestration, Programming Languages, Natural Language Processing
Zhimin Ding
Rice University
Kevin Yu
The University of Texas at Austin
Marco Cheung
The University of Texas at Austin
Christopher Jermaine
Rice University
Swarat Chaudhuri
UT Austin, Google DeepMind
Automated Reasoning, Machine Learning, Formal Methods, Programming Languages