🤖 AI Summary
Large language and multimodal models incur prohibitive computational overhead and deployment costs during inference.
Method: This paper proposes the “Foundation Model Program” paradigm: compiling tasks into structured programs that invoke multiple foundation models at varying granularities, and learning input-adaptive, module-level model selection policies. The approach integrates program synthesis, reinforcement learning–driven scheduling optimization, and multimodal model orchestration, introducing a novel dynamic backend scheduling mechanism.
Contribution/Results: It enables fine-grained, task-aware co-optimization of computational resources and model capabilities, overcoming the rigid inference constraints of monolithic models. Evaluated on a new streaming visual question answering (VQA) benchmark, the method reduces computational resource consumption by up to 98% compared to state-of-the-art monolithic multimodal models, with less than 1% accuracy degradation—demonstrating strong cost-effectiveness and scalability.
📝 Abstract
The inference-time resource costs of large language and vision models present a growing challenge in production deployments. We propose the use of foundation model programs, i.e., programs that can invoke foundation models with varying resource costs and performance, as an approach to this problem. Specifically, we present a method that translates a task into a program, then learns a policy for resource allocation that, on each input, selects foundation model "backends" for each program module. The policy uses smaller, cheaper backends to handle simpler subtasks, while allowing more complex subtasks to leverage larger, more capable models. We evaluate the method on two new "streaming" visual question-answering tasks in which a system answers a question on a sequence of inputs, receiving ground-truth feedback after each answer. Compared to monolithic multi-modal models, our implementation achieves up to 98% resource savings with minimal accuracy loss, demonstrating its potential for scalable and resource-efficient multi-modal inference.
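The core idea — program modules that each choose among foundation model backends of different cost, with an input-adaptive selection policy — can be sketched as follows. All names here (`Backend`, `Module`, `select_backend`, the difficulty score, and the toy threshold rule standing in for the learned policy) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Backend:
    """One interchangeable implementation of a program module."""
    name: str
    cost: float                    # relative resource cost per call
    run: Callable[[str], str]      # backend inference function

@dataclass
class Module:
    """A module of a foundation model program, with backends ordered
    cheapest to most capable."""
    name: str
    backends: List[Backend]

def select_backend(module: Module, difficulty: float) -> Backend:
    """Toy input-adaptive policy: route easy inputs (difficulty near 0)
    to cheap backends and hard inputs (near 1) to capable ones.
    A learned, module-level policy would replace this threshold rule."""
    index = min(int(difficulty * len(module.backends)),
                len(module.backends) - 1)
    return module.backends[index]

# Two stand-in backends for a hypothetical captioning module.
small = Backend("small-vlm", cost=1.0, run=lambda x: f"small:{x}")
large = Backend("large-vlm", cost=50.0, run=lambda x: f"large:{x}")
caption = Module("caption", [small, large])

easy = select_backend(caption, difficulty=0.1)   # cheap backend
hard = select_backend(caption, difficulty=0.9)   # capable backend
```

In a full system, the policy would be trained from the streaming ground-truth feedback the abstract describes, trading off each backend's cost against its expected accuracy on the current input.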