🤖 AI Summary
Large language and multimodal models incur prohibitive computational overhead and deployment costs during inference.
Method: This paper proposes the “Foundation Model Program” paradigm: compiling tasks into structured programs that invoke multiple foundation models at varying granularities, and learning input-adaptive, module-level model selection policies. The approach integrates program synthesis, reinforcement learning–driven scheduling optimization, and multimodal model orchestration, introducing a novel dynamic backend scheduling mechanism.
Contribution/Results: It enables fine-grained, task-aware co-optimization of computational resources and model capabilities, overcoming the rigid inference constraints of monolithic models. Evaluated on a new streaming visual question answering (VQA) benchmark, the method reduces computational resource consumption by up to 98% compared to state-of-the-art monolithic multimodal models, with less than 1% accuracy degradation—demonstrating strong cost-effectiveness and scalability.
📝 Abstract
The inference-time resource costs of large language and vision models present a growing challenge in production deployments. We propose the use of foundation model programs, i.e., programs that can invoke foundation models with varying resource costs and performance, as an approach to this problem. Specifically, we present a method that translates a task into a program, then learns a policy for resource allocation that, on each input, selects foundation model "backends" for each program module. The policy uses smaller, cheaper backends to handle simpler subtasks, while allowing more complex subtasks to leverage larger, more capable models. We evaluate the method on two new "streaming" visual question-answering tasks in which a system answers a question on a sequence of inputs, receiving ground-truth feedback after each answer. Compared to monolithic multi-modal models, our implementation achieves up to 98% resource savings with minimal accuracy loss, demonstrating its potential for scalable and resource-efficient multi-modal inference.
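The core idea — program modules that each choose among foundation model backends of different cost, with an input-adaptive selection policy — can be sketched as follows. All names here (`Backend`, `Module`, `select_backend`, the difficulty score, and the toy threshold rule standing in for the learned policy) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Backend:
    """One interchangeable implementation of a program module."""
    name: str
    cost: float                    # relative resource cost per call
    run: Callable[[str], str]      # backend inference function

@dataclass
class Module:
    """A module of a foundation model program, with backends ordered
    cheapest to most capable."""
    name: str
    backends: List[Backend]

def select_backend(module: Module, difficulty: float) -> Backend:
    """Toy input-adaptive policy: route easy inputs (difficulty near 0)
    to cheap backends and hard inputs (near 1) to capable ones.
    A learned, module-level policy would replace this threshold rule."""
    index = min(int(difficulty * len(module.backends)),
                len(module.backends) - 1)
    return module.backends[index]

# Two stand-in backends for a hypothetical captioning module.
small = Backend("small-vlm", cost=1.0, run=lambda x: f"small:{x}")
large = Backend("large-vlm", cost=50.0, run=lambda x: f"large:{x}")
caption = Module("caption", [small, large])

easy = select_backend(caption, difficulty=0.1)   # cheap backend
hard = select_backend(caption, difficulty=0.9)   # capable backend
```

In a full system, the policy would be trained from the streaming ground-truth feedback the abstract describes, trading off each backend's cost against its expected accuracy on the current input.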