Mordal: Automated Pretrained Model Selection for Vision Language Models

📅 2025-02-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the inefficiency and manual effort of vision language model (VLM) selection, this paper proposes the first automated VLM search framework. The method combines a task-aware model similarity metric, a lightweight surrogate evaluator, and a multi-stage candidate pruning strategy, which together preserve evaluation fidelity while sharply reducing computational cost. Compared with exhaustive grid search, it cuts GPU-hour consumption by 8.9–11.6×. It accurately identifies strong VLM configurations across diverse downstream domains, surfacing new VLMs that outperform state-of-the-art counterparts. The core contribution is an end-to-end, efficient, and generalizable VLM selection paradigm that balances scalability with reproducibility, offering a systematic, deployable solution for practical VLM integration.

๐Ÿ“ Abstract
Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using up to $8.9\times$--$11.6\times$ lower GPU hours than grid search. In the process of our evaluation, we have also discovered new VLMs that outperform their state-of-the-art counterparts.
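The abstract describes a two-pronged search: first shrink the candidate pool, then spend as little compute as possible evaluating each survivor. A minimal sketch of that idea, assuming hypothetical `similarity` and `partial_score` functions and a successive-halving-style budget schedule (none of these names or details come from the paper itself):

```python
# Illustrative sketch of an automated VLM selection loop (not Mordal's
# actual API): prune near-duplicate candidates by pairwise similarity,
# then evaluate survivors under growing training budgets, halving the
# pool at each stage instead of fully training every candidate.

def prune_candidates(candidates, similarity, threshold=0.9):
    """Keep one representative per cluster of highly similar candidates."""
    kept = []
    for cand in candidates:
        if all(similarity(cand, k) < threshold for k in kept):
            kept.append(cand)
    return kept

def select_best(candidates, similarity, partial_score, budgets=(1, 2, 4)):
    """Successive-halving-style search: at each budget level, score the
    remaining pool cheaply and discard the worse half."""
    pool = prune_candidates(candidates, similarity)
    for budget in budgets:
        scored = sorted(pool, key=lambda c: partial_score(c, budget),
                        reverse=True)
        pool = scored[: max(1, len(scored) // 2)]
    return pool[0]
```

With a pool of N candidates, this evaluates far fewer than N full training runs, which is the kind of saving the abstract attributes to Mordal's candidate reduction and per-candidate evaluation speedup.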
Problem

Research questions and friction points this paper is trying to address.

Vision Language Models
Automatic Model Selection
Task-specific Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mordal System
Automatic Model Selection
Vision Language Models