AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution

2026-04-16

AI Summary
Deploying language-aligned vision foundation models (VFMs) on edge devices is constrained by strict latency and power budgets. AdaVFM is an adaptive inference framework that integrates neural architecture search (NAS) into the language-aligned VFM backbone so that lightweight subnetworks can run on the device, while a cloud-hosted multimodal large language model acts as a context-aware agent that adjusts the on-device computational load at runtime. In zero-shot classification on ImageNet-1K, AdaVFM improves top-1 accuracy by up to 7.9% over the best baselines of comparable VFM size; in open-vocabulary segmentation on ADE20K, it improves mIoU by up to 5.2%. Compared with models of similar accuracy, it reduces average FLOPs by up to 77.9%.

Abstract
Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to $7.9\%$ in acc@1 on IN1K and $5.2\%$ mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to $77.9\%$.
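The runtime-adaptive idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Subnet` class, the `pick_subnet` and `average_flops_saving` helpers, and every GFLOPs and accuracy figure below are hypothetical placeholders chosen only for illustration. The sketch assumes a controller (in the paper, a cloud-hosted multimodal LLM agent) has already turned the scene context into a task name and a minimum acceptable accuracy, and it then picks the cheapest subnet that meets that target.

```python
from dataclasses import dataclass


@dataclass
class Subnet:
    """One hypothetical subnet sampled from a NAS supernet."""
    name: str
    gflops: float        # compute cost of one forward pass (placeholder value)
    est_accuracy: dict   # task name -> estimated accuracy (placeholder values)


# Made-up numbers: shrinking the model barely hurts classification here
# but hurts segmentation a lot, mimicking a task-dependent trade-off.
SUBNETS = [
    Subnet("small", 1.0, {"classification": 0.62, "segmentation": 0.30}),
    Subnet("medium", 2.0, {"classification": 0.66, "segmentation": 0.38}),
    Subnet("large", 4.0, {"classification": 0.68, "segmentation": 0.42}),
]


def pick_subnet(subnets, task, min_accuracy):
    """Return the cheapest subnet meeting the accuracy target for `task`.

    If no subnet meets the target, fall back to the most accurate one.
    """
    feasible = [s for s in subnets if s.est_accuracy.get(task, 0.0) >= min_accuracy]
    if feasible:
        return min(feasible, key=lambda s: s.gflops)
    return max(subnets, key=lambda s: s.est_accuracy.get(task, 0.0))


def average_flops_saving(choices, full_gflops):
    """Fractional reduction in mean FLOPs relative to always running the full model."""
    mean_gflops = sum(s.gflops for s in choices) / len(choices)
    return 1.0 - mean_gflops / full_gflops


# Stand-in for controller decisions: (task, minimum acceptable accuracy).
requests = [
    ("classification", 0.60),
    ("segmentation", 0.40),
    ("classification", 0.65),
    ("segmentation", 0.30),
]
choices = [pick_subnet(SUBNETS, task, target) for task, target in requests]
saving = average_flops_saving(choices, full_gflops=4.0)
```

With these placeholder values, easy requests are served by the small subnet and only the demanding segmentation request uses the large one, which is how an adaptive policy can lower average FLOPs without lowering the accuracy target of any individual request.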
Problem

Research questions and friction points this paper is trying to address.

vision foundation models
edge intelligence
latency constraints
power constraints
on-device inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive inference
vision foundation models
neural architecture search
multimodal LLM
edge intelligence