AI Summary
Deploying language-aligned vision foundation models on edge devices faces stringent latency and power constraints. This work proposes AdaVFM, an adaptive inference framework that, for the first time, integrates neural architecture search (NAS) into language-aligned vision foundation models and uses a cloud-based multimodal large language model for context-aware runtime control, dynamically adjusting the computational load of lightweight subnetworks on the device. On zero-shot classification on ImageNet-1K, AdaVFM improves top-1 accuracy by up to 7.9% over the best baselines of comparable model size, and on open-vocabulary segmentation on ADE20K it improves mIoU by up to 5.2%. Compared with models of similar accuracy, the method reduces average FLOPs by up to 77.9%.
Abstract
Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, which motivates a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution at runtime. A multimodal large language model (LLM) deployed in the cloud acts as a context-aware agent that provides runtime control. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing the best prior models of comparable VFM size by up to $7.9\%$ in acc@1 on IN1K and $5.2\%$ in mIoU on ADE20K. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to $77.9\%$.
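The runtime-adaptive idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `Subnet` class, the `select_subnet` function, the task labels, and every number below are hypothetical placeholders chosen for illustration. The sketch assumes that a controller (in AdaVFM, a context-aware agent backed by a cloud multimodal LLM) has already mapped the current scene to a task label and a minimum acceptable accuracy, and it then picks the cheapest NAS-derived subnet expected to meet that requirement.

```python
from dataclasses import dataclass, field


@dataclass
class Subnet:
    """Hypothetical description of one subnet sampled from a NAS supernet."""
    name: str
    gflops: float                                      # placeholder compute cost
    est_accuracy: dict = field(default_factory=dict)   # placeholder per-task estimates


def select_subnet(subnets, task, min_accuracy):
    """Pick the cheapest subnet whose estimated accuracy on `task` meets
    `min_accuracy`; if none qualifies, fall back to the most accurate one."""
    feasible = [s for s in subnets if s.est_accuracy.get(task, 0.0) >= min_accuracy]
    if feasible:
        return min(feasible, key=lambda s: s.gflops)
    return max(subnets, key=lambda s: s.est_accuracy.get(task, 0.0))


# Made-up numbers, only to show that the effect of shrinking the model
# can differ by task (small loss on "coarse", large loss on "fine").
SUBNETS = [
    Subnet("small", 1.0, {"coarse": 0.70, "fine": 0.40}),
    Subnet("medium", 2.5, {"coarse": 0.74, "fine": 0.55}),
    Subnet("large", 5.0, {"coarse": 0.76, "fine": 0.65}),
]

if __name__ == "__main__":
    # The (task, min_accuracy) pair stands in for the agent's decision.
    print(select_subnet(SUBNETS, "coarse", 0.70).name)
    print(select_subnet(SUBNETS, "fine", 0.60).name)
```

With these placeholder values, a coarse task that tolerates a small accuracy drop is served by the smallest subnet, while a fine-grained task forces the largest one. That task dependence is the stated motivation for adapting computation at runtime rather than fixing one model size.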