AIDE: Agentically Improve Visual Language Model with Domain Experts

📅 2025-02-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
State-of-the-art vision-language models (VLMs) rely heavily on knowledge distillation from increasingly larger teacher models, creating a fundamental bottleneck: no stronger teacher, no further improvement. Method: AIDE (Agentic Improvement through Domain Experts) is a four-stage enhancement framework: (1) identifying instances for refinement, (2) on-demand invocation of specialized domain expert models, (3) synthesis of multi-source expert outputs with existing data, and (4) injection of the augmented instances into the training pipeline. AIDE requires neither larger VLMs nor human annotations. Contribution/Results: AIDE enables self-driven capability expansion of VLMs without human supervision or dependence on stronger teachers, achieving notable performance gains across major multimodal benchmarks, including MMMU, MME, and MMBench, and demonstrating a scalable path to self-improving VLMs when no superior teacher model is available.

📝 Abstract
The enhancement of Visual Language Models (VLMs) has traditionally relied on knowledge distillation from larger, more capable models. This dependence creates a fundamental bottleneck for improving state-of-the-art systems, particularly when no superior models exist. We introduce AIDE (Agentic Improvement through Domain Experts), a novel framework that enables VLMs to autonomously enhance their capabilities by leveraging specialized domain expert models. AIDE operates through a four-stage process: (1) identifying instances for refinement, (2) engaging domain experts for targeted analysis, (3) synthesizing expert outputs with existing data, and (4) integrating enhanced instances into the training pipeline. Experiments on multiple benchmarks, including MMMU, MME, and MMBench, demonstrate AIDE's ability to achieve notable performance gains without relying on larger VLMs or human supervision. Our framework provides a scalable, resource-efficient approach to continuous VLM improvement, addressing critical limitations in current methodologies and proving particularly valuable when larger models are unavailable.
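The four-stage loop described in the abstract can be sketched as a simple refinement round. The sketch below is illustrative only: the `Instance` data shape, the confidence-threshold selection rule, and the concatenation-style synthesis are my assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    image_id: str
    question: str
    answer: str
    confidence: float  # hypothetical self-reported confidence of the VLM

def identify_for_refinement(data, threshold=0.5):
    # Stage 1: select instances the VLM is least confident about (assumed criterion)
    return [x for x in data if x.confidence < threshold]

def engage_experts(instance, experts):
    # Stage 2: invoke each domain expert model on the instance
    return [expert(instance) for expert in experts]

def synthesize(instance, expert_outputs):
    # Stage 3: merge expert outputs with the existing answer (naive concatenation here)
    merged = " ".join([instance.answer, *expert_outputs]).strip()
    return Instance(instance.image_id, instance.question, merged, confidence=1.0)

def aide_round(data, experts, threshold=0.5):
    # Stage 4: integrate the enhanced instances back into the training pool
    refined = [synthesize(x, engage_experts(x, experts))
               for x in identify_for_refinement(data, threshold)]
    kept = [x for x in data if x.confidence >= threshold]
    return kept + refined
```

In practice each stage would be far richer (expert routing, quality filtering, actual fine-tuning), but the sketch shows the control flow: low-confidence data flows out to experts and re-enters the training set enhanced.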
Problem

Research questions and friction points this paper is trying to address.

Enhance Visual Language Models autonomously
Leverage domain expert models for improvement
Achieve performance gains without larger models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages domain expert models
Autonomous VLM capability enhancement
Four-stage refinement process