Automated Model Discovery via Multi-modal & Multi-step Pipeline

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing automated model discovery methods struggle to capture fine-grained detail, generalize well, and keep model complexity low at the same time. To address this, we propose a framework of two collaborating vision-language model (VLM) modules, AnalyzerVLM and EvaluatorVLM. Using multimodal perception and multi-step reasoning, AnalyzerVLM autonomously analyzes the data and proposes candidate models within a large combinatorial search space, and EvaluatorVLM assesses them. The evaluation combines quantitative metrics with perceptual (visual) judgments to balance local fidelity against global generalization. Experiments across multiple benchmarks show that the method improves model discovery quality, achieving a better balance between high-detail fidelity and strong out-of-distribution generalization. Ablation studies indicate that multimodality and stepwise reasoning are both critical drivers of the performance gains.
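
The joint evaluation of local fidelity and global generalization can be illustrated with a minimal sketch. It assumes candidate models are linear combinations of hand-picked basis functions, measures local fidelity as RMSE on the fitting region and generalization as RMSE on a held-out extrapolation region, and adds a complexity penalty. The perceptual judgment, which the framework obtains from a VLM, appears only as an optional number in [0, 1]. The function `evaluate_candidate`, its weights, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def evaluate_candidate(features, x, y, split, perceptual_score=None,
                       complexity_weight=0.01, perceptual_weight=0.5):
    """Score one candidate model (lower is better).

    features: list of callables, each mapping x to one column of a linear
        basis; a hypothetical stand-in for a candidate model.
    split: x-value separating the fitting region (x <= split) from the
        held-out extrapolation region (x > split).
    perceptual_score: optional value in [0, 1], assumed to come from a VLM
        judging a plot of the fit; it is not computed here.
    """
    X = np.column_stack([f(x) for f in features])
    train = x <= split
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    pred = X @ coef
    # Local fidelity: error inside the region the model was fitted on.
    local_rmse = float(np.sqrt(np.mean((pred[train] - y[train]) ** 2)))
    # Global generalization: error when extrapolating past that region.
    extrap_rmse = float(np.sqrt(np.mean((pred[~train] - y[~train]) ** 2)))
    # Quantitative terms plus a penalty on the number of basis functions.
    total = local_rmse + extrap_rmse + complexity_weight * len(features)
    if perceptual_score is not None:
        # A perfect perceptual score adds nothing; a poor one adds up to the weight.
        total += perceptual_weight * (1.0 - perceptual_score)
    return {"local_rmse": local_rmse, "extrap_rmse": extrap_rmse, "total": total}


# Toy data: a linear trend plus a periodic component.
x = np.linspace(0.0, 10.0, 200)
y = np.sin(x) + 0.5 * x

linear = [np.ones_like, lambda t: t]
periodic = [np.ones_like, lambda t: t, np.sin]

linear_result = evaluate_candidate(linear, x, y, split=7.0)
periodic_result = evaluate_candidate(periodic, x, y, split=7.0)
print("linear:  ", linear_result)
print("periodic:", periodic_result)
```

The linear candidate fits the training region loosely and drifts when extrapolating, while the periodic candidate matches both regions at the cost of one extra term; the weights decide how such trade-offs are resolved.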

📝 Abstract

Automated model discovery is the process of automatically searching for and identifying the most appropriate model for a given dataset over a large combinatorial search space. Existing approaches, however, often struggle to balance capturing fine-grained details with generalizing beyond the training data regime while keeping model complexity reasonable. In this paper, we present a multi-modal & multi-step pipeline for effective automated model discovery. Our approach leverages two vision-language model (VLM)-based modules, AnalyzerVLM and EvaluatorVLM, for effective model proposal and evaluation in an agentic way. AnalyzerVLM autonomously plans and executes multi-step analyses to propose effective candidate models. EvaluatorVLM assesses the candidate models both quantitatively and perceptually, judging how well they fit local details and how well they generalize to overall trends. Our results demonstrate that our pipeline effectively discovers models that capture fine details and ensure strong generalizability. Additionally, extensive ablation studies show that both multi-modality and multi-step reasoning play crucial roles in discovering favorable models.
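
The propose-and-evaluate loop described in the abstract can be sketched as a search over a small combinatorial space. In this hypothetical sketch, candidate models are sums of terms drawn from a `BASE_TERMS` library, `propose` is a greedy forward-selection rule standing in for AnalyzerVLM's multi-step analysis, and `score` is a purely quantitative stand-in for EvaluatorVLM (extrapolation RMSE plus a per-term penalty). None of these names, the penalty value, or the greedy strategy come from the paper.

```python
import numpy as np

# Hypothetical library of terms; candidate models are sums of these.
BASE_TERMS = {
    "const": np.ones_like,
    "x": lambda x: x,
    "x^2": lambda x: x ** 2,
    "sin": np.sin,
    "cos": np.cos,
}


def score(terms, x, y, split, penalty=0.05):
    """Quantitative stand-in for EvaluatorVLM (lower is better)."""
    X = np.column_stack([BASE_TERMS[t](x) for t in terms])
    train = x <= split
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    # Extrapolation RMSE on the held-out region, plus a per-term penalty.
    resid = X[~train] @ coef - y[~train]
    return float(np.sqrt(np.mean(resid ** 2))) + penalty * len(terms)


def propose(current):
    """Hypothetical stand-in for AnalyzerVLM: extend the current model by one term."""
    return [current + (t,) for t in BASE_TERMS if t not in current]


def discover(x, y, split, max_steps=5):
    """Alternate proposal and evaluation until no candidate improves the score."""
    best, best_score = (), float("inf")
    for _ in range(max_steps):
        candidates = propose(best)
        if not candidates:
            break
        cand_score, cand = min((score(c, x, y, split), c) for c in candidates)
        if cand_score >= best_score:
            break  # No proposal beats the current model: stop searching.
        best, best_score = cand, cand_score
    return best, best_score


x = np.linspace(0.0, 10.0, 200)
y = 0.5 * x + 0.3 * np.sin(x)
terms, final_score = discover(x, y, split=7.0)
print(terms, round(final_score, 3))
```

On this toy target the greedy loop first picks the linear trend and then adds the periodic term; stopping as soon as no proposal improves the score keeps unnecessary terms out. In the paper's pipeline, the fixed expansion rule is where AnalyzerVLM's planned, multi-step analyses would take over.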

Problem

Research questions and friction points this paper is trying to address.

Automated model discovery struggles to balance fine details against generalizability

Multi-modal pipeline uses vision-language modules for model proposal

Candidate models must be evaluated both quantitatively and perceptually

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal pipeline uses vision-language modules

AnalyzerVLM plans and executes multi-step analyses to propose candidate models

EvaluatorVLM assesses models quantitatively and perceptually