🤖 AI Summary
LLM-based agents for automated machine learning (AutoML) face a trade-off between cost and task success: strong models such as GPT-4 are expensive, while no-cost and low-cost models perform far worse in a single-agent setting.
Method: This paper proposes a low-cost multi-agent system built on the no-cost base model Gemini-Pro. Its architecture combines expert role specialization, profile-driven selection of experts, retrieval of past observations, and LLM cascading, with GPT-4 used in the cascade and for occasional ask-the-expert planning calls.
Contribution/Results: The system reaches a 32.95% average success rate on the MLAgentBench benchmark, 10.23 percentage points above the GPT-4-only single-agent baseline (22.72%), while cutting the average per-run cost by 94.2%, from $0.931 to $0.054. The results indicate that a low-cost base model organized in a collaborative multi-agent framework, with only occasional calls to a stronger model, can outperform a single-agent system built on that stronger model. This points to a cost-effective approach to AutoML.
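The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the abstract (per-run costs of $0.931 and $0.054, success rates of 22.72% and 32.95%); the variable names are illustrative.

```python
# Figures as reported in the abstract (averaged over all MLAgentBench tasks).
gpt4_cost_per_run = 0.931      # dollars, GPT-4 single-agent baseline
system_cost_per_run = 0.054    # dollars, proposed multi-agent system
gpt4_success_rate = 22.72      # percent
system_success_rate = 32.95    # percent

# Relative cost reduction, in percent.
cost_reduction_pct = (gpt4_cost_per_run - system_cost_per_run) / gpt4_cost_per_run * 100

# Absolute gain in success rate, in percentage points.
success_gain_pp = system_success_rate - gpt4_success_rate

print(f"cost reduction: {cost_reduction_pct:.1f}%")             # cost reduction: 94.2%
print(f"success gain: {success_gain_pp:.2f} percentage points")  # success gain: 10.23 percentage points
```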
📝 Abstract
Large Language Models (LLMs) excel in diverse applications, including the generation of code snippets, but often struggle to generate code for complex Machine Learning (ML) tasks. Although existing LLM single-agent systems show varying performance depending on task complexity, they rely purely on larger, expensive models such as GPT-4. Our investigation reveals that no-cost and low-cost models such as Gemini-Pro, Mixtral, and CodeLlama perform far worse than GPT-4 in a single-agent setting. Motivated by the goal of a cost-efficient LLM-based solution for ML tasks, we propose an LLM multi-agent system that leverages a combination of experts using profiling, efficient retrieval of past observations, LLM cascades, and ask-the-expert calls. Through empirical analysis on ML engineering tasks in the MLAgentBench benchmark, we demonstrate the effectiveness of our system using a no-cost model, namely Gemini, as the base LLM, paired with GPT-4 both in the cascade and as the expert serving occasional ask-the-expert calls for planning. With a 94.2% reduction in cost (from $0.931 per run, averaged over all tasks, for the GPT-4 single-agent system to $0.054), our system yields a better average success rate of 32.95%, compared with 22.72% for the GPT-4 single-agent system, averaged over all MLAgentBench tasks.
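To make the idea of an LLM cascade concrete, here is a minimal sketch under stated assumptions. It is not the paper's implementation: the abstract does not say how the system decides when to escalate, so the confidence score, the `threshold` parameter, the `Tier` class, the per-call costs, and the stub models are all hypothetical. The general pattern is to query the cheap model first and call the expensive one only when the cheap answer looks unreliable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    """One model in the cascade (all fields are illustrative assumptions)."""
    name: str
    cost_per_call: float  # hypothetical dollars per call
    generate: Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])


def cascade(prompt: str, tiers: List[Tier], threshold: float = 0.7) -> Tuple[str, str, float]:
    """Try tiers from cheapest to most expensive; stop at the first confident answer.

    Returns (answer, name of the tier that produced it, total cost spent).
    """
    if not tiers:
        raise ValueError("cascade needs at least one tier")
    total_cost = 0.0
    for tier in tiers:
        answer, confidence = tier.generate(prompt)
        total_cost += tier.cost_per_call
        if confidence >= threshold:
            break  # confident enough, no need to escalate
    # If no tier was confident, the last (strongest) tier's answer is returned.
    return answer, tier.name, total_cost


# Stub models standing in for real LLM calls; no actual API is used here.
def cheap_stub(prompt: str) -> Tuple[str, float]:
    confidence = 0.3 if "hard" in prompt else 0.9
    return f"cheap answer to: {prompt}", confidence


def strong_stub(prompt: str) -> Tuple[str, float]:
    return f"strong answer to: {prompt}", 0.95


tiers = [
    Tier("cheap", 0.0, cheap_stub),      # stands in for a no-cost base model
    Tier("strong", 0.03, strong_stub),   # stands in for a paid, stronger model
]

easy = cascade("easy task", tiers)  # answered by the cheap tier, no cost incurred
hard = cascade("hard task", tiers)  # escalated to the strong tier
```

Because the expensive tier is called only on escalation, the average cost per run depends on how often the cheap tier falls below the threshold, which is the lever a cascade uses to trade cost against success rate.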