EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair

📅 2026-03-05
🤖 AI Summary
This work addresses the scarcity of high-quality, domain-specific training data for function-calling agents: existing benchmarks often lack coverage of executable environments, backing databases, and multi-turn interaction trajectories, and suffer from systematic errors. To this end, we propose the first end-to-end self-evolving multi-agent data platform, orchestrated by an EigenCore coordinator that integrates a DatabaseAgent, a CodingAgent, and a DataAgent. The platform automatically synthesizes, audits, and repairs function-calling data through a cross-component consistency feedback mechanism, combining iterative test-driven debugging with self-evolving prompt optimization. We further introduce an outcome-aware evaluation protocol grounded in database-state correctness. Experiments demonstrate that our approach corrects function-schema, implementation, and trajectory errors in the BFCL-V3 benchmark, substantially improving the alignment of model rankings with human judgments of functional correctness.

📝 Abstract
Function-calling agents -- large language models that invoke tools and APIs -- require high-quality, domain-specific training data spanning executable environments, backing databases, and diverse multi-turn trajectories. We introduce EigenData, an integrated, self-evolving platform that automates the full data lifecycle through a multi-agent architecture. A top-level orchestrator, EigenCore, coordinates three specialized sub-systems: DatabaseAgent for realistic domain database construction, CodingAgent for verified executable environment generation with iterative test-debug loops, and DataAgent for multi-turn trajectory synthesis with self-evolving prompt optimization. Cross-component feedback ensures consistency across all artifacts. We apply EigenData to audit and repair the Berkeley Function-Calling Leaderboard (BFCL-V3): it identifies systematic errors in function schemas, implementations, and reference trajectories, and automatically corrects them through coordinated schema refinement, code-level bug fixes, and trajectory modification. We further introduce an outcome-aware evaluation protocol that assesses task success via database-state correctness rather than turn-level trajectory matching. We demonstrate that the repaired benchmark, coupled with outcome-aware metrics, produces model rankings substantially better correlated with human judgments of functional correctness.
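The outcome-aware protocol described in the abstract scores a task by the final database state rather than by matching the model's tool-call trajectory turn by turn. A minimal sketch of that idea, with all names hypothetical (the paper's actual scorer is not published):

```python
def outcome_aware_success(final_db, expected_db, check_keys=None):
    """Judge task success by comparing the final database state to the
    expected state, instead of matching tool calls turn by turn.
    `check_keys` optionally restricts the comparison to task-relevant fields."""
    keys = check_keys if check_keys is not None else expected_db.keys()
    return all(final_db.get(k) == expected_db[k] for k in keys)
```

Under this metric, two trajectories that issue different tool calls in different orders but leave the database in the same end state are both counted as successes, which is what turn-level trajectory matching penalizes.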
Problem

Research questions and friction points this paper is trying to address.

function-calling agents
training data synthesis
benchmark auditing
multi-turn trajectories
evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving multi-agent
function-calling data synthesis
outcome-aware evaluation
automated benchmark repair
executable environment generation