Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

πŸ“… 2026-02-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work proposes a novel heterogeneous multi-agent framework that addresses a key limitation of prevailing static, homogeneous agent architectures: their inability to harness the complementary capabilities of differently post-trained models. The framework introduces a coordinator that dynamically schedules domain-specialized tool agents, combining a self-evaluation protocol with a calibration mechanism so that test-time collaborative reasoning is grounded in the agents’ capability disparities. By supporting on-demand activation of heterogeneous agents, the approach achieves substantial performance gains over homogeneous baselines across five benchmarks, notably attaining 96.67% accuracy on AIME24 and 72.53% on LiveCodeBench, and demonstrating its effectiveness in leveraging diverse model competencies for complex reasoning tasks.

πŸ“ Abstract
Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.
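The abstract's orchestrator-tool loop can be illustrated with a minimal sketch: tool agents publish self-assessed proficiency profiles (the self-assessment protocol), and the orchestrator dynamically activates the best-matching agent per task. All names here (`ToolAgent`, `Orchestrator`, `select`, the proficiency scores) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class ToolAgent:
    """Hypothetical domain-specialized tool agent."""
    name: str
    # Self-assessed proficiency per domain (sketch of the paper's
    # self-assessment protocol): each agent profiles its own
    # post-training strengths as scores in [0, 1].
    proficiency: dict = field(default_factory=dict)

    def solve(self, task: str) -> str:
        # Placeholder for the agent's underlying model call.
        return f"{self.name} answer to: {task}"

class Orchestrator:
    """Hypothetical coordinator that activates agents on demand."""
    def __init__(self, agents):
        self.agents = agents

    def select(self, domain: str, k: int = 1):
        # Dynamic activation: rank agents by self-reported proficiency
        # in the task's domain and activate only the top-k.
        ranked = sorted(self.agents,
                        key=lambda a: a.proficiency.get(domain, 0.0),
                        reverse=True)
        return ranked[:k]

    def run(self, task: str, domain: str) -> str:
        # Route the task to the single most proficient agent.
        return self.select(domain)[0].solve(task)

agents = [
    ToolAgent("math-specialist", {"math": 0.9, "code": 0.4}),
    ToolAgent("code-specialist", {"math": 0.5, "code": 0.85}),
]
orch = Orchestrator(agents)
print(orch.run("solve an AIME-style problem", "math"))
```

In the paper's full framework the coordinator itself is chosen via an orchestrator calibration scheme and agents may collaborate rather than act alone; this sketch shows only the proficiency-based routing idea.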
Problem

Research questions and friction points this paper is trying to address.

Multi-Agent Systems
heterogeneous agents
model coordination
post-training skills
tool calling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Team-of-Thoughts
heterogeneous multi-agent systems
orchestrator-tool paradigm
self-assessment protocol
dynamic agent selection
πŸ”Ž Similar Papers
No similar papers found.