🤖 AI Summary
This work addresses the lack of mechanisms in existing intelligent healthcare systems for dynamically selecting the optimal specialized model across diverse clinical tasks. To this end, we propose ToolSelect, a query-aware model selector based on attentive neural processes that adaptively chooses the most suitable model from a heterogeneous pool of expert tools by modeling behavioral summaries of each specialist. We introduce the first agent-oriented chest X-ray evaluation environment and a new benchmark, ToolSelectBench, comprising 1,448 queries. Furthermore, we design a proxy optimization framework built on a consistent surrogate of the task-conditional selection loss. Experiments across four clinical task families show that ToolSelect consistently outperforms ten state-of-the-art methods.
📝 Abstract
Task-specialized models form the backbone of agentic healthcare systems, enabling agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet, for a given task, a single "best" model rarely exists. In practice, each task is better served by multiple competing specialist models, with different models excelling on different data samples. As a result, for any given query, agents must reliably select the right specialist model from a heterogeneous pool of tool candidates. To this end, we introduce ToolSelect, which adaptively learns model selection for tools by minimizing a population risk over sampled specialist tool candidates using a consistent surrogate of the task-conditional selection loss. Concretely, we propose an Attentive Neural Process-based selector conditioned on the query and per-model behavioral summaries to choose among the specialist models. Motivated by the absence of any established testbed, we introduce the first agentic Chest X-ray environment, equipped with a diverse suite of task-specialized models (17 disease detection, 19 report generation, 6 visual grounding, and 13 VQA), and develop ToolSelectBench, a benchmark of 1,448 queries. Our results demonstrate that ToolSelect consistently outperforms 10 SOTA methods across four different task families.
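To make the idea of query-conditioned selection over per-model behavioral summaries more concrete, here is a minimal sketch. It is not the ToolSelect implementation: it replaces the trained Attentive Neural Process with plain, untrained dot-product attention over a handful of (query embedding, observed loss) pairs per candidate. The names `predict_loss`, `select_model`, and `summaries`, the two-dimensional embeddings, and the loss values are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_loss(query, context):
    """Estimate one candidate's loss on `query`.

    `context` is that candidate's behavioral summary (an assumption of this
    sketch): a list of (embedding, observed_loss) pairs from past queries.
    The estimate is an attention-weighted average of the observed losses,
    with weights from scaled dot-product similarity to the new query.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, emb)) / math.sqrt(d)
        for emb, _ in context
    ]
    weights = softmax(scores)
    return sum(w * loss for w, (_, loss) in zip(weights, context))


def select_model(query, summaries):
    """Return the candidate with the lowest predicted loss, plus all predictions."""
    predicted = {name: predict_loss(query, ctx) for name, ctx in summaries.items()}
    best = min(predicted, key=predicted.get)
    return best, predicted


# Hypothetical summaries: model_a did well on queries near [1, 0],
# model_b did well on queries near [0, 1].
summaries = {
    "model_a": [([1.0, 0.0], 0.1), ([0.0, 1.0], 0.9)],
    "model_b": [([1.0, 0.0], 0.8), ([0.0, 1.0], 0.2)],
}

best, predicted = select_model([5.0, 0.0], summaries)
print(best, predicted)
```

In this toy setting, a query close to the first embedding direction is routed to `model_a` and a query close to the second to `model_b`, which mirrors the observation that different specialists excel on different samples. A learned selector would instead train the embeddings and attention parameters against a surrogate of the selection loss.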