🤖 AI Summary
There is currently a lack of publicly available benchmarks and systematic evaluation frameworks tailored to Composable AI. Method: We introduce CABENCH, the first dedicated benchmark for Composable AI, comprising 70 real-world complex tasks and 700 multimodal pretrained models. It supports comprehensive evaluation across task decomposition, model collaboration, and end-to-end execution. We design a multimodal model pool and a dual-path automated orchestration mechanism that jointly leverages human reference solutions and large language model-driven task decomposition, model selection, and workflow generation. Contribution/Results: Experiments demonstrate the practical efficacy of Composable AI in realistic scenarios and uncover a critical bottleneck in adaptive execution-path generation. CABENCH establishes a foundation for standardized evaluation and automated pipeline research in Composable AI.
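The orchestration loop described above (decompose a task into sub-tasks, select a model for each from the pool, then execute the resulting workflow) can be sketched minimally as follows. This is an illustrative assumption, not CABENCH's actual API: names such as `SubTask`, `ModelPool`, and `run_pipeline` are hypothetical, and the selection policy here is a trivial placeholder for the benchmark's LLM-driven model selection.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch of a decompose -> select -> execute pipeline;
# names and the selection policy are illustrative, not CABENCH's API.

@dataclass
class SubTask:
    name: str
    modality: str  # e.g. "text", "image", "audio"

@dataclass
class ModelPool:
    # maps a modality to a list of candidate model callables
    models: Dict[str, List[Callable[[object], object]]] = field(default_factory=dict)

    def select(self, sub: SubTask) -> Callable[[object], object]:
        # placeholder policy: pick the first registered model for the modality
        return self.models[sub.modality][0]

def run_pipeline(task_input: object, plan: List[SubTask], pool: ModelPool) -> object:
    """Execute sub-tasks sequentially, feeding each output into the next model."""
    result = task_input
    for sub in plan:
        model = pool.select(sub)
        result = model(result)
    return result

# toy usage: two stand-in "models" chained on a string input
pool = ModelPool(models={
    "text": [lambda x: x.upper()],
    "image": [lambda x: f"caption({x})"],
})
plan = [SubTask("captioning", "image"), SubTask("normalize", "text")]
print(run_pipeline("photo.png", plan, pool))  # -> CAPTION(PHOTO.PNG)
```

In the real benchmark, each callable would be a pretrained multimodal model and the plan would come either from a human reference solution or from LLM-generated decomposition, which is where the adaptive execution-path bottleneck noted above arises.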
📝 Abstract
Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task with a readily available, well-trained model. However, systematic evaluation of methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models spanning multiple modalities and domains. We also propose an evaluation framework that enables end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.