🤖 AI Summary
There is currently a lack of publicly available benchmarks and systematic evaluation frameworks tailored to Composable AI. Method: We introduce CABENCH, the first dedicated benchmark for Composable AI, comprising 70 real-world complex tasks and 700 multimodal pretrained models. It supports comprehensive evaluation across task decomposition, model collaboration, and end-to-end execution. We design a multimodal model pool and a dual-path automated orchestration mechanism that jointly leverages human reference solutions and large language model-driven task decomposition, model selection, and workflow generation. Contribution/Results: Experiments demonstrate the practical efficacy of Composable AI in realistic scenarios and uncover a critical bottleneck in adaptive execution-path generation. CABENCH establishes a foundation for standardized evaluation and automated pipeline research in Composable AI.
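The orchestration loop described above (decompose a task into sub-tasks, select a model for each from the pool, then execute the resulting workflow) can be sketched minimally as follows. This is an illustrative assumption, not CABENCH's actual API: names such as `SubTask`, `ModelPool`, and `run_pipeline` are hypothetical, and the selection policy here is a trivial placeholder for the benchmark's LLM-driven model selection.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch of a decompose -> select -> execute pipeline;
# names and the selection policy are illustrative, not CABENCH's API.

@dataclass
class SubTask:
    name: str
    modality: str  # e.g. "text", "image", "audio"

@dataclass
class ModelPool:
    # maps a modality to a list of candidate model callables
    models: Dict[str, List[Callable[[object], object]]] = field(default_factory=dict)

    def select(self, sub: SubTask) -> Callable[[object], object]:
        # placeholder policy: pick the first registered model for the modality
        return self.models[sub.modality][0]

def run_pipeline(task_input: object, plan: List[SubTask], pool: ModelPool) -> object:
    """Execute sub-tasks sequentially, feeding each output into the next model."""
    result = task_input
    for sub in plan:
        model = pool.select(sub)
        result = model(result)
    return result

# toy usage: two stand-in "models" chained on a string input
pool = ModelPool(models={
    "text": [lambda x: x.upper()],
    "image": [lambda x: f"caption({x})"],
})
plan = [SubTask("captioning", "image"), SubTask("normalize", "text")]
print(run_pipeline("photo.png", plan, pool))  # -> CAPTION(PHOTO.PNG)
```

In the real benchmark, each callable would be a pretrained multimodal model and the plan would come either from a human reference solution or from LLM-generated decomposition, which is where the adaptive execution-path bottleneck noted above arises.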
📝 Abstract
Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task with a readily available, well-trained model. However, systematic evaluation of methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models spanning multiple modalities and domains. We also propose an evaluation framework that enables end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.