CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models

πŸ“… 2025-08-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Problem: There is currently a lack of publicly available benchmarks and systematic evaluation frameworks tailored to Composable AI. Method: We introduce CABENCH, the first dedicated benchmark for Composable AI, comprising 70 real-world complex tasks and a pool of 700 multimodal pretrained models; it supports comprehensive evaluation across task decomposition, model collaboration, and end-to-end execution. We design a multimodal model pool and a dual-path automated orchestration mechanism that jointly leverages human reference solutions and LLM-driven task decomposition, model selection, and workflow generation. Contribution/Results: Experiments demonstrate the practical efficacy of Composable AI in realistic scenarios and uncover a critical bottleneck in adaptive execution-path generation. CABENCH establishes a foundation for standardized evaluation and automated pipeline research in Composable AI.

πŸ“ Abstract
Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task using ready-to-use, well-trained models. However, systematically evaluating methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models across multiple modalities and domains. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.
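The composition pattern described above can be made concrete with a short sketch: a complex task is decomposed into sub-tasks, each sub-task is matched to a ready-to-use model from a pool, and the models are chained so each output feeds the next input. This is a minimal illustration under stated assumptions; the names (`SubTask`, `ModelPool`, `run_pipeline`) and the toy models are hypothetical, not the benchmark's actual API.

```python
# Minimal sketch of a composable-AI pipeline: decompose, select, chain.
# All names here are illustrative assumptions, not CABENCH's real interface.
from dataclasses import dataclass
from typing import Any, Callable

ModelFn = Callable[[Any], Any]  # a ready-to-use model, input -> output

@dataclass
class SubTask:
    name: str
    modality: str  # signature such as "image->text" or "text->text"

class ModelPool:
    """Registry mapping a modality signature to a pretrained model."""

    def __init__(self) -> None:
        self._models: dict[str, ModelFn] = {}

    def register(self, modality: str, fn: ModelFn) -> None:
        self._models[modality] = fn

    def select(self, task: SubTask) -> ModelFn:
        # Model selection: handled by a human reference solution or an LLM
        # in the paper's baselines; reduced to a plain lookup in this sketch.
        return self._models[task.modality]

def run_pipeline(task_input: Any, plan: list[SubTask], pool: ModelPool) -> Any:
    """Execute sub-tasks in order, feeding each output into the next model."""
    x = task_input
    for sub in plan:
        x = pool.select(sub)(x)
    return x

if __name__ == "__main__":
    pool = ModelPool()
    # Toy stand-ins for pretrained models (captioning, then translation).
    pool.register("image->text", lambda img: f"caption of {img}")
    pool.register("text->text", lambda txt: f"translated({txt})")

    plan = [SubTask("caption", "image->text"), SubTask("translate", "text->text")]
    print(run_pipeline("photo.jpg", plan, pool))  # translated(caption of photo.jpg)
```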
Problem

Research questions and friction points this paper is trying to address.

Lack of systematic evaluation for composable AI methods
Need for benchmark with diverse realistic composable tasks
Requirement for automated pipeline generation in composable AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CABENCH benchmark for composable AI
Proposes an evaluation framework for end-to-end assessment of composable AI solutions (see the sketch after this list)
Compares human-designed reference solutions against two LLM-based approaches
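As a companion to the pipeline sketch above, the following hedged snippet shows what end-to-end assessment could look like: run a full solution on each benchmark task and average a task-level score against reference outputs. `BenchTask`, `exact_match`, and `evaluate` are hypothetical names for exposition; CABENCH's actual metrics are task-specific and not reproduced here.

```python
# Hypothetical end-to-end evaluation harness for composable-AI solutions.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class BenchTask:
    name: str
    task_input: Any
    reference: Any  # expected final output

def exact_match(prediction: Any, reference: Any) -> float:
    # Placeholder metric; a real harness would plug in task-specific
    # metrics (e.g. BLEU for translation, accuracy for classification).
    return 1.0 if prediction == reference else 0.0

def evaluate(tasks: list[BenchTask], solve: Callable[[Any], Any]) -> float:
    """Average end-to-end score of `solve` (a complete pipeline) over tasks."""
    scores = [exact_match(solve(t.task_input), t.reference) for t in tasks]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    tasks = [BenchTask("toy-uppercase", "hello", "HELLO")]
    print(evaluate(tasks, solve=str.upper))  # 1.0
```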
Tung-Thuy Pham
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Duy-Quan Luong
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Minh-Quan Duong
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Trung-Hieu Nguyen
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Thu-Trang Nguyen
VNU University of Engineering and Technology
Automated Software Engineering Β· Program Analysis Β· Code Generation Β· AI
Son Nguyen
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Hieu Dinh Vo
VNU
Software architecture Β· Program analysis