MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

📅 2026-05-27
🤖 AI Summary
Existing text-to-CAD methods are largely confined to single-part generation and are not evaluated on functionality, manufacturability, or assemblability, the attributes that matter most in industrial design. The MUSE benchmark addresses this gap by targeting complex, editable B-Rep assemblies and scoring them with a multidimensional evaluation framework oriented toward engineering practicality. Evaluation follows a three-stage protocol (code check, geometric check, and design-intent alignment) built on structured Design Specifications and design-specific rubrics, with scoring automated by a vision-language model (VLM) judge whose reliability is validated against human annotation. Experiments on closed-source and open-source LLMs show failures compounding from executable code to valid geometry to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria.
📝 Abstract
Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based vision-language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Overall, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.
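The three-stage protocol described above can be read as a gated pipeline in which a sample only reaches the rubric stage after passing the code and geometric checks. The following is a minimal sketch of that reading, not the MUSE implementation: every name here (`evaluate_staged`, `StagedResult`, `pass_rates`, and the three callables) is hypothetical, and the checks and the VLM judge are passed in as plain functions so the sketch stays independent of any CAD kernel or model API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Hypothetical names throughout; this is an illustration, not the MUSE codebase.
AXES = ("functionality", "manufacturability", "assemblability")


@dataclass
class StagedResult:
    failed_at: Optional[str] = None  # "code", "geometry", or None if both gates passed
    rubric: Dict[str, float] = field(default_factory=dict)  # filled only at stage 3


def evaluate_staged(
    sample: Any,
    code_check: Callable[[Any], bool],
    geometry_check: Callable[[Any], bool],
    rubric_judge: Callable[[Any], Dict[str, float]],
) -> StagedResult:
    # Stage 1: does the generated CAD program execute at all?
    if not code_check(sample):
        return StagedResult(failed_at="code")
    # Stage 2: is the resulting geometry valid?
    if not geometry_check(sample):
        return StagedResult(failed_at="geometry")
    # Stage 3: design-intent alignment, scored per rubric axis by a judge
    # (assumed here to return one score per axis).
    scores = rubric_judge(sample)
    return StagedResult(rubric={axis: scores[axis] for axis in AXES})


def pass_rates(results: List[StagedResult]) -> Dict[str, float]:
    # Fraction of samples surviving each gate; the drop from one rate to the
    # next is one way to expose a failure cascade across stages.
    n = len(results)
    passed_code = sum(r.failed_at != "code" for r in results)
    passed_geometry = sum(r.failed_at is None for r in results)
    return {"code": passed_code / n, "geometry": passed_geometry / n}
```

Because later stages are skipped once an earlier gate fails, rubric scores in this sketch are only ever computed for samples with executable code and valid geometry. Whether the benchmark itself gates its stages this strictly is an assumption of the sketch.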
Problem

Research questions and friction points this paper is trying to address.

Text-to-CAD
functionality
manufacturability
assemblability
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-CAD
B-Rep assemblies
manufacturability
assemblability
rubric-based evaluation