🤖 AI Summary
This work addresses a limitation of current large language models: they struggle to generate complex, industrially relevant CAD programs because their training data lacks geometric diversity. Inspired by industrial design workflows, the authors propose a data augmentation approach that conditions generation jointly on a modeling procedure and a reference surface program, enabling large language models to produce parametric CAD programs that incorporate spline-based organic geometric features. The method compensates for the scarcity of organic shapes in existing open-source datasets. The resulting synthetic samples show substantially greater geometric diversity and a higher proportion of organic shape primitives, bringing them closer to industry-grade CAD designs and making the approach a useful tool for training downstream CAD generation models.
📝 Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of code generation tasks. However, generating code for certain domains remains challenging. One such domain is Computer-Aided Design (CAD) programming, where the goal is to produce scripted parametric models that define object geometry for precise design and manufacturing applications. A key challenge in LLM-based CAD program generation is the limited geometric complexity of generated shapes compared to those found in real-world industrial designs. This shortfall is due in part to the lack of diversity in the available CAD program training data. To address this, we propose a novel data augmentation paradigm that prompts an LLM to generate CAD programs conditioned on a reference surface program and a modeling procedure, an idea inspired by practices in industrial design. By varying the reference surface using a collection of organic shapes, our method enriches the geometric distribution of generated CAD models. In particular, it introduces edges and faces defined by spline-based curvature, which are typically missing or underrepresented in existing open-source CAD program datasets. Experiments show that our method produces CAD samples with significantly greater geometric diversity and a higher resemblance to industry-grade CAD designs in terms of the proportion of organic shape primitives. This enhancement makes our CAD data augmentation approach a useful tool for training LLMs and other deep learning models in CAD generation.
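The conditioning idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording, the names `AugmentationPrompt`, `build_prompts`, and `organic_fraction`, and the primitive type labels are all assumptions introduced here. The sketch pairs every reference surface program with every modeling procedure to form prompts, and computes the share of spline-based primitives in a generated model as one plausible reading of "proportion of organic shape primitives".

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AugmentationPrompt:
    """One prompt pairing a reference surface with a modeling procedure."""
    surface_name: str
    procedure_name: str
    text: str


# Hypothetical prompt wording; the abstract does not give the actual prompt.
PROMPT_TEMPLATE = (
    "Write a parametric CAD program.\n"
    "Use the following reference surface program:\n{surface}\n"
    "Follow this modeling procedure:\n{procedure}\n"
)


def build_prompts(
    reference_surfaces: dict[str, str],
    modeling_procedures: dict[str, str],
) -> list[AugmentationPrompt]:
    """Build one prompt per (reference surface, modeling procedure) pair.

    Varying the surface while holding the procedure fixed is what is
    assumed to broaden the geometric distribution of the outputs.
    """
    prompts = []
    for (s_name, s_prog), (p_name, p_text) in product(
        reference_surfaces.items(), modeling_procedures.items()
    ):
        text = PROMPT_TEMPLATE.format(surface=s_prog, procedure=p_text)
        prompts.append(AugmentationPrompt(s_name, p_name, text))
    return prompts


# Assumed labels for spline-based primitives; real CAD kernels use their own names.
ORGANIC_TYPES = frozenset({"bspline_curve", "bspline_surface"})


def organic_fraction(primitive_types: list[str]) -> float:
    """Return the fraction of edges/faces whose type is spline-based."""
    if not primitive_types:
        return 0.0
    organic = sum(1 for p in primitive_types if p in ORGANIC_TYPES)
    return organic / len(primitive_types)
```

Each prompt's `text` would then be sent to an LLM of choice, and `organic_fraction` could be applied to the primitive types extracted from each resulting model to compare a synthetic dataset against a reference set of industrial designs.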