🤖 AI Summary
This work addresses the challenge of generating executable CAD programs from images, where aligning visual geometry with symbolic representations remains difficult and existing methods lack robustness on complex designs. The authors propose GIFT, a novel framework that, for the first time, incorporates geometric feedback from test-time search back into the training phase. GIFT enables self-bootstrapping data augmentation through Soft-Rejection Sampling (GIFT-REJECT) and Failure-Driven Augmentation (GIFT-FAIL), without requiring additional annotations or specialized architectures. Compared to a strong supervised baseline, GIFT improves mean IoU by 12% while reducing inference compute by 80%, achieving performance on par with significantly more complex multimodal systems.
📝 Abstract
Generating executable CAD programs from images requires alignment between visual geometry and symbolic program representations, a capability that current methods fail to learn reliably as design complexity increases. Existing fine-tuning approaches rely on either limited supervised datasets or expensive post-training pipelines, resulting in brittle systems that restrict progress in generative CAD design. We argue that the primary bottleneck lies not in model or algorithmic capacity, but in the scarcity of diverse training examples that align visual geometry with program syntax. This limitation is especially acute because collecting diverse, verified engineering datasets is expensive and difficult to scale. We introduce Geometric Inference Feedback Tuning (GIFT), a data augmentation framework that leverages geometric feedback to turn test-time compute into a bootstrapped set of high-quality training samples. GIFT combines two mechanisms: Soft-Rejection Sampling (GIFT-REJECT), which retains diverse high-fidelity programs beyond exact ground-truth matches, and Failure-Driven Augmentation (GIFT-FAIL), which converts near-miss predictions into synthetic training examples that improve robustness on challenging geometries. By amortizing inference-time search into the model parameters, GIFT captures the benefits of test-time scaling while reducing inference compute by 80%. It improves mean IoU by 12% over a strong supervised baseline and remains competitive with more complex multimodal systems, without requiring additional human annotation or specialized architectures.
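The two mechanisms can be illustrated with a minimal sketch of the geometric filtering loop. This is an assumption-laden illustration, not the paper's implementation: the threshold values, function names, and voxel-set representation of executed geometry are all hypothetical, chosen only to show how IoU-based feedback could split sampled programs into an accepted pool (GIFT-REJECT) and a near-miss pool for augmentation (GIFT-FAIL).

```python
# Hypothetical sketch of GIFT's two augmentation filters. Thresholds,
# names, and the voxel-set geometry representation are illustrative
# assumptions, not values from the paper.

def iou(pred_cells, gt_cells):
    """Intersection-over-union of two occupancy-cell sets."""
    inter = len(pred_cells & gt_cells)
    union = len(pred_cells | gt_cells)
    return inter / union if union else 0.0

def gift_augment(candidates, gt_cells, accept_thresh=0.9, near_miss_thresh=0.6):
    """Split sampled programs into accepted and near-miss pools.

    GIFT-REJECT: keep any candidate whose executed geometry scores above
    `accept_thresh`, even when its program text differs from the ground
    truth (soft rejection, beyond exact matches).
    GIFT-FAIL: candidates in [near_miss_thresh, accept_thresh) are kept
    as near misses to be converted into synthetic training examples.
    """
    accepted, near_misses = [], []
    for program, cells in candidates:
        score = iou(cells, gt_cells)
        if score >= accept_thresh:
            accepted.append((program, score))
        elif score >= near_miss_thresh:
            near_misses.append((program, score))
    return accepted, near_misses

# Toy geometries: sets of occupied voxel coordinates.
gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
candidates = [
    ("prog_exact", {(0, 0), (0, 1), (1, 0), (1, 1)}),  # IoU 1.0 -> accepted
    ("prog_close", {(0, 0), (0, 1), (1, 0)}),          # IoU 0.75 -> near miss
    ("prog_wrong", {(5, 5)}),                          # IoU 0.0 -> discarded
]
accepted, near_misses = gift_augment(candidates, gt)
```

In this toy run, `prog_exact` lands in the accepted pool, `prog_close` becomes a near-miss candidate for failure-driven augmentation, and `prog_wrong` is discarded; both retained pools would then be folded back into the training set, amortizing the test-time search into the model parameters.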