Text-to-CAD Retrieval: a Strong Baseline

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing CAD model retrieval systems: they rely on filenames or directory structures and cannot interpret semantic queries. The paper formally defines text-to-CAD retrieval as a new cross-modal retrieval task and introduces a unified multimodal embedding framework. A sequence encoder captures procedural modeling sequences, a point encoder extracts geometric features from point clouds, and a text encoder represents the query. During training, a feature decoder reconstructs masked sequence features via cross-attention with text and point features, which encourages implicit alignment across modalities. At inference time, this auxiliary decoder is removed, and retrieval uses concatenated sequence-point features. Using paired text-CAD annotations from the Text2CAD dataset, the work establishes a practical benchmark for the task, positions the framework as a strong baseline, and lays a foundation for downstream applications such as retrieval-augmented CAD generation.
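The masked feature reconstruction mentioned in the summary can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the function names, the single-head attention without learned projections, the additive positional embeddings, and the mean-squared-error loss are all hypothetical choices. The sketch only shows the general idea that a masked sequence position is reconstructed by attending over text and point features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, context):
    """Single-head scaled dot-product attention of one query over context vectors.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, ctx)) / scale for ctx in context]
    weights = softmax(scores)
    dim = len(context[0])
    return [sum(w * ctx[d] for w, ctx in zip(weights, context)) for d in range(dim)]


def masked_reconstruction_loss(seq_feats, text_feats, point_feats,
                               masked_idx, mask_token, pos_embeds):
    """Mean squared error between reconstructed and original masked sequence features.

    Hypothetical setup: each masked position is queried with mask_token plus its
    positional embedding, and attends over the text and point features only.
    """
    context = text_feats + point_feats  # keys/values come from the other modalities
    total = 0.0
    for i in masked_idx:
        query = [m + p for m, p in zip(mask_token, pos_embeds[i])]
        recon = cross_attend(query, context)
        target = seq_feats[i]
        total += sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)
    return total / len(masked_idx)


# Toy features with dimension 2 (illustrative values only).
seq_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0]]
point_feats = [[1.0, 0.0]]
pos_embeds = [[0.0, 0.0], [0.1, 0.1]]
mask_token = [0.5, 0.5]

loss = masked_reconstruction_loss(seq_feats, text_feats, point_feats,
                                  masked_idx=[0], mask_token=mask_token,
                                  pos_embeds=pos_embeds)
```

Because the loss can only be lowered by pulling useful information out of the text and point features, minimizing it pushes the three modalities toward a shared representation, which is one plausible reading of the "implicit alignment" described above.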

📝 Abstract

Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queries. Leveraging paired text-CAD annotations from the Text2CAD dataset, we establish a practical benchmark for this task. To achieve text-based retrieval, we propose a unified framework that learns multi-modal CAD embeddings from both procedural sequences and geometric point clouds. Specifically, a sequence encoder captures the construction logic of CAD models, while a point encoder extracts explicit geometric features. A text encoder is used to learn semantic representations of textual queries. During training, we introduce a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit multi-modal alignment. At inference time, we remove this auxiliary decoder to enable efficient retrieval using concatenated sequence-point features. Our framework serves as a strong baseline for text-to-CAD retrieval and lays the foundation for downstream CAD generation paradigms, such as retrieval-augmented generation. The source code will be released.
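The inference step described in the abstract, retrieval with concatenated sequence-point features, can be sketched as a nearest-neighbor search. This is a minimal sketch under stated assumptions: the abstract does not specify the similarity measure, so cosine similarity is an assumption here, as are the function names (`cad_embedding`, `retrieve`), the requirement that the text embedding has the same dimension as the concatenated CAD embedding, and the toy feature values.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cad_embedding(seq_feat, point_feat):
    """Concatenate sequence and point features into one normalized CAD embedding."""
    return l2_normalize(list(seq_feat) + list(point_feat))


def retrieve(text_emb, cad_index, top_k=5):
    """Rank CAD models by cosine similarity to the text query embedding.

    cad_index maps a model id to its precomputed, normalized embedding.
    Returns a list of (model_id, score) pairs, best match first.
    """
    q = l2_normalize(text_emb)
    scored = [
        (sum(a * b for a, b in zip(q, emb)), cad_id)
        for cad_id, emb in cad_index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(cad_id, score) for score, cad_id in scored[:top_k]]


# Toy index: in practice the features would come from the trained encoders.
index = {
    "bracket": cad_embedding([1.0, 0.0], [1.0, 0.0]),
    "gear": cad_embedding([0.0, 1.0], [0.0, 1.0]),
    "flange": cad_embedding([1.0, 0.0], [0.0, 1.0]),
}
query = [0.9, 0.1, 0.8, 0.2]  # stand-in for a text encoder output
results = retrieve(query, index, top_k=2)
```

Since the CAD embeddings depend only on the CAD models, they can be computed once and stored, so answering a query costs one text-encoder pass plus a similarity search. For large databases, the linear scan in `retrieve` would typically be replaced by an approximate nearest-neighbor index.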

Problem

Research questions and friction points this paper is trying to address.

text-to-CAD retrieval

cross-modal retrieval

CAD models

natural language queries

design reuse

Innovation

Methods, ideas, or system contributions that make the work stand out.

text-to-CAD retrieval

multi-modal embedding

procedural CAD sequences