Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image generation models are fragile under compositional structural constraints such as object count, attribute binding, and part relationships. This work proposes SoT, a visual chain-of-thought framework in which a unified multimodal autoregressive model alternately generates textual planning steps and intermediate rendered states, enabling progressive shape assembly without external engines. SoT introduces a transparent, process-supervised generation paradigm that captures assembly logic without relying on explicit geometric representations. To support this approach, the authors construct the SoT-26K dataset from CAD part hierarchies and introduce the T2S-CompBench evaluation benchmark, incorporating 2D projection consistency constraints. Experiments show that the fine-tuned model achieves 88.4% accuracy in component count and 84.8% in structural topology, an improvement of approximately 20% over text-only baselines.

📝 Abstract
Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual chain-of-thought (CoT) framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/. The SoT-26K dataset will be released upon acceptance.
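
As a rough illustration of what an interleaved assembly trace might look like, the sketch below models a trace as alternating textual plans and references to rendered intermediate states, plus a simple component-count check in the spirit of the numeracy metric. It is not the authors' implementation or the SoT-26K schema: the class names, the fields (`plan`, `added_parts`, `image_ref`), and the `numeracy_correct` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AssemblyStep:
    """One step of a hypothetical assembly trace (illustrative fields only)."""
    plan: str               # textual planning step
    added_parts: List[str]  # parts assumed to be added in this step
    image_ref: str          # placeholder for the rendered intermediate state


@dataclass
class AssemblyTrace:
    """A prompt plus an ordered list of assembly steps."""
    prompt: str
    steps: List[AssemblyStep] = field(default_factory=list)

    def interleaved(self) -> List[Tuple[str, str]]:
        """Flatten the trace into alternating (modality, content) pairs."""
        sequence: List[Tuple[str, str]] = []
        for step in self.steps:
            sequence.append(("text", step.plan))
            sequence.append(("image", step.image_ref))
        return sequence

    def component_count(self) -> int:
        """Total number of parts added across all steps."""
        return sum(len(step.added_parts) for step in self.steps)


def numeracy_correct(trace: AssemblyTrace, expected_count: int) -> bool:
    """Hypothetical check: does the trace assemble the expected number of parts?"""
    return trace.component_count() == expected_count


# Example: a two-step trace for a simple chair-like object.
example = AssemblyTrace(prompt="a stool with four legs")
example.steps.append(AssemblyStep("Place the seat.", ["seat"], "state_1.png"))
example.steps.append(AssemblyStep("Attach four legs.", ["leg"] * 4, "state_2.png"))
print(example.component_count())     # 5
print(numeracy_correct(example, 5))  # True
```

Keeping the parts added at each step explicit is one way a trace could be checked step by step rather than only at the final image, which is the kind of process-level supervision the abstract describes.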
Problem

Research questions and friction points this paper is trying to address.

compositional generation
generative numeracy
attribute binding
part-level relations
structural constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual chain-of-thought
progressive object assembly
multimodal autoregressive model
compositional generation
shape assembly logic
Authors
Yu Huo
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; The Shenzhen Institute of Artificial Intelligence and Robotics for Society; Guangdong Provincial Key Laboratory of Future Networks of Intelligence
Siyu Zhang
4DV.ai
Computer Vision
Kun Zeng
Dongfang Electric Corporation Dongfang Boiler Co., Ltd.
magnetic domain, NDE, magnetism, magnetic microstructure, boiler
Haoyue Liu
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Computer Vision, Event Camera
Owen Lee
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Junlin Chen
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen
Yuquan Lu
Guangdong Provincial Key Laboratory of Future Networks of Intelligence
Yifu Guo
Sun Yat-sen University
Yaodong Liang
The Hong Kong University of Science and Technology, Guangzhou
Xiaoying Tang
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; The Shenzhen Institute of Artificial Intelligence and Robotics for Society; Guangdong Provincial Key Laboratory of Future Networks of Intelligence