🤖 AI Summary
Existing flow-matching approaches for 3D shape assembly lack explicit modeling of inter-part interaction relationships. This work proposes a topology-aware representation alignment framework that, for the first time, integrates topological structure alignment into 3D assembly by distilling relational structure from a frozen pre-trained 3D encoder. The method enhances spatial reasoning through geometric descriptors and similarity alignment, combining cosine-similarity-based token alignment with a Centered Kernel Alignment (CKA) loss. Experiments reveal that geometric and contact characteristics, rather than semantic labels, predominantly govern alignment efficacy. The proposed approach achieves state-of-the-art performance across five benchmarks, accelerates convergence by up to 6.9×, adds no inference overhead, and significantly improves both in-distribution accuracy and cross-domain zero-shot transfer.
📝 Abstract
Flow-matching methods for 3D shape assembly learn point-wise velocity fields that transport parts toward assembled configurations, yet they receive no explicit guidance about which cross-part interactions should drive the motion. We introduce TORA, a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. We first realize this via a simple instantiation, token-wise cosine matching, which injects the teacher's learned geometric descriptors into the student features. We then extend it with a Centered Kernel Alignment (CKA) loss that matches the similarity structure between student and teacher representations for stronger topological alignment. Through systematic probing of diverse 3D encoders, we show that geometry- and contact-centric teacher properties, not semantic classification ability, govern alignment effectiveness, and that alignment is most beneficial at later transformer layers, where spatial structure naturally emerges. TORA introduces zero inference overhead while yielding two consistent benefits: faster convergence (up to 6.9$\times$) and improved in-distribution accuracy, together with greater robustness under domain shift. Experiments on five benchmarks spanning geometric, semantic, and inter-object assembly demonstrate state-of-the-art performance, with particularly pronounced gains in zero-shot transfer to unseen real-world and synthetic datasets. Project page: https://nahyuklee.github.io/tora.
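The two alignment objectives named in the abstract can be sketched concretely. Below is a minimal, hedged illustration (not the paper's actual implementation) of token-wise cosine matching and linear CKA between student and teacher token features, assuming features arrive as `(num_tokens, dim)` numpy arrays; function names and shapes are illustrative assumptions.

```python
import numpy as np

def token_cosine_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Token-wise cosine matching: 1 - mean cosine similarity over tokens.

    Illustrative sketch; assumes student/teacher are (num_tokens, dim)
    and already projected to a common dimension.
    """
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(1.0 - np.mean(np.sum(s * t, axis=-1)))

def linear_cka(student: np.ndarray, teacher: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two feature sets.

    Compares the *similarity structure* (token-token relations) of the two
    representations rather than matching tokens directly; returns a value
    in [0, 1], where 1 means identical relational structure.
    """
    # Center each feature dimension over tokens.
    X = student - student.mean(axis=0, keepdims=True)
    Y = teacher - teacher.mean(axis=0, keepdims=True)
    # HSIC-style normalized similarity of the two Gram structures.
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    self_x = np.linalg.norm(X.T @ X, "fro")
    self_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (self_x * self_y))

# A CKA-based training loss would then minimize 1 - linear_cka(...),
# pushing the student's token-similarity matrix toward the teacher's.
```

Note the design difference: cosine matching constrains each token individually, while CKA only constrains the pairwise-similarity pattern across tokens, which is one way to read the paper's "topological" framing of the alignment.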