Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the ambiguous semantic boundaries that point-cloud sparsity, noise, and geometric ambiguity cause in 3D affordance segmentation, this paper proposes the first 3D affordance segmentation framework to leverage semantic priors from 2D vision foundation models. Methodologically, it introduces (1) a semantic-grounded learning paradigm that aligns 2D semantic knowledge with the 3D geometric space; (2) Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that jointly optimizes point-cloud reconstruction, cross-modal affinity, and feature diversity; and (3) the Cross-modal Affordance Segmentation Transformer (CAST), an architecture that adaptively fuses multi-modal prompts with 3D features. Evaluated on multiple standard benchmarks, the framework achieves state-of-the-art performance, significantly improving the semantic consistency and boundary precision of segmentation results and demonstrating effective transfer of 2D semantic knowledge to 3D functional understanding.

📝 Abstract
Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
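The abstract describes CMAT as jointly optimizing reconstruction, cross-modal affinity, and diversity, but does not spell out the loss forms. The sketch below is a minimal, hypothetical illustration of how such a three-term objective could be assembled; the specific affinity-matching and diversity terms, the weights `w_aff`/`w_div`, and all function names are assumptions, not the authors' implementation:

```python
import numpy as np

def cosine_affinity(feats):
    """Pairwise cosine-similarity matrix over row feature vectors."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T

def cmat_loss(feat3d, feat2d, recon, points, w_aff=1.0, w_div=0.1):
    """Hypothetical CMAT-style joint objective (illustrative only).

    feat3d: per-point features from the 3D encoder, shape (N, d)
    feat2d: lifted 2D VFM features for the same points, shape (N, d)
    recon / points: decoder output and input coordinates, shape (N, 3)
    """
    # Reconstruction term: decoder output should match the input points.
    l_rec = np.mean((recon - points) ** 2)
    # Cross-modal affinity term: the 3D pairwise similarity structure
    # should mirror the structure of the lifted 2D semantics.
    l_aff = np.mean((cosine_affinity(feat3d) - cosine_affinity(feat2d)) ** 2)
    # Diversity term: penalize collapsed features via the mean
    # off-diagonal similarity among 3D features.
    aff3d = cosine_affinity(feat3d)
    n = aff3d.shape[0]
    l_div = (aff3d.sum() - np.trace(aff3d)) / (n * (n - 1))
    return l_rec + w_aff * l_aff + w_div * l_div
```

Under this toy formulation, perfectly aligned modalities and exact reconstruction drive the first two terms to zero, while the diversity term keeps the 3D features from collapsing to a single direction.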
Problem

Research questions and friction points this paper is trying to address.

Addressing the limitations of sparse, noisy 3D data in affordance segmentation
Transferring 2D semantic knowledge to sharpen 3D functional boundaries
Integrating multi-modal prompts for precise affordance segmentation maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transfer 2D semantic knowledge to 3D domain
Align 3D encoder with 2D semantics via CMAT
Integrate multi-modal prompts with pretrained features
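The last bullet describes CAST fusing multi-modal prompts with the CMAT-pretrained 3D features. As a purely illustrative sketch (not the paper's architecture), one common way to realize such fusion is a cross-attention step in which point features attend to prompt tokens; the function names, the scaled dot-product form, and the residual-fusion choice below are all assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_cross_attention(point_feats, prompt_feats):
    """Hypothetical fusion step: points attend to multi-modal prompt tokens.

    point_feats: per-point features, shape (N, d)
    prompt_feats: encoded visual/textual prompt tokens, shape (M, d)
    Returns prompt-aware point features, shape (N, d).
    """
    d = point_feats.shape[1]
    # Scaled dot-product attention weights: each point distributes
    # attention over the M prompt tokens.
    attn = softmax(point_feats @ prompt_feats.T / np.sqrt(d), axis=1)
    # Residual fusion: inject attended prompt information into each point.
    return point_feats + attn @ prompt_feats
```

A segmentation head applied to the fused features would then produce the prompt-aware affordance maps the summary refers to.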
Yu Huang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiaotong University

Zelin Peng
Shanghai Jiao Tong University
Computer Vision · Medical Image Processing

Changsong Wen
Shanghai Jiao Tong University
Computer Vision

Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiaotong University

Wei Shen
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiaotong University